Jan 22 2012

Data Deduplication: The Case for Post-Process Deduping

There are real, distinct performance advantages to post-process deduplication with a grid architecture.

When choosing a disk-based backup deduplication solution, how do you evaluate the tradeoffs between post-process deduplication with a grid architecture and inline deduplication with a fixed controller?

Marc Crespi, vice president of product management for ExaGrid Systems, sees the benefits of grid deduplication in three key areas:

Highest Performance for Shortest Backup Window

Post-process deduplication in a grid with full servers offers the fastest backups because the system deduplicates data after it has landed on disk and because each full server brings its own CPU, memory, disk and Gigabit Ethernet. Post-process also enables the fastest restores because the disk backup system keeps a full copy of the most recent backup available in high-speed cache for immediate recovery. In contrast, with inline deduplication, the disk backup system performs the dedupe process before data is fully protected on disk, and for a full system restore the data must first be “rehydrated.”
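To make the post-process flow concrete, here is a minimal Python sketch. It is a toy model built on assumptions, not ExaGrid's implementation: the class name PostProcessStore, the fixed 4-byte chunk size, and the use of SHA-256 chunk hashes are all hypothetical choices made for illustration. It shows the ordering described above: the backup lands in full first, deduplication runs afterward, and a restore of the most recent backup is served from the full copy without rehydration.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny on purpose so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (an assumption; real systems vary)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class PostProcessStore:
    """Toy post-process dedupe store: land data first, deduplicate later."""

    def __init__(self) -> None:
        self.landing_zone: dict[str, bytes] = {}   # full copies of recent backups
        self.chunk_store: dict[str, bytes] = {}    # digest -> unique chunk
        self.recipes: dict[str, list[str]] = {}    # backup name -> ordered digests

    def ingest(self, name: str, data: bytes) -> None:
        # The backup job finishes as soon as the raw data is on disk.
        self.landing_zone[name] = data

    def deduplicate(self, name: str, keep_full_copy: bool = True) -> None:
        # Runs after ingest, outside the backup window.
        recipe = []
        for chunk in split_chunks(self.landing_zone[name]):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunk_store.setdefault(digest, chunk)  # store each chunk once
            recipe.append(digest)
        self.recipes[name] = recipe
        if not keep_full_copy:
            del self.landing_zone[name]

    def restore(self, name: str) -> bytes:
        if name in self.landing_zone:
            return self.landing_zone[name]  # full copy: no rehydration needed
        # Older backups are reassembled ("rehydrated") from unique chunks.
        return b"".join(self.chunk_store[d] for d in self.recipes[name])
```

In this sketch an inline design would correspond to running the hashing loop inside ingest, before the write completes, which is the step the post-process approach moves out of the backup window.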

Performance Maintained as Data Grows

Grid architecture solutions maintain high performance as the disk backup system scales because you add full appliances including processor power, memory, bandwidth and disk matched to the amount of backup data. When the system needs to expand, additional full appliance nodes are attached to the grid, thereby maintaining all aspects of performance as data grows. With the inline (controller/disk shelf) model, all of the processing power, memory, and bandwidth are contained in the controller, so when data increases and IT staff expands the system by adding only disk shelves, backup performance degrades.
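The scaling argument can be illustrated with a deliberately idealized model. The sketch below assumes that ingest throughput grows linearly with the number of grid appliances and that a single controller's throughput stays fixed no matter how many disk shelves are attached. The per-node capacity (10 TB) and ingest rate (1 TB per hour) are made-up figures, not vendor specifications, and the function names are hypothetical.

```python
import math


def backup_window_hours(data_tb: float, ingest_tb_per_hour: float) -> float:
    """Hours needed to ingest one full backup at the given aggregate rate."""
    return data_tb / ingest_tb_per_hour


def grid_window(data_tb: float,
                node_capacity_tb: float = 10.0,
                node_ingest_tb_per_hour: float = 1.0) -> float:
    """Grid model: enough appliances are added to hold the data, and each
    appliance contributes its own ingest throughput (idealized assumption)."""
    nodes = max(1, math.ceil(data_tb / node_capacity_tb))
    return backup_window_hours(data_tb, nodes * node_ingest_tb_per_hour)


def controller_window(data_tb: float,
                      controller_ingest_tb_per_hour: float = 1.0) -> float:
    """Controller/disk-shelf model: shelves add capacity only, so the
    ingest rate stays fixed at the controller's throughput."""
    return backup_window_hours(data_tb, controller_ingest_tb_per_hour)


if __name__ == "__main__":
    for data_tb in (10, 20, 40):
        print(data_tb, grid_window(data_tb), controller_window(data_tb))
```

Under these assumptions the grid's backup window stays flat as data quadruples while the controller's window grows in proportion to the data. Real systems will deviate from this because of network limits, deduplication overhead, and uneven load.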

Control Costs at Scale

Disk backup systems with deduplication that are based on a grid architecture are the most cost-effective to scale because, as data grows, full servers can be seamlessly added to the grid in modular increments as needed without replacing existing nodes. Grid capacity is typically load-balanced automatically, which maintains a virtual pool of storage that is shared across all nodes. This contrasts with the controller/disk shelf model, which adds disk to a fixed-capacity controller as data grows, resulting in longer backup windows. In this scenario, the controller must eventually be replaced via a costly forklift upgrade to the next larger controller.

For more on the benefits of grid-based deduplication, read Crespi’s post on Data Center Knowledge.