Data Center

De-Duplication Is the New Word in Backup Technology

Critical data is growing at an exponential rate, and tapes are no longer the only option for backup. Data de-duplication technology (also called data reduction or commonality factoring) allows users to store more information on fewer physical disks than has been possible in the past, making the cost of disk backup competitive with tape.

“Although the technology is fairly new, de-duplication is becoming widespread,” says Stephanie Balaouras, an analyst at Forrester Research. “Right now, disk space is three to four times as expensive as tape, but de-duplication can reduce data that needs to be backed up by a ratio of 20 to one. The big question is whether this technology is what puts the last nail in the coffin of tape backup.”
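The arithmetic behind that comparison can be sketched in a few lines. This is a rough illustration, not a pricing model: it assumes the tape copy is not reduced at all and ignores hardware, power and media-handling costs. The function name `effective_disk_cost` is made up for this example.

```python
def effective_disk_cost(disk_to_tape_price_ratio, dedup_ratio):
    """Cost of disk per unit of protected data, relative to tape (tape = 1.0).

    Assumption: tape stores the data unreduced, disk stores it after
    de-duplication at the given ratio.
    """
    return disk_to_tape_price_ratio / dedup_ratio


# Figures quoted above: disk at three to four times the price of tape,
# de-duplication at 20 to one.
for price_ratio in (3, 4):
    relative = effective_disk_cost(price_ratio, 20)
    print(f"disk at {price_ratio}x tape price -> {relative:.2f}x tape cost after de-duplication")
```

Under those assumptions, disk ends up at roughly 0.15 to 0.2 times the cost of tape for the same protected data.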

As the name suggests, the goal of de-duplication is to eliminate redundant data from backups. The technology replaces duplicate copies with much smaller pointers to a shared record. This can take place at the level of either whole records or smaller unique data segments.

For example, if someone e-mails a 10-megabyte Excel file to 10 people on a network and each of them stores it, that translates into 100MB of backup disk space without de-duplication. With whole-record de-duplication, one copy would be stored along with 10 reference pointers. If, however, one of the users changes the name of the file or alters the contents in even the slightest way, the entire copy will be backed up. Using sub-record level de-duplication, only the changes to the altered file would be saved, with pointers to the original. Both de-duplication methods are usually used in conjunction with the traditional compression algorithms — standard backup tactics that reduce the space consumed on the backup disk.
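The Excel example can be reproduced with a minimal sketch. It makes several assumptions that no particular product is known to share: records are identified by a SHA-256 hash of their contents, sub-record segments are fixed 1MB chunks (commercial systems often use variable-size segments), and pointers cost no space. Because it keys on content only, it does not model a rename triggering a new backup. All function and variable names are hypothetical.

```python
import hashlib

MB = 1024 * 1024


def whole_record_store(files):
    """Keep one copy of each distinct file, plus a pointer per file name."""
    store, pointers = {}, {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)   # stored only the first time it is seen
        pointers[name] = digest
    return store, pointers


def sub_record_store(files, chunk_size=MB):
    """Keep one copy of each distinct fixed-size chunk, plus pointer lists."""
    store, pointers = {}, {}
    for name, data in files.items():
        digests = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            digests.append(digest)
        pointers[name] = digests
    return store, pointers


def stored_bytes(store):
    """Total size of the unique data held in a store."""
    return sum(len(blob) for blob in store.values())


# A 10MB file made of ten different 1MB blocks, so that chunks within
# one file do not de-duplicate against each other.
original = b"".join(bytes([i]) * MB for i in range(10))
edited = original[:-1] + b"\xff"          # one user changes a single byte

files = {f"user{n}.xls": original for n in range(1, 10)}
files["user10.xls"] = edited

raw_total = sum(len(data) for data in files.values())
whole_store, _ = whole_record_store(files)
sub_store, _ = sub_record_store(files)

print("no de-duplication:", raw_total // MB, "MB")
print("whole-record:     ", stored_bytes(whole_store) // MB, "MB")
print("sub-record:       ", stored_bytes(sub_store) // MB, "MB")
```

With nine identical copies and one altered copy, the sketch stores 20MB under whole-record de-duplication (the original plus the full altered file) and 11MB under sub-record de-duplication (the original plus the one changed chunk), compared with 100MB without de-duplication.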

The trend is toward sub-record level de-duplication. A wide range of systems that provide de-duplication is already available, including hardware such as Quantum’s DXi and Cybernetics’ iSCSI SAN and software such as Veritas NetBackup PureDisk. Along with dramatically reducing backup storage space consumption, these technologies cut restore time and eliminate the need to wade through incremental backup tapes. Most systems allow users to restore to a specific date and time, and some make decentralized backups possible.

Proceed With Care

Balaouras warns that while data de-duplication is fast becoming a standard feature in backup systems, the technology is new enough, and there are enough variations among applications, that buyers should proceed with care. A key distinction is whether the data reduction takes place at the source (the backup server) or the target (a virtual tape library or disk appliance). Source-based processing uses much less bandwidth and provides for either local or global backup, but it often requires users to replace their current backup systems or run one system for central office backup and another for remote locations.

Whether de-duplication occurs during the backup or after it is also a serious concern. Data reduction is very CPU-intensive, so performing it during the backup can slow the backup down. Performing de-duplication after the initial backup has been completed, however, requires more disk space and means that the data reduction must finish before the next scheduled backup.
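The trade-off can be put into rough numbers. The sketch below is a simplification with hypothetical figures: it assumes a post-process system must hold the full unreduced backup alongside the reduced store, and that backup and de-duplication run one after the other. The function names `peak_disk_gb` and `fits_window` are illustrative only.

```python
def peak_disk_gb(raw_backup_gb, dedup_ratio, post_process):
    """Rough peak disk space needed for one backup cycle.

    Assumption: post-process de-duplication lands the raw backup on disk
    first, so the raw copy and the reduced copy coexist at the peak.
    """
    reduced = raw_backup_gb / dedup_ratio
    return raw_backup_gb + reduced if post_process else reduced


def fits_window(backup_hours, dedup_hours, interval_hours):
    """True if backup plus post-process de-duplication finish before the next run."""
    return backup_hours + dedup_hours <= interval_hours


# Hypothetical site: 1,000GB nightly backup, 20-to-one reduction.
print("during backup:", peak_disk_gb(1000, 20, post_process=False), "GB")
print("after backup: ", peak_disk_gb(1000, 20, post_process=True), "GB")
print("fits a 24-hour cycle:", fits_window(6, 10, 24))
```

A buyer could plug in measured backup sizes and processing times to see which approach fits the available disk and the backup window.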

Scalability and data integrity, both affected by the number of times most systems run the data through de-duplication and checking algorithms, are also issues users should investigate before they buy, says Balaouras. But de-duplication is here to stay, and it’s accelerating movement toward disk backup, especially among SMBs without large investments in legacy tape systems.

“Tape will be around for a while — for one thing it’s got a better power and cooling profile than disk, and that’s important in today’s data center,” says Balaouras. “But data de-duplication is a reality — it will take some time to sort out the approaches, but it definitely changes the comparison with tape.”

IT Takeaway

To narrow your options, consider the following criteria:

• Location of the de-duplication — backup source or target

• Data integrity

• Scalability

• Maturity of the vendor offering (Some systems have included de-duplication for several years, but in others it’s a new feature.)

Jeff Gross is an IT manager at Tucker Industries in Bensalem, Pa.
