Jun 30 2009

Masters of Disaster

Business continuity means fortifying your IT infrastructure to withstand whatever the day may bring: power outages, natural disasters, system crashes.

Keep the business running. Keep the business running. Keep the business running.

That mantra ultimately represents the “why” behind any and all disaster preparedness measures that businesses take.

Keeping the business running admittedly seems obvious, but it’s hard to know exactly how your network and systems will react in a crisis, says Greg Taffet, CIO for U.S. Gas & Electric (USG&E), which has customers in five states. “While we might be able to get support from a vendor, we will also have to think on our feet and deal with workarounds ourselves,” he says.

Understanding that reality, USG&E’s IT team spent the past six months migrating to a virtualized systems infrastructure that has business continuity as a core capability. “By putting my test boxes on virtual machines and then testing the VMs as the fail-over for production, I found I need fewer total boxes,” Taffet says. “For a lower price and total cost, I can keep the business running.”

Richard Jones, vice president and service director for data center strategies at the Burton Group, says IT managers began to focus more on disaster recovery following hurricanes Katrina and Ike.

“Coupled with the commoditization of resiliency technologies, some of the more progressive companies have started to improve recovery and resiliency,” Jones says.

Exercise Your Options

Jones points to a handful of technologies that have helped companies improve their disaster recovery posture: iSCSI storage area networks; virtualization with live migration; low-cost JBOD (just a bunch of disks) storage arrays; and low-cost, high-bandwidth Internet connectivity. “The problem has been that only the more progressive small businesses have taken advantage of the opportunities. Many just haven’t considered that there are now options that didn’t exist as little as three years ago,” he says.

What are some of those technology options? How can they help your company keep its doors open during a disaster and long after so that you can recount the tale of your tech team’s heroics? Is there a trick to justifying the investment in these tools and calculating total cost of ownership?

The thing to realize and capitalize on, say IT chiefs like Taffet, is that most of these technologies improve a company’s day-to-day operations as well. For instance, by running VMs, USG&E can scale up as it grows. It expects to begin supplying natural gas in a sixth state this year, and to swell its workforce from 100 to 200 employees by the end of next year. Over that same period, Taffet likely will add few if any new hires to his team of 10. “It’s definitely one reason to automate more and not have to increase IT by that much.”

Beyond virtualization, other tools that offer protection from disaster and do double duty for daily operations are thin clients coupled with remote access, backup and archival storage, and uninterruptible power supplies. Here’s a closer look at how to take advantage of these technologies.

Virtualization Vision

The energy supply business requires that USG&E provide data to its trading partners — the producers of the energy resources that it supplies in New York, Michigan, Indiana, New Jersey and Ohio — on a daily basis. There are multiple hard deadlines each day, Taffet says.

Those deadlines made the spending justification for the virtualization infrastructure simple, says Taffet. “This is really a consideration for business continuity, not TCO. We’re past the size where we can say, ‘Let’s go down to a retail store and buy another box.’ ”

The main production environment comprises 13 physical servers that host SQL Server, Microsoft Exchange and file servers. The USG&E data center is an all-HP shop of various rack-mount ProLiant systems. In addition, the company recently set up shop at a colocation facility in Fort Lauderdale, Fla., that will soon house six more HP ProLiant servers to provide redundancy for the headquarters facility in Miami.

Taffet’s team has virtualized all its development servers and a significant number of low-utilization support servers using VMware Server. “We found that we can virtualize multiple production machines into virtual snapshots on a VM and that it actually works,” he says. “Performance is not the same as running one production machine on one piece of physical hardware, but the plan does not call for having all our users tapping all their applications during a disaster.”

Monitoring and testing the VM load is just simple arithmetic, says Taffet. Here’s what to do: Add together the utilization curves of the servers you plan to consolidate, find the combined peak, compare it with the performance curves of the host, and determine whether the result is still within the tolerance range.
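The arithmetic Taffet describes can be sketched in a few lines of Python. This is only an illustration, not USG&E's tooling: the function names, the sample figures and the 80 percent tolerance are all assumptions, and each curve is taken to be one server's load per time slot, expressed in the same units as the host's capacity.

```python
def combined_peak(curves):
    """Add the utilization curves slot by slot; return (peak_load, slot_index)."""
    totals = [sum(slot) for slot in zip(*curves)]
    peak = max(totals)
    return peak, totals.index(peak)


def fits_on_host(curves, host_capacity, tolerance=0.8):
    """True if the combined peak stays within the tolerated share of the host.

    The 0.8 default is an assumed headroom figure, not one from the article.
    """
    peak, _ = combined_peak(curves)
    return peak <= host_capacity * tolerance


# Hypothetical load samples for three servers over four time slots.
sample_curves = [
    [10, 20, 30, 10],
    [5, 40, 20, 5],
    [15, 10, 10, 15],
]

if __name__ == "__main__":
    print(combined_peak(sample_curves))
    print(fits_on_host(sample_curves, host_capacity=100))
```

Summing the peaks of each server separately would overstate the load, because the servers rarely peak in the same slot. Adding the curves first and then taking the maximum is what lets fewer boxes carry the same work.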

USG&E expects the load on its virtualized environment to spike during a disaster. Taffet recommends working with the business side of the house to prioritize application services and then running test scenarios to calibrate acceptable tolerances in processing delays for specific services.

As an example, he pointed to nightly backups. “Now, they are much more sequential, not all starting at 2 a.m. We didn’t need to rearrange the user jobs so much; it was more IT jobs that had to be rearranged.”
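One way to picture the move from simultaneous to sequential backups is a schedule in which each job starts when the previous one ends. The sketch below assumes that approach; the job names, durations and start time are hypothetical, and a real scheduler would also have to handle jobs that overrun.

```python
from datetime import datetime, timedelta


def stagger_jobs(jobs, window_start):
    """Lay out jobs back to back instead of starting them all at once.

    jobs is a list of (name, estimated_minutes) pairs.
    Returns a list of (name, start, end) tuples.
    """
    schedule = []
    start = window_start
    for name, minutes in jobs:
        end = start + timedelta(minutes=minutes)
        schedule.append((name, start, end))
        start = end  # the next job begins when this one finishes
    return schedule


# Hypothetical IT jobs and duration estimates.
nightly_jobs = [("exchange", 45), ("sql", 30), ("files", 60)]

if __name__ == "__main__":
    for name, start, end in stagger_jobs(nightly_jobs, datetime(2009, 6, 30, 2, 0)):
        print(f"{name}: {start:%H:%M} to {end:%H:%M}")
```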

Thin-Client Tactics

For FMSbonds, a brokerage house in Boca Raton, Fla., a Citrix Systems thin-client architecture provides a secure infrastructure for its vast storehouse of financial data and a fortified IT approach for its roving workers in the hurricane-prone region.

“Our branch offices connect utilizing the Citrix Secure Gateway Internet portal,” says FMSbonds Vice President and Chief Technology Officer John DeVine. “The technology is ideal for our business because of the high level of security, ease of deployment and reliability it offers.”

A user of thin clients for 12 years, FMSbonds has put its disaster plan into action more than a half-dozen times. The company runs a real-time, 24x7 replication of data over its network using Double-Take and has a backup facility in Asheville, N.C., that it can fail over to on the fly. This setup lets it restore services in 25 minutes for all of its 130 employees, no matter where they are, if any of its data services go dark for any reason, DeVine says.

“The product we sell is reliability. We sell income. Our clients want to know that income is going to be in their accounts when we promised it would be,” DeVine says.

The need for a remote-access capability as a recovery component became apparent following 1992’s devastating Hurricane Andrew. “That one took the roof off our building in Miami,” DeVine says. After that, FMSbonds set a strategy to provide employees access to their complete desktop and all applications from any location.

Test, prepare and document are the bywords of business continuity, DeVine points out. Keeping that last item current requires knowing what’s going on beyond the company’s IT perimeter and hinges on the soundness of the business continuity plans of every single partner with whom FMSbonds interacts and exchanges data.

To meet that challenge, DeVine and his staff take a proactive approach: “Inside my Outlook calendar and the calendars of the people who work with me, we have a recurring list of contacts that we go through every month. We verify logins, IP addresses, backup virtual private network structures, anything. It’s not too techie, but it works. Plus, if you have to reach out in a crisis, they’re more likely to pick up the phone and talk to you. It humanizes the system a little.”

Backup and Storage Strategy

Keeping the network live during a crisis will prove irrelevant if employees can’t tap the data they need to do their jobs. That’s where backup and storage strategies come into play. For Concerro, a software-as-a-service company in San Diego that provides web-based workforce management software and services to healthcare organizations, the solution has been to back up its systems to a high-availability storage area network.

Its fault-tolerant HP LeftHand storage units use hardware RAID 5 and network RAID 1 to parcel out customer data across multiple disks in multiple storage modules using hot-swappable Serial-Attached SCSI drives (SAS). “We can lose several drives across multiple storage modules or an entire storage module — a 12-disk system — without any loss of data or incurring downtime to clients,” says Rod Longanilla, senior manager for IT. “Our main clients are hospital facilities, and quick data recovery in the event of a disaster is an automatic standard with our service.”
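That resilience has a capacity cost, which a rough calculation makes visible. The sketch below rests on several assumptions: each module is treated as a single RAID 5 set (one disk's worth of parity), network RAID 1 is treated as keeping two copies of the data on different modules, and the module count and disk size are hypothetical. Actual arrays may split disks into several RAID sets and reserve extra space.

```python
def usable_capacity_tb(modules, disks_per_module, disk_tb, mirror_copies=2):
    """Estimate usable space after RAID 5 parity and cross-module mirroring.

    Assumes one RAID 5 set per module and mirror_copies full copies of the
    data across modules; both are simplifications for illustration.
    """
    if disks_per_module < 3:
        raise ValueError("RAID 5 needs at least three disks")
    per_module_tb = (disks_per_module - 1) * disk_tb  # one disk's worth goes to parity
    return modules * per_module_tb / mirror_copies


if __name__ == "__main__":
    # Hypothetical cluster: four 12-disk modules with 1 TB drives.
    print(usable_capacity_tb(modules=4, disks_per_module=12, disk_tb=1))
```

In this hypothetical case, 48 TB of raw disk yields 22 TB of usable space. The difference is the price of being able to lose a drive, or a whole module, without downtime.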

The SAN archives files to a second storage server at a remote location in case of a complete host corruption or site disaster.

Although Concerro custom-built the backup apps it now uses, the company has begun evaluating commercial tools, such as Riverbed Technology optimization programs, to help it make more efficient use of bandwidth and storage space. “We are currently working on better data management to eliminate backing up redundant data,” Longanilla says. “Backing up terabytes of data and archiving it on a remote storage unit takes up considerable bandwidth.”

Backup data verification is a well-known yet often overlooked step to ensure data integrity and recovery, says Longanilla.

“Disaster recovery would quickly turn to a complete disaster if the backup or archived data is corrupted,” he says. IT should establish procedures to verify data following the initial backup, as well as create a periodic procedure to do both manual and automated checks of archived data.
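An automated check can be as simple as comparing checksums of the source files against their backup copies. The following is a minimal sketch under that assumption, not Concerro's procedure: the function names and directory layout are hypothetical, and it presumes the backup is a plain file-for-file copy that can be read directly.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from, or differ in, the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        if not copy.is_file() or sha256_of(src) != sha256_of(copy):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every source file has a matching copy. Run against archived data on a schedule, the same comparison would need checksums recorded at backup time, since the source files will have changed by then.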

Power Plan

No power, no processing. In the end, a business must think through exactly which systems it can’t live without, and then devise a plan for powering those systems for minutes, hours, maybe even days.

The uninterruptible power supply is the first line of defense against power outages and surges, which makes it crucial to the availability of the data center.

The role of the UPS certainly depends on the level of importance of the data center it is protecting, points out Matt Kightlinger, director of solutions marketing for Liebert Products at Emerson Network Power. “The UPS plays a major role, but is not the only consideration in disaster scenarios,” he says. “Other considerations when designing critical power systems include the amount of battery backup time required, an adequate generator and an automatic transfer switch for extended outages.”

As with other IT components, it’s wise to ensure redundancy for power equipment. “Specifically for data centers under 2,500 square feet, there is a growing trend of N+1 UPS designs,” says Kightlinger. These designs integrate multiple, smaller UPS systems, removing single points of failure, he says.
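The N+1 idea reduces to a short calculation: work out how many modules (N) the load needs, then add one spare so that any single module can fail without dropping the load. The sketch below assumes identical modules sharing the load evenly; the load and module ratings are hypothetical, and real sizing would also account for power factor, growth and battery runtime.

```python
import math


def ups_modules_needed(load_kw, module_kw, spares=1):
    """Return N + spares, where N modules are just enough to carry the load."""
    if load_kw <= 0 or module_kw <= 0:
        raise ValueError("load and module rating must be positive")
    n = math.ceil(load_kw / module_kw)
    return n + spares


def survives_one_failure(load_kw, module_kw, installed):
    """True if the remaining modules still cover the load after one fails."""
    return (installed - 1) * module_kw >= load_kw


if __name__ == "__main__":
    # Hypothetical 40 kW load served by 15 kW modules.
    print(ups_modules_needed(load_kw=40, module_kw=15))
```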

Power gear also needs regular monitoring, servicing and testing, Kightlinger says. “IT or data center managers should be scheduling regular preventative maintenance on all components of the critical power system.”

Market Dynamics

Even in tight economic times, companies need to keep up with what’s happening in the disaster preparedness and recovery marketplace because these technologies continue to evolve rapidly, the Burton Group’s Jones says. His advice: “Reassess the products and solutions on the market once a year. Commoditization is bringing DR technologies into the reach of small businesses quickly.”

Josh Ritchie/Aurora
