May 20 2008

What Keeps You Up At Night

IT managers share some of their worst nightmares and how to prevent them.

Photo: Shane van Boxtel
Grenzebach Glier CIO Tony Daniels says unforeseen incidents — such as a notebook that leaves the office and never comes back — make him nervous.

Worry is part of his job description, says Tony Daniels, CIO at fundraising consultancy Grenzebach Glier and Associates in Chicago. There are several hundred work-related scenarios that give him pause, but only a few of them qualify as genuine nightmares.


“Technology is essential for companies and makes so many things run better,” says Daniels. “But the other side is that IT managers always have to be thinking about risk — and there’s a lot to think about that can actually be pretty scary.”

The demons that haunt IT shops come in many shapes and sizes, ranging from obvious catastrophes (such as fires and floods) to seemingly minor lapses (a missed backup or weak password) that can lead to major calamities. The imperative to prevent those nightmares becomes more urgent as small and midsize companies become increasingly dependent on IT.

“We have all of our claims and eligibility data on our claims-processing servers,” says Todd Henson, director of IT services at Prairie States Enterprises, a third-party medical claims administrator based in Chicago that handles benefits for self-insured companies. If the transaction servers went down, business would stop.

“We live and breathe and die by our systems,” says Mark Hargrove, COO and CIO of Fresh Produce Sportswear in Boulder, Colo., a clothing manufacturer and distributor. “We have passed the point where we can conduct business if we don’t have our systems up and running. We can’t take orders, we can’t ship products, we can’t allocate inventory — we come to a standstill.”

Calculating what’s at stake as he devises ways to prevent disaster is pretty straightforward, says Martin Szalay, director of IT at Food Warming Equipment in Crystal Lake, Ill.

“I think about it as 130 people’s jobs and $30 million worth of transactional data that I’m responsible for,” Szalay says.

Adding to the pressure, IT managers at small and midsize companies usually have fewer resources to help them head off or mitigate disaster. Expensive, labor-intensive preventive measures are often simply out of reach for SMBs, leaving their IT managers to fret and come up with creative alternatives.

Photo: Todd Winters
"If our servers went down we'd have some real problems."
— Todd Henson, Prairie States Enterprises

Finding technology to provide the security and redundancy her company needs while staying within her IT budget is an ongoing struggle, says Kelly Johnston, senior vice president of technology and product development at Health Advocate in Plymouth Meeting, Pa. The company helps the employees of its business customers navigate the complexities of the health-care system.

“For us it feels constant — we never have enough servers, enough backup appliances, enough VPNs,” says Johnston. “We’re always trying to find solutions that function well but aren’t too expensive. It’s a balancing act, and losing the balance can have serious consequences.”

At most SMBs, the job of maintaining that balance falls to lean IT staffs: a few generalists who perform a wide array of tasks that stretch their expertise to the limit.

“Part of the challenge, and the fear, is not having the dedicated skill set on staff to solve some problems easily,” says Robert Booth, a senior systems engineer at Knowledgeable and Innovative Technical Solutions (KITS) in Round Rock, Texas. “You definitely have to be proactive about prevention and think about strategies for when something really bad happens.”

IT managers we spoke to shared five nightmare scenarios — and some advice about how to prevent them from becoming all too real.

The Small Security Slip

It’s not necessarily malevolent intruders looking to rampage through his network that give Grenzebach Glier’s Daniels the most pause. Information-security risks often appear in the guise of accidents, or of commonplace breaches of policy or good practice that seem innocuous.

“There’s the notebook that walks out of the office and never comes back — that’s the sort of thing that makes me anxious,” Daniels says. “When you think of eliminating security risks, you must consider people, process and systems. The temptation is to focus only on the systems, which is probably the wrong place to start.”

Preventing nightmares from becoming reality means identifying business processes and user practices that create risk — such as the proliferation of unencrypted USB keys and recordable media in the workplace — and mitigating the potential threat, says Daniels.

The technologies, policies and practices Daniels has put in place at Grenzebach Glier to protect data include token-based authentication; pass-phrases instead of passwords (phrases are harder to crack and easier for legitimate users to remember); restricting VPN access to company-owned machines, which keeps home computers off the network; disk encryption; and secure FTP for transferring sensitive information.
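The claim that phrases are harder to crack can be checked with rough arithmetic. The sketch below is an illustration only, not a description of Grenzebach Glier's policy. It assumes every character or word is picked uniformly at random, and the character-set size (94 printable symbols) and word-list size (7,776 words, as in a Diceware-style list) are assumptions chosen for the example.

```python
import math

def entropy_bits(choices_per_position: int, positions: int) -> float:
    """Bits of entropy when each position is picked uniformly at random."""
    return positions * math.log2(choices_per_position)

# Assumed sizes, for illustration only.
PRINTABLE_ASCII = 94   # symbols available for a random password
WORDLIST_SIZE = 7776   # words in a Diceware-style list (6**5)

password_bits = entropy_bits(PRINTABLE_ASCII, 8)    # 8 random characters
passphrase_bits = entropy_bits(WORDLIST_SIZE, 5)    # 5 random words

print(f"8-character password: {password_bits:.1f} bits")
print(f"5-word pass-phrase:   {passphrase_bits:.1f} bits")
```

Under these assumptions the five-word phrase comes out at roughly 65 bits, against roughly 52 bits for the eight-character password. Phrases that people choose themselves are less random than this, so the figure is best read as an upper bound.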

Disappearing Data

Any veteran IT manager has faced plenty of crises, but there’s one problem that sparks the ultimate terror, says Food Warming Equipment’s Martin Szalay.

“It’s the data, or rather, losing data,” he says. “I can replace a server or I can run out and get a new switch or a hub. I can research an application and I can even get a developer in India at 2 a.m. to do up a program, but I can’t replace your data.”

Photo: Michael O'Brien
"We try anything we can think of that can go wrong."
— Robert Booth, Knowledgeable and Innovative Technical Solutions

There are many ways to lose data — servers crash, fires and floods happen — and miraculous restorations are rare. Large vendors such as Microsoft can help with recovery software, and Szalay has even used forensic software to restore data after catastrophes. But the only sensible strategy to avert disaster from data loss is relentless and redundant backup.

“What often happens is that people lose data and then discover that they haven’t been backing up or their backups are weak,” Szalay says. “Companies have the same backup systems in place they had two years ago, but of course they’ve got new folders or new databases that aren’t part of that system.”
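The gap Szalay describes, where new folders or databases never make it into an old backup job, can be caught with a periodic coverage check. The following is a minimal sketch under stated assumptions: the paths, the inventory list and the function name `uncovered_paths` are all hypothetical, and a real audit would read them from the backup tool's own configuration.

```python
from pathlib import PurePosixPath

def uncovered_paths(data_paths, backup_roots):
    """Return the data paths that do not sit under any backed-up root."""
    roots = [PurePosixPath(r) for r in backup_roots]
    missing = []
    for p in data_paths:
        path = PurePosixPath(p)
        # Covered if the path is a backup root itself or lives below one.
        if not any(path == r or r in path.parents for r in roots):
            missing.append(p)
    return missing

# Hypothetical inventory: what exists today vs. what the job was set up to copy.
current_data = [
    "/srv/finance/ledger",
    "/srv/sql/orders",
    "/srv/sql/crm_2008",
    "/home/shared/new_projects",
]
backup_job_roots = ["/srv/finance", "/srv/sql/orders"]

gaps = uncovered_paths(current_data, backup_job_roots)
print(gaps)  # the two newer locations that the old job never picked up
```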

Besides keeping the company’s backup plan up to date, Szalay recommends “backup of backups” on a variety of media. For instance, he makes a backup copy of the company’s SQL transactional data to a server, to external disk and to tape. Frequency is important, too. Szalay runs hourly backups of critical data, as opposed to the daily or weekly schedules many other companies use.

“I’m really, really paranoid,” Szalay says. “There’s no such thing as too many backups or doing them too often.”
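An hourly schedule across several media only helps if someone notices when one of the copies quietly stops running. Here is a small sketch of such a freshness check. The target names, timestamps and the `stale_targets` helper are hypothetical, and the one-hour threshold simply mirrors the hourly schedule mentioned above.

```python
from datetime import datetime, timedelta

def stale_targets(last_success, now, max_age):
    """Return names of backup targets whose last good run is older than max_age."""
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)

# Hypothetical last-successful-run times for three backup destinations.
now = datetime(2008, 5, 20, 12, 0)
last_success = {
    "server": datetime(2008, 5, 20, 11, 30),
    "external_disk": datetime(2008, 5, 20, 11, 5),
    "tape": datetime(2008, 5, 20, 9, 0),
}

late = stale_targets(last_success, now, timedelta(hours=1))
print(late)  # targets that have missed the hourly window
```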

Henson of Prairie States agrees, adding that one major priority for his company is to move off tape backups. Part of the company’s new disaster recovery plan is to deploy network-attached storage with external storage drives, with the goal of a more dependable backup site and an end to the time-consuming tape-backup process. Anything is better, he says, than the existing system, in which the company’s IT manager takes tapes home every night.

Malicious Software Attacks

Fresh Produce’s Hargrove has already had a close encounter with a security breach that seized control of his data center. In January 2007, the company was invaded by a “rootkit” and other malware that infected all of the company’s servers and many of its desktops. A rootkit is malware that surreptitiously alters an operating system so an unauthorized user can take control of it.

Once Fresh Produce’s systems had been breached, the marauding rootkit disabled any residual antivirus protection on the company’s network. Proving that trouble often comes in multiples, serious and difficult-to-diagnose hardware problems in the company’s server room complicated the crisis.

“It was beyond awful,” Hargrove says. “We had very sporadic operation of our systems — they would be up for a few hours and then down for a few hours for two months. The IT people were here 16 hours a day for a month and a half.”

Hargrove and his staff finally beat back the attack by isolating the company’s network from the Internet for several days and cleaning out the infection from every corner of the infrastructure.

The hardware problems were eventually traced to faulty power distribution in the server room. They were fixed by isolating the room from the power grid and installing a higher-capacity uninterruptible power supply.

For prevention, Fresh Produce upgraded and increased the number of firewalls at the network’s edge. Core databases and applications are now shielded by at least two firewalls. Hargrove installed new antivirus software and now rotates different products regularly.

“There are strengths and weaknesses to each of the vendors,” Hargrove says. “I think we’re going to change vendors every year now — we’ll get the strongest-rated package in that particular year.”

Although Fresh Produce did not lose data as a result of the infection, Hargrove has also installed a new tape library to speed up and upgrade backups, just to be sure.

Communication Breakdowns

Health Advocate uses a hosted call-center application to operate an around-the-clock phone service for employees of client companies. If the host goes down or the connection is lost, Health Advocate is out of business until it’s restored.

“We connect to the application through a secure Internet connection,” says Health Advocate’s Johnston. “I spend a lot of time worrying about the World Wide Web. That pipe can take 32 hops before it makes our connection. Something might be happening in Chicago that has nothing to do with you as you try to make your way to Cincinnati, and yet Chicago has huge latency problems and you end up not connecting.”
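The worry about hop count can be put in rough numbers. The sketch below assumes, purely for illustration, that every hop is healthy 99.9 percent of the time and that hops fail independently of one another. Both are simplifying assumptions rather than measurements from Health Advocate's connection.

```python
def path_availability(per_hop: float, hops: int) -> float:
    """Probability that every hop is healthy, assuming independent failures."""
    return per_hop ** hops

PER_HOP = 0.999  # assumed availability of each individual hop

for hops in (4, 16, 32):
    chance = path_availability(PER_HOP, hops)
    print(f"{hops:2d} hops: {chance:.2%} chance the whole path is healthy")
```

With these assumed figures, a 32-hop path is fully healthy only about 97 percent of the time, even though each individual hop is far more reliable than that.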

One way Johnston deflects the potential instability of the Internet is by deploying a point-to-point VPN between her company’s headquarters and the host’s data center.

“We’ll obviously still be using the Internet, but we’ll be more insulated from random problems,” she says.

In the case of an outage at the host site or an extended loss of connection, Health Advocate maintains a call-management application in its data center that can be pressed into service in conjunction with the company’s PBX.

“It means you’re paying for licenses and you’re not using them — at least you hope not — but our clients deserve some assurances that we’re doing what we can to make sure our 24 x 7 service is just that,” says Johnston.

Mail-Server Failures

Driving into work and finding the building reduced to rubble is his worst nightmare, but the failure of a mail server is not far behind, says KITS’ Booth.

KITS has several mail servers, including a BlackBerry enterprise server, which causes the greatest consternation of all when there’s a problem, says Booth.

“It’s amazing how important these things are to the people who use them,” he says. “E-mail of all sorts is a critical application, but the BlackBerry makes it more urgent.”

KITS uses thorough backup procedures, so individual messages are rarely lost, but e-mail delays can be almost as frustrating for users as e-mail that’s disappeared.

Booth’s strategy has been to minimize the number and duration of real outages by simulating potential failures on virtual machines built on VMware ESX servers running in the KITS blade center.

“We document and prepare for anything,” he says.

VMware has also been a useful tool in dealing with spam. The virtual environment has allowed Booth to run attacks on a simulation of the KITS network to see if the defenses in place really work.

“You don’t really want to mess with the actual shield on your mail server,” says Booth. “If you put the wrong configuration information in a production system, the urgency tends to go up pretty quickly.”