A Simple Error Can Create a Big Problem
Working in information technology, particularly in security, is often a thankless job. When information security works, there are no system compromises or malware outbreaks, but management and users alike typically fail to realize the efforts information security personnel have invested to ensure that security. Those same managers and users are quick to point fingers and place blame when something does go wrong.
Information security already has a negative reputation in many organizations, and there are enough external threats without creating opportunities for security breaches. With that in mind, it is even more important than usual for information security professionals to be diligent and ensure they do not make mistakes that lead directly to security issues.
There are many common mistakes. A couple of years ago, the SANS Institute compiled a list of the top 10 security mistakes made by IT professionals. The list includes errors such as connecting unpatched or insecure systems to the Internet, providing username and password credentials over the phone without authenticating the caller’s identity, and not running updated antivirus software.
My Worst Mistake
The list from SANS is a useful baseline of serious mistakes, but there is one in particular that stands out as the biggest mistake I have ever made. It occurred while I was working for a dot-com Web site as a jack-of-all-trades network administrator.
My job description included anything and everything related to IT short of actually developing the Web site. I had to rack and stack the equipment in the network server room. I installed, configured and administered all the servers, including the domain controllers, DNS servers, Web servers, e-mail servers and a bleeding-edge IP telephony and fax server. My role ranged from troubleshooting and supporting user desktops to evaluating and procuring equipment, and everything in between.
One function that fell into “everything in between” was to ensure our data was backed up daily. The Web site generated thousands of transactions an hour, and the transaction data plus the accumulated customer information was the lifeblood of the organization.
We invested a great deal of time and money to ensure we had the best tape backup system we could afford. The unit was slick, capable of holding multiple data tapes and robotically switching them out as one filled up to continue backing up without human intervention. We spent weeks learning about the hardware and the software and working to optimize our data backups so that we could back up our data as quickly as possible with as little impact as possible to our users or to the Web site.
We invested hours devising a policy for data retention and a schedule for removing tapes for offsite storage and replacing older tapes to ensure the integrity of the tape, making sure we did not miss a thing. The problem was that we missed one major thing: validating the data on the tape and verifying that we could actually restore the data if necessary.
When a disaster finally struck, we learned the hard way. After a database server crash, we attempted to restore our most recent backup tape only to find out that it did not contain some of the key data we needed. Because we had never tested our ability to restore data, we were unaware that the backup agent was unable to work with files that were open for use, and that most of the time we were actually not backing up much of our most important data. After going deep into the archives, we were finally able to find a backup tape that let us get up and running, but we lost weeks of customer and transaction data in the process.
Learning From My Mistake
In today’s information security environment, you can expand the concept beyond data backups to disaster recovery and business continuity as a whole. Your plan may look good on paper and your day-to-day execution may appear to follow your process, but you need to test it to be sure. Schedule periodic dry runs or tests to walk through potential disaster scenarios and ensure that your plan will work in the real world. The moment a catastrophe strikes is not a good time to learn that it won’t.
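For the backup case specifically, part of such a dry run can be automated. The sketch below is only an illustration, not the tooling we used at the time: it assumes you have restored a backup into a scratch directory and want to compare it, file by file, against the directory it was taken from. The function names and the idea of comparing SHA-256 checksums are my own assumptions for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Compare a test restore against the tree it was backed up from.

    Hypothetical helper: returns (missing, mismatched), where `missing`
    lists relative paths absent from the restore and `mismatched` lists
    relative paths whose contents differ from the source.
    """
    source_root = Path(source_dir)
    restored_root = Path(restored_dir)
    missing, mismatched = [], []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        restored_file = restored_root / relative
        if not restored_file.is_file():
            # The backup silently skipped this file (for example, because
            # it was open when the backup agent ran).
            missing.append(relative.as_posix())
        elif sha256_of(source_file) != sha256_of(restored_file):
            mismatched.append(relative.as_posix())
    return missing, mismatched
```

A non-empty `missing` list is exactly the symptom we failed to catch: files that never made it onto the tape. Mismatches need more judgment, because files on a live system legitimately change after the backup runs; comparing against checksums recorded at backup time would likely be more reliable than comparing against live data.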