The data center is the brain of an organization. Its loud, chilly confines house the neuron-like servers, connected by a network of cables much like the human brain’s own synapses. And like a brain, a data center must run 24/7 to serve the business. Keeping up that always-on pace requires constant monitoring.
There are many types of data center monitoring. Typical approaches include physical and logical security; environmental monitoring for temperature and humidity; and service monitoring, the focus of this article.
Service monitoring keeps tabs on the services your data center provides to your customers. It could be a network share connected to a file server for your employees, a web server farm powering an e-commerce site for retail customers, or a VPN connection for an important vendor.
To truly understand service monitoring, it’s important to drop the legacy IT mentality of defining monitoring points in terms of servers instead of services. Today’s data centers can have tens of thousands of servers providing hundreds of applications and services, which produces an enormous number of data points.
With today’s emphasis on virtualization, cloud computing, distributed applications and logical storage, the old model no longer fits. Services are spread across too many hosts for any single server-level monitor to reflect their health. Instead, the goal is to combine sets of unit monitors into service monitors.
So who cares if a single node goes down in a load-balanced cluster? The customer experience isn’t affected. What’s the big deal if a couple of disks go bad in the storage area network (SAN)? There are dozens more in the SAN to keep the lights on. An organization should monitor the service, rather than the server, to cut down on false positives and keep staff from distrusting the monitoring solution.
Each service can be thought of as a hierarchy of objects with dozens or hundreds of relationships among them. For example, organizations typically provide a web interface for internal email. The concept is extremely simple, but once it is broken into its logical pieces, it becomes anything but.
The service depends on reliable network connectivity, which moves requests from client desktops to the data center. Once there, requests hit a load-balancing appliance, which then divvies up the connections to a group of front-end servers that run on storage provided by a SAN with dozens of physical disks. The chain goes on and on.
Each of those objects plays a crucial role in delivering the service to the customer. It’s critical to define each object properly and to establish clear relationships and dependencies among the pieces to get a holistic view of what it truly means for a service to be “up” or “down.”
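As a minimal sketch of that idea, the webmail chain can be modeled as a small object hierarchy in Python. The `Component` class, the `is_up` function and the component names are hypothetical illustrations, not part of any monitoring product. The sketch assumes each object already has a pass/fail result from its own check, and that a load-balanced pool counts as up when a minimum number of its members are up.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """One object in a service hierarchy (hypothetical model)."""
    name: str
    healthy: bool = True  # result of this object's own check
    requires: list[Component] = field(default_factory=list)  # all must be up
    pool: list[Component] = field(default_factory=list)  # redundant members
    min_pool_up: int = 0  # how many pool members must be up


def is_up(c: Component) -> bool:
    """Up if its own check passes, every hard dependency is up,
    and enough members of its redundant pool are up."""
    if not c.healthy:
        return False
    if not all(is_up(dep) for dep in c.requires):
        return False
    return sum(1 for m in c.pool if is_up(m)) >= c.min_pool_up


# Webmail: network -> load balancer -> front-end servers -> SAN
san = Component("san")
servers = [Component(f"web{i}", requires=[san]) for i in range(1, 5)]
load_balancer = Component("load-balancer", pool=servers, min_pool_up=2)
network = Component("network")
webmail = Component("webmail", requires=[network, load_balancer])

servers[0].healthy = False  # one front-end node fails
print(is_up(webmail))  # True: three of four nodes still serve traffic
san.healthy = False  # shared storage fails
print(is_up(webmail))  # False: every front-end server depends on the SAN
```

The failed front-end node does not take the service down, which matches the earlier point about load-balanced clusters; the SAN failure does, because every front-end server depends on it.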
The terms up and down are relative. For example, admins generally consider a server up if another host can ping it. Suppose a customer calls to complain that a server is down. The admin pings the server, gets a response and calls the user back to say they are wrong. The two are working from different definitions of up, and the result is a communication breakdown.
If the user and the admin shared a definition of up, issues could be resolved sooner. That is where service monitoring comes into play. With it, when a user calls IT to report that a server is down, the admin asks which application the user was trying to reach and then checks that service, not a server, to see whether a problem exists.
So how do we get to this point? In the terminology of Microsoft System Center Operations Manager (SCOM), the answer involves creating numerous unit monitors and stitching them together with aggregate and dependency monitors.
A unit monitor performs a single check: it can ping a server’s IP address, check whether a particular service is running or ensure that disk utilization doesn’t climb too high. Once all unit monitors are defined, dependencies can be established between them. For example, if a firewall that sits in front of a server goes down, there’s no need to alert on the server as well; the server is unreachable only because of the firewall, which is the real problem.
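The Python sketch below shows unit monitors with a dependency between them. `UnitMonitor`, `disk_probe` and `evaluate` are hypothetical names, and the probes are stubbed with lambdas rather than real ping or service checks; this is an illustration of the concept, not how SCOM implements monitors. The rule it demonstrates is that a failing monitor whose upstream dependency has also failed is suppressed, so only the root cause raises an alert.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UnitMonitor:
    """A single check, e.g. ping, 'is the service running?' or a disk threshold.
    `probe` is a hypothetical callable that returns True when the check passes."""
    name: str
    probe: Callable[[], bool]
    depends_on: Optional[UnitMonitor] = None  # upstream object, e.g. a firewall


def disk_probe(used_pct: float, limit_pct: float = 90.0) -> Callable[[], bool]:
    """Build a threshold check that passes while utilization stays under the limit."""
    return lambda: used_pct < limit_pct


def evaluate(monitors: list[UnitMonitor]) -> dict[str, str]:
    """Return 'ok', 'alert' or 'suppressed' for each monitor in the list.
    A failing monitor is suppressed when any ancestor in its dependency chain
    (which must also be in the list) has failed too."""
    failed = {m.name for m in monitors if not m.probe()}  # run each probe once
    results = {}
    for m in monitors:
        if m.name not in failed:
            results[m.name] = "ok"
            continue
        upstream = m.depends_on
        # Walk up the chain until a failed ancestor is found or the chain ends.
        while upstream is not None and upstream.name not in failed:
            upstream = upstream.depends_on
        results[m.name] = "suppressed" if upstream is not None else "alert"
    return results


firewall = UnitMonitor("firewall ping", probe=lambda: False)  # firewall is down
web_ping = UnitMonitor("web01 ping", probe=lambda: False, depends_on=firewall)
web_svc = UnitMonitor("web01 service running", probe=lambda: False, depends_on=web_ping)
web_disk = UnitMonitor("web01 disk", probe=disk_probe(used_pct=72.5))
print(evaluate([firewall, web_ping, web_svc, web_disk]))
# {'firewall ping': 'alert', 'web01 ping': 'suppressed', 'web01 service running': 'suppressed', 'web01 disk': 'ok'}
```

Here the firewall raises the only alert. The server’s ping and service monitors fail too, but they are suppressed because the firewall they depend on is down.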
Aggregate monitors are created by grouping various unit monitors together, and they are the essence of true service monitoring. They add a level of intelligence that traditional server-by-server monitoring can’t offer: a smart monitor that alerts not on every little blip, but only in situations that matter.
An aggregate monitor might group ping, CPU utilization, disk utilization, memory utilization and network utilization monitors for 10 servers in a Citrix farm. One or more unit monitors may trip, but instead of sending out alerts, they report their state to the aggregate monitor, which then decides whether to alert.
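A minimal Python sketch of such an aggregate monitor follows, assuming each unit monitor reports one of three health states. The per-server rollup takes the worst state among that server’s unit monitors; the farm-level rule (alert only when more than a set percentage of servers are critical) is a hypothetical policy chosen for illustration, not SCOM’s built-in behavior. All names (`Health`, `server_health`, `farm_should_alert` and the server names `ctx01` to `ctx10`) are invented for the example.

```python
from __future__ import annotations

from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def server_health(unit_states: dict[str, Health]) -> Health:
    """Roll up one server's unit monitors: the server is as bad as its worst one."""
    return max(unit_states.values(), default=Health.HEALTHY)


def farm_should_alert(farm: dict[str, dict[str, Health]],
                      max_critical_pct: float = 30.0) -> bool:
    """Aggregate monitor for the whole farm (hypothetical policy): individual
    servers may go critical, but alert only when more than max_critical_pct
    percent of the farm is critical at the same time."""
    if not farm:
        return False
    critical = sum(1 for units in farm.values()
                   if server_health(units) == Health.CRITICAL)
    return 100.0 * critical / len(farm) > max_critical_pct


metrics = ["ping", "cpu", "disk", "memory", "network"]
farm = {f"ctx{i:02d}": {m: Health.HEALTHY for m in metrics} for i in range(1, 11)}
farm["ctx03"]["cpu"] = Health.CRITICAL  # one busy server: absorbed by the farm
farm["ctx07"]["disk"] = Health.WARNING
print(farm_should_alert(farm))  # False: 1 of 10 servers critical
for name in ["ctx01", "ctx02", "ctx04"]:
    farm[name]["ping"] = Health.CRITICAL  # now 4 of 10 servers are critical
print(farm_should_alert(farm))  # True
```

The threshold is a design choice: a lower percentage catches farm-wide problems earlier but brings back some of the noise that aggregate monitors are meant to filter out.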
When implementing an enterprise monitoring solution, remember that data centers are complicated. Don’t oversimplify the task of monitoring. Doing so inevitably leads to an abundance of unnecessary alerts, which eventually lead to distrust. When monitoring can’t be trusted, it becomes just another daily nuisance.