When a large hospital’s computer system went down at two in the morning, essential information was at stake.
Most organizations cannot afford this kind of interruption, least of all one that is responsible for patients’ health. Perhaps the stakes are not as high in your company. Even so, operations need to run smoothly to maximize revenue and profitability. So it’s critical to monitor the right data and metrics on the performance of your servers, storage, and storage area network (SAN). Having this information at your fingertips enables you to take preventive action and avoid slowdowns and shutdowns.
The Right Metrics and Data
To understand how well your servers are functioning, you need to monitor CPU and memory usage, network and adapter performance, and disk space. For storage, the top areas to watch are throughput, IOPS, and latency. Lastly, for the SAN, you want to keep an eye on throughput, latency, and errors. Underneath each of these umbrella areas, you need to be able to drill down to a subset of important metrics.
What you decide to monitor most closely depends on your business needs. For example, if you have a CPU-intensive application, you’ll likely want to keep a close watch on CPU usage. Since mid-sized and large organizations run many varied applications at the same time, there are probably multiple metrics that are important for your organization to track.
The Big Picture
The nature of information technology is that it is interconnected. And yet corporations often manage it as if each server and storage device exists on an island. When you monitor technologies in isolation, however, you miss pieces of the puzzle that are essential to problem-solving. That’s because any part of your IT infrastructure might cause a bottleneck that creates a problem somewhere else.
For instance, if a server administrator watches servers but no one oversees the network they’re connected to, complications can occur. If five servers develop problems at once, the administrator has to deduce that the servers themselves are probably not at fault, because it’s unlikely for all of them to fail at the same time. Taking it a step further, the admin might infer that the problem lies in the network to which all the servers are connected, a difficulty that may have gone unnoticed because no one is monitoring the network.
The same dependencies can affect storage devices. If your company has two or three storage devices that all show high latency, for example, you might question what’s going on. Because it’s highly improbable that every device has the same problem concurrently, you naturally consider whether there’s an issue with the SAN fabric. You’re coming to this conclusion indirectly, however, which takes time. Also, if no one is monitoring the SAN, you still may not know where the problem resides.
These examples illustrate that to approach problems in an intelligent way, you need to monitor servers, storage, and the SAN simultaneously, so that you can pinpoint bottlenecks immediately rather than relying on deduction or the process of elimination.
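The deduction described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the topology map, the component names, and the `shared_suspects` function are hypothetical and do not come from any particular monitoring product.

```python
from collections import Counter

# Hypothetical topology: each component maps to the shared resources it depends on.
DEPENDS_ON = {
    "server-1": {"switch-a"},
    "server-2": {"switch-a"},
    "server-3": {"switch-a"},
    "server-4": {"switch-b"},
    "array-1": {"san-fabric-1"},
    "array-2": {"san-fabric-1"},
    "array-3": {"san-fabric-1"},
}


def shared_suspects(alerting, depends_on, min_share=1.0):
    """Return shared resources that at least `min_share` of the alerting
    components depend on. A single alerting component yields no suspects,
    because there is nothing to correlate it with."""
    known = [c for c in alerting if c in depends_on]
    if len(known) < 2:
        return []
    # Count how many alerting components sit behind each shared resource.
    counts = Counter(dep for c in known for dep in depends_on[c])
    return sorted(dep for dep, n in counts.items() if n / len(known) >= min_share)


if __name__ == "__main__":
    # Three arrays showing high latency at once point at the fabric they share.
    print(shared_suspects(["array-1", "array-2", "array-3"], DEPENDS_ON))
```

With the SAN and network monitored directly, this kind of inference becomes a cross-check rather than the only way to find the bottleneck.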
Spikes and Dips from the Norm
Whatever metrics you determine to be important, the key is to monitor them over time, observe the regular patterns, determine what impacts performance, and spot deviations as they occur.
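One simple way to spot a deviation from the regular pattern is to compare each new sample with the mean and standard deviation of recent history. The sketch below assumes a plain list of past readings and a threshold of three standard deviations; the function name and the threshold are illustrative choices, not a recommendation for any specific metric.

```python
from statistics import mean, stdev


def is_deviation(history, latest, threshold=3.0):
    """Return True if `latest` lies more than `threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


if __name__ == "__main__":
    memory_pct = [50, 52, 49, 51, 50, 48, 50]  # hypothetical daily readings
    print(is_deviation(memory_pct, 51))
    print(is_deviation(memory_pct, 80))
```

Real workloads often have daily or weekly cycles, so in practice the baseline would usually be taken from comparable time windows rather than from all history at once.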
If you typically use half a server’s memory, for instance, and usage starts to inch up day after day, you may predict that you will run out of memory in a few weeks or a month. With that insight, you can take action today to deal with the memory leak before there’s an outage, perhaps by killing a process or rebooting the system. Being proactive rather than reactive saves your company time and money and makes your life easier.
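The memory example amounts to fitting a trend line and asking when it reaches capacity. Here is a minimal sketch, assuming one usage reading per day expressed as a percentage and a roughly linear climb; `days_until_full` is a hypothetical helper, and a straight-line fit is a simplification of how real usage grows.

```python
def days_until_full(daily_usage_pct, capacity_pct=100.0):
    """Estimate days remaining until usage reaches capacity, using a
    least-squares straight line through one reading per day.
    Returns None if there is too little data or usage is not rising."""
    n = len(daily_usage_pct)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage_pct) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage_pct))
    slope = sxy / sxx  # percentage points gained per day
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index at which the fitted line hits capacity, measured from the last reading.
    day_full = (capacity_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    usage = [50, 51, 52, 53, 54]  # hypothetical: one point higher each day
    print(days_until_full(usage))
```

At one percentage point per day from 54 percent, the fitted line reaches 100 percent in 46 days, which is the kind of early warning that leaves time to act.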
Another problem that weighs heavily on IT infrastructures is organizational growth. As your business expands, there will be more demand placed on its servers and storage. If you don’t make adjustments to deal with the increased usage, you will hit a wall one day. Instead of blindly driving towards it, you need to see it on the horizon as it emerges ahead of you.
If you cannot project the future, you run the risk of running out of CPU capacity, for example. You cannot instantly expand CPU. You need to obtain a quote, buy the product, and have it shipped, installed, and configured. All that could take a month. The same goes for storage. As your organization grows, there will be more demand on throughput and IOPS. In turn, latency will increase, and this may slow down business processes.
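The procurement delay turns a capacity forecast into a deadline: the order has to go out at least one lead time before capacity runs out. A small sketch of that arithmetic follows, assuming a 30-day lead time to match the month mentioned above; the function name and the figures in the example are hypothetical.

```python
from datetime import date, timedelta


def latest_order_date(today, days_until_exhausted, lead_time_days=30):
    """Return the last date on which procurement can start so that new
    capacity is installed before the projected exhaustion date.
    If the lead time already exceeds the remaining runway, return `today`."""
    slack_days = int(days_until_exhausted) - lead_time_days
    return today + timedelta(days=max(0, slack_days))


if __name__ == "__main__":
    # Hypothetical projection: CPU capacity runs out in about 90 days.
    print(latest_order_date(date.today(), 90))
```

When the function returns today's date, the organization is already inside the lead-time window and some slowdown before the new capacity arrives is likely.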
The Right Tools Deliver the Right Metrics
To monitor your whole IT infrastructure at one time, you need the right tools. Otherwise, it can take you days, weeks, or months to find and resolve a problem. All the while, IT staff spend hour after hour trying to locate the problem, and part or all of the organization is hampered in conducting business.
To learn how Galileo’s cloud-based infrastructure performance management tool can help you, schedule a demo now.