When systems slow down, employees are angry because they can’t get their work done, and revenues and corporate reputations are at stake. You’re in the spotlight and, sadly, not in a positive way.
IBM AIX server monitoring solutions promise to help you detect and troubleshoot server issues. But are they enough? Sometimes your server monitoring tool may fail to find the cause of the slowdown. Why? Not because it’s a weak tool; the issue could be hiding somewhere else in the IT environment.
To demonstrate the conundrum, let’s look at one of our customers who had been experiencing a slowdown on a project for a few days. To perform an analysis like the one described here, you need an infrastructure performance monitoring (IPM) tool that oversees your entire IT environment, includes an analytics dashboard and allows you to tag assets to create groups. This customer was using Galileo Performance Explorer®.
Using your IPM tool, use the tagging feature to consolidate the IT assets that support a project. The IT environment for this customer’s project included two storage subsystems and three servers.
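Conceptually, tagging lets the tool collect heterogeneous assets under a single project label so they can be analyzed together. Here is a rough, tool-agnostic sketch in Python of what that grouping amounts to; the asset names and the "project-x" tag are hypothetical, not the customer’s actual inventory:

```python
from collections import defaultdict

# Hypothetical inventory: (asset name, asset type, tags).
# Names and the "project-x" tag are illustrative only.
assets = [
    ("aix-host-01", "server",  {"project-x"}),
    ("aix-host-02", "server",  {"project-x"}),
    ("aix-host-03", "server",  {"project-x"}),
    ("v3700-a",     "storage", {"project-x"}),
    ("v3700-b",     "storage", {"project-x"}),
    ("aix-host-99", "server",  {"other-project"}),
]

def group_by_tag(assets, tag):
    """Collect assets carrying a tag, grouped by asset type."""
    groups = defaultdict(list)
    for name, kind, tags in assets:
        if tag in tags:
            groups[kind].append(name)
    return dict(groups)

project = group_by_tag(assets, "project-x")
print(project)  # three servers and two storage subsystems, as in the case study
```

The point of the grouping is scope: every later query (read service times, port errors) runs only against the assets that actually support the project.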
To learn how the servers are performing, go to your analytics dashboard and run a report on the AIX servers that support the project. Since our customer had started to experience the sluggish activity during the last three days, we looked at data for that period. We immediately identified two AIX hosts whose read service times had begun climbing above the ten-millisecond threshold late on the evening of September 12, coinciding with the slowdown.
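The check itself is simple: compare each host’s read-service-time samples against the 10 ms threshold and flag any host that crosses it. A minimal sketch, with made-up sample values standing in for the per-interval data a real IPM tool would supply:

```python
# Flag hosts whose disk-read service time exceeds a threshold.
# Sample values (in milliseconds) are illustrative, not real telemetry.
READ_SVC_THRESHOLD_MS = 10.0

samples_ms = {
    "aix-host-01": [2.1, 3.4, 14.8, 22.5],  # climbs past the threshold
    "aix-host-02": [1.9, 2.7, 13.2, 19.9],  # same pattern, same window
    "aix-host-03": [2.0, 2.2, 2.4, 2.3],    # stays healthy
}

def hosts_over_threshold(samples, threshold_ms):
    """Return hosts with any read-service-time sample above the threshold."""
    return sorted(
        host for host, times in samples.items()
        if max(times) > threshold_ms
    )

print(hosts_over_threshold(samples_ms, READ_SVC_THRESHOLD_MS))
# two of the three hosts breach 10 ms, mirroring the case study
```

In the customer’s environment the dashboard did this comparison for them; the sketch just makes the flagging rule explicit.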
Reviewing the disk charts, we saw that late on the 12th the disk-read transfers on the first host plummeted while the disk-read service times skyrocketed.
But what was happening on the second AIX server? It turns out that it was exhibiting the same symptoms.
Because two hosts had shown the same problem at the same time, we concluded that the hosts were not the cause of the slowdown.
At this point, without an IPM that monitored their whole environment, we would have reached a dead end. There would be no quick answers. We determined, however, that the common denominators of the slowdown were the storage subsystems and SAN switches, suggesting that further investigation was in order.
Looking at the analytics for the storage subsystem, we learned it had experienced node errors in the same timeframe in which the servers started to act up.
The ports had started averaging over 10,000 errors per minute at around eight o’clock on September 12.
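That "over 10,000 errors per minute" figure comes from averaging the deltas of a per-port error counter over the sampling window. A small sketch of that calculation, with illustrative counter values rather than the customer’s actual telemetry:

```python
# Average port errors per minute from a monotonically increasing error
# counter, then flag ports above a rate limit. Counts are illustrative.
ERROR_RATE_LIMIT = 10_000  # errors per minute

# Counter samples taken one minute apart, per port.
port_error_counts = {
    "port-1": [0, 12_500, 24_800, 37_900],  # errors spiking after the incident
    "port-2": [0, 3, 5, 9],                 # normal background noise
}

def avg_errors_per_minute(counts):
    """Average per-minute increase of a cumulative error counter."""
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    return sum(deltas) / len(deltas)

noisy_ports = [port for port, counts in port_error_counts.items()
               if avg_errors_per_minute(counts) > ERROR_RATE_LIMIT]
print(noisy_ports)  # only the spiking port is flagged
```

A sustained rate that high points to a physical-layer problem rather than load, which is exactly where the investigation went next.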
This information led the customer to check the ports on their V3700 and the SAN switches for any issues. When they did, they discovered that an employee had pulled the wrong cable from the SAN switch, which accounted for the port errors.
Fixing the problem was easy, but finding it could have been a lot more challenging without the right tools. IBM AIX server monitoring alone would not have helped to resolve this slowdown. We needed to create a virtual group of assets that supported the project and to view the data and analytics that reflected their behavior. By doing so, we were able to zero in quickly on the issue and resolve it.