Performance Monitoring in HPC Environments: Keeping Pace with the Demands of Big Data

For decades, leading enterprises in science and industry, from retail and manufacturing to genomics research, government and healthcare, have been engaged in an arms race in which the weapon of choice is insight. Decision-makers are constantly seeking deeper, more granular insights into their internal operations and processes, or into their customers, competition and industry.

Big data and deep analytics are essential to any insight-driven organization, but they’re only raw materials. If fully exploited they’ll expose actionable intelligence, but left unprocessed they remain inert, yielding little or no value. Those seeking to mine greater insight rely on high performance computing (HPC) environments to power the conversion of those raw materials into a competitive advantage.

HPC is nothing new, of course, but the difference between HPC environments as they existed decades ago and now lies in capability, complexity and scale. The degree and speed at which today’s technology mines and analyzes big data – and enables users to leverage it – is unprecedented, at least until it’s eclipsed by tomorrow’s technology.

Inexorable advancements in technological capabilities drive organizations with big data and analytics-heavy applications to continually refresh their HPC environments in pursuit of higher throughput and greater compute and disk capacity. So, as limitless as the potential of HPC may be, the job of managing, maintaining and optimizing those environments is getting more challenging, the four most prominent challenges being:

  1. Maximizing the HPC environment’s flexibility
  2. Scaling the HPC environment rapidly
  3. Managing the HPC environment efficiently
  4. Avoiding tool sprawl


Maximizing the HPC environment’s flexibility

One benefit of HPC environments is a shared infrastructure, but when every user accesses the same set of resources, access must be controlled. This responsibility falls to the organization’s IT Manager or systems administrators, who must tailor the infrastructure to the requirements of a specific job.

Flexibility, then, is key, and maximizing that flexibility involves setting – and monitoring – infrastructure performance specifications based on particular jobs and workloads. It also means setting user- or group-level space quotas within a shared file system, for example, and monitoring those quotas to prevent a single entity from consuming an inordinate amount of shared resources.
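As a rough illustration of the quota monitoring described above, the following Python sketch flags users or groups approaching their space quota. The `QuotaUsage` record, the owner names and the 90 percent threshold are hypothetical and chosen only for this example; a real deployment would read usage figures from the file system's own quota reporting.

```python
from dataclasses import dataclass


@dataclass
class QuotaUsage:
    """Hypothetical record of one user's or group's usage on a shared file system."""
    owner: str        # user or group name (illustrative)
    used_gib: float   # space currently consumed
    limit_gib: float  # space quota assigned to this owner


def over_threshold(usages, threshold=0.9):
    """Return (owner, fraction_used) pairs at or above the threshold, highest first."""
    flagged = []
    for usage in usages:
        if usage.limit_gib <= 0:
            continue  # no quota set for this owner, nothing to compare against
        fraction = usage.used_gib / usage.limit_gib
        if fraction >= threshold:
            flagged.append((usage.owner, fraction))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up sample data for illustration only.
quota_sample = [
    QuotaUsage("genomics", 950.0, 1000.0),
    QuotaUsage("retail", 200.0, 1000.0),
    QuotaUsage("alice", 480.0, 500.0),
]

for owner, fraction in over_threshold(quota_sample):
    print(f"{owner} is at {fraction:.0%} of quota")
```

Sorting by fraction used rather than by absolute size surfaces the owners closest to their limit first, which is usually what an administrator wants to see.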

Many organizations rely on third-party workload management platforms and tools to flex their HPC environment and limit individual or group resource consumption as necessary for batch jobs.

Scaling the HPC environment

Internal customer demands dictate the requirements for every component of an organization’s on-premises HPC environment, from storage tiering to compute capacity. IT cannot afford to be blindsided. Nonetheless, if suddenly presented with a multi-day job requiring CPU and storage resources that simply are not available, IT teams need to respond quickly by scaling out storage, compute, network or any other component of the HPC environment.
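To make the scaling problem concrete, here is a minimal Python sketch that compares a job's requested resources with what is currently free and reports the shortfall. The resource names (`cpu_cores`, `storage_tib`) and all figures are assumptions made up for this example, not values from any particular scheduler.

```python
def shortfall(requested, available):
    """Return, per resource, how far the request exceeds what is available.

    Resources that can be satisfied are omitted from the result.
    """
    return {
        name: amount - available.get(name, 0)
        for name, amount in requested.items()
        if amount > available.get(name, 0)
    }


# Hypothetical multi-day job request and current free capacity.
job_request = {"cpu_cores": 512, "storage_tib": 40}
free_capacity = {"cpu_cores": 384, "storage_tib": 64}

gap = shortfall(job_request, free_capacity)
if gap:
    print("Scale-out needed:", gap)
else:
    print("Job fits within current capacity")
```

An empty result means the job fits; anything else is a starting point for deciding which component to scale out.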

Recognizing that it could take weeks or months to scale an HPC environment and deploy the necessary physical hardware on-premises, organizations avail themselves of numerous tools that augment and accelerate HPC environment scalability. For example, IBM’s Spectrum Scale cluster file system enables tiered storage sharing on-premises or, through the solution’s “Transparent Cloud Tiering” feature, in the cloud.

Efficiently managing the HPC environment

Infrastructure teams often find themselves spending too many work cycles acting as systems administrators, managing their organization’s HPC cluster and maintaining performance data not visible to the end-user, rather than serving those end-users by improving their overall experience – ensuring consistency and mitigating latency across the HPC environment.

There are numerous HPC tools purpose-built to lessen that sys admin burden – to manage HPC clusters. For instance, IT teams can use IBM’s Extreme Cloud Administration Toolkit (xCAT), a computing management and provisioning tool, not only for day-to-day management, but also to deploy and manage HPC clusters in a uniform way through a unified interface, performing image builds for Linux and configuring disks, networks and file systems.

Avoiding tool sprawl

With so many third-party intelligent workload and policy-driven resource/file management tools, like IBM Spectrum Scale, dedicated to deploying and managing HPC clusters – and so many configuration management tools that provision environments, deploy applications and maintain infrastructures – it’s easy for IT organizations with HPC environments to experience tool sprawl.

Consolidating performance monitoring with Galileo Performance Explorer

No two HPC environments are identical, so there is no tool-specific out-of-the-box monitoring solution that provides an end-to-end view of performance across the entire environment. For instance, tools that come native with IBM Spectrum Scale will only monitor performance stats for that particular cluster.

Many HPC environments, in fact, comprise multiple clusters, and each of those clusters will have its own dedicated performance management solution, collecting high-level statistics for CPU, memory, disk and network utilization for that cluster only. IT is compelled to jump from screen to screen to try to draw correlations across different tools, which complicates and delays issue resolution and frustrates users experiencing latency issues.
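The cross-cluster correlation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cluster names, metric names and utilization percentages are invented, and the function does not represent how any particular monitoring product works. It simply merges per-cluster readings into one view so the busiest cluster for each metric is visible at a glance.

```python
from statistics import mean


def merge_cluster_stats(per_cluster):
    """Combine {cluster: {metric: percent}} into one summary per metric.

    For each metric, report the mean across clusters, the highest reading,
    and which cluster produced that highest reading.
    """
    metrics = sorted({m for stats in per_cluster.values() for m in stats})
    merged = {}
    for metric in metrics:
        # Only clusters that actually report this metric take part.
        readings = {c: s[metric] for c, s in per_cluster.items() if metric in s}
        worst = max(readings, key=readings.get)
        merged[metric] = {
            "mean": mean(readings.values()),
            "max": readings[worst],
            "worst_cluster": worst,
        }
    return merged


# Hypothetical utilization percentages from two clusters.
cluster_sample = {
    "cluster-a": {"cpu": 92.0, "memory": 60.0, "disk": 40.0, "network": 20.0},
    "cluster-b": {"cpu": 48.0, "memory": 70.0, "disk": 85.0, "network": 30.0},
}

for metric, summary in merge_cluster_stats(cluster_sample).items():
    print(metric, summary)
```

Even a summary this simple shows why a combined view helps: a CPU hotspot on one cluster and a disk hotspot on another appear side by side rather than on separate screens.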

Some IT organizations choose to move beyond those cluster- and tool-specific performance monitoring solutions and pursue a build-your-own strategy. This approach requires they dedicate significant time and resources to customizing and maintaining their own performance monitoring dashboards, while also collecting and managing the actual performance data.

There’s a third alternative, however: a single tool, Galileo Performance Explorer, which monitors performance across the entire HPC environment and every cluster, providing IT managers with a single-pane-of-glass view of end-to-end, granular, aggregated performance metrics.

Galileo Performance Explorer technology is vendor-agnostic, with agents monitoring server, storage, SAN, database and cloud environments for a wide range of vendor solutions, including IBM Spectrum Scale. The Galileo solution is cloud-based and maintains all historical data at a granular level, freeing IT managers to analyze performance trends and issues across their HPC environment for their predictive and preventive value.
