In the digital age of connected systems, networks and ecosystems, companies conduct business 24-7 irrespective of physical location or time zones. Corporations expect their systems to be “always on” and their users expect data to be immediately accessible.
Unplanned downtime—even for a few minutes—is risky, expensive and unacceptable.
When servers, OSes or applications fail for any reason, productivity ceases. Unplanned downtime can also result in lost, damaged or destroyed data. When systems are inaccessible for longer durations, a domino effect occurs as the organization’s customers, business partners and suppliers are likewise unable to access data to conduct business and process transactions.
Downtime is also costly: Businesses stand to lose thousands to even millions of dollars per minute. A 2016 report from the Ponemon Institute found that the average total cost of a data center outage is $740,357 (bit.ly/2nrfa3w). In polling 63 U.S. data centers, the study also found that the price of downtime ranges from $593 to $17,244 per minute, depending on the specific incident and vertical market sector.
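To make the per-minute figures concrete, the short sketch below multiplies an outage's duration by the low and high per-minute costs quoted above. The function name and the 30-minute example are illustrative assumptions, not part of the Ponemon study, and a real estimate would depend on the specific incident and industry.

```python
# Per-minute cost bounds quoted from the 2016 Ponemon Institute report.
LOW_COST_PER_MINUTE = 593
HIGH_COST_PER_MINUTE = 17_244


def outage_cost_range(minutes):
    """Return (low, high) estimated cost in dollars for an outage.

    Hypothetical helper: it simply scales the quoted per-minute
    bounds by the outage duration.
    """
    if minutes < 0:
        raise ValueError("outage duration cannot be negative")
    return minutes * LOW_COST_PER_MINUTE, minutes * HIGH_COST_PER_MINUTE


if __name__ == "__main__":
    low, high = outage_cost_range(30)
    print(f"30-minute outage: ${low:,} to ${high:,}")
```

Under these assumptions, a 30-minute outage would cost between $17,790 and $517,320.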
Top Causes of Downtime
Devices, applications and networks can become unavailable for a wide variety of reasons.
The chief culprits of unplanned downtime are: human error; security; hardware- and software-related problems; interoperability and migration issues; and natural disasters.
Information Technology Intelligence Corp.’s (ITIC) 2018 Global Server Hardware, Server OS Reliability Survey (ibm.co/2LCX6gz), which surveyed more than 800 customers worldwide, found that 59 percent of respondents cited human error as the No. 1 cause of unplanned downtime (see Figure 1). Security ranked a close second (51 percent) as a cause of server hardware, OS and application outages. Other causes of downtime include software bugs and flaws (29 percent); inadequate server hardware (22 percent); and complexity in configuring and provisioning new applications (21 percent).
Business decisions also play a pivotal role in exacerbating or mitigating downtime. Factors include: failure to allocate the necessary funds to upgrade systems and applications; failure to provide crucial training and certification for IT and security administrators; and failure to implement computing policies and procedures such as performing regular backups and having a comprehensive disaster recovery (DR) plan in place. External business factors, like regulatory compliance violations in finance, agriculture, healthcare and transportation, can result in on-site inspections and litigation, forcing organizations to shutter operations for days or weeks before the situation is resolved.
“The root causes of downtime are the same as they were 20 or 30 years ago. But there are more of them,” says Andrew Baker, president of Gassaway, West Virginia-based Brainwave Consulting, which specializes in IT operations and cybersecurity.
The prevalence of technologies like virtualization, cloud computing, mobility and the Internet of Things (IoT), which link servers, applications, devices and people, potentially heightens the risk and severity of downtime occurrences. “Now that so many systems are interconnected, there’s a much higher risk of collateral damage: more servers and devices can get taken out at once, even if they’re isolated in containers,” Baker says.
Post-outage remediation is time-consuming and costly. “Unplanned outages are a risky business and they raise the risk of litigation and damage to the company’s reputation,” Baker adds.
Human Error and Downtime
Human error spans many issues, including common mistakes such as:
- Misconfiguration of server hardware, OSes, applications and devices
- Failure to upgrade or right-size servers to accommodate more data- and compute-intensive workloads like virtualization, data analytics, artificial intelligence (AI) and storage
- Failure to upgrade outmoded applications that are no longer supported by the vendor
- Failure to keep up to date on patches and security
“Misconfiguration issues are the ‘common cold’ of the data center,” Baker says. “They invariably arise as a result of human error when IT departments are overburdened. And that happens frequently.” Additionally, many organizations no longer send their IT and security administrators for training and certification. This can perpetuate human error, given the proliferation and complexity of configuring and deploying new technologies.
Examples of common misconfiguration errors include:
- Failure to adhere to best practices during initial hardware and application setup
- Improperly configuring high availability failover servers in web farms and server clusters
- Configuration files that contain incorrect information
Adding to the list are classic “dumb” mistakes such as unplugging power cords, leaving a crucial port open, failing to disable a guest account, not adjusting the temperature in the data center, or forgetting to monitor server or disk capacity until the server or disk fails or the machine’s performance slows drastically.
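Forgetting to monitor disk capacity is one of the easier mistakes to guard against. The sketch below is a minimal illustration, assuming warning and critical thresholds of 80 and 90 percent; those thresholds and the function names are hypothetical choices for this example, not recommendations from the survey or from Baker.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of disk space used on the volume holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def capacity_status(used_percent, warn_at=80.0, critical_at=90.0):
    """Classify a usage percentage against assumed alert thresholds."""
    if used_percent >= critical_at:
        return "critical"
    if used_percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    used = disk_usage_percent(".")
    print(f"Disk usage: {used:.1f}% ({capacity_status(used)})")
```

Run on a schedule, a check like this could raise an alert well before a volume fills, rather than after performance has already degraded.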
Human error isn’t limited to technology concerns: It also encompasses lapses in judgment by C-level executives and IT departments who opt to make decisions based strictly on budgetary concerns. Every enterprise has finite financial resources and caps on its annual expenditures, but it’s a big mistake to be “penny wise and pound foolish” when it comes to the core server and main line-of-business applications.
Examples of such short-sighted decision making include:
- Failure to allocate the appropriate capital and operational expenditure funds for new equipment purchases based on a two-to-three-year upgrade cycle
- Failure to implement upgrade policies and procedures to address issues like cloud computing, mobility, remote access and bring your own device
- Failure to construct and enforce strong computer and network security policies
- Failure to calculate total cost of ownership (TCO) and ROI
- Failure to track hourly downtime costs
- Failure to track and assess the impact of service-level agreements and regulatory compliance issues (e.g., Sarbanes-Oxley or HIPAA)
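One way to start tracking the impact of a service-level agreement is to translate its availability target into a downtime budget. The sketch below assumes the SLA is expressed as an availability percentage over a period measured in days; the function names and the 99.9 percent example are illustrative assumptions, and actual SLA terms vary by contract.

```python
def downtime_budget_minutes(availability_percent, period_days=365):
    """Return the minutes of downtime allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def budget_remaining(availability_percent, downtime_so_far, period_days=365):
    """Return the minutes of downtime left before the target is missed.

    A negative result means the target has already been breached.
    """
    budget = downtime_budget_minutes(availability_percent, period_days)
    return budget - downtime_so_far


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.9)
    print(f"99.9% over a year allows about {budget:.1f} minutes of downtime")
```

Comparing logged outage minutes against this budget gives a running view of how close the organization is to breaching its commitments.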
Security and Downtime
Security and downtime go hand in hand. The proliferation of mobile devices, endpoints and IoT deployments means that the attack surface has grown commensurately. Businesses now have many more potential vulnerabilities and entry points into their systems, servers, applications and devices. Security professionals and IT administrators have much more to monitor and manage.
External hackers are now more organized and sophisticated—they pick and choose their targets and hammer away until they succeed. Security professionals are fond of noting that “Corporations have to ensure security all of the time, while a hacker only has to be right once.”
Internal threats, such as disgruntled employees or corporate espionage, also present a real danger.
Corporations must also track and repel an assortment of ever-more pernicious and pervasive security threats, including viruses, ransomware, malware, phishing scams, bots, trojans, brute force attacks, denial-of-service attacks, and attacks on firewalls, switches and unified communication systems.
Another thorny aspect of security is that some vendors—particularly niche market application vendors—sometimes take weeks or even months to acknowledge and respond to security flaws in devices and applications. The longer the lag time before the vendor releases a patch, the higher the risk that organizations may experience a successful penetration.
Software and Hardware Failures
Software and hardware failures still cause unplanned downtime, although technology advances in the last decade have increased the inherent reliability of software, server hardware and its underlying components. For instance, IBM POWER9 processor-based servers incorporate embedded security, reliability, on-chip analytics and predictive management/maintenance features. These capabilities are designed specifically to handle compute-intensive workloads such as databases, data analytics and AI. And built-in security helps organizations identify and thwart security threats.
Another common cause of unplanned server downtime is the failure of an uninterruptible power supply (UPS), which can take down the servers and applications that depend on it.
Hard drive failures, particularly in aging hardware (over three and a half years old), are another persistent cause of server and application crashes. Companies that overload their servers without retrofitting or upgrading the hardware to accommodate larger application workloads are asking for trouble. Even if the hard drive doesn’t crash, aged and inadequate servers can result in performance bottlenecks that slow response time, over-utilize system resources and create sporadic system failures before crashing an application or system entirely.
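A simple inventory check can flag hardware that has passed the three-and-a-half-year mark mentioned above. The sketch below is illustrative only: the server names, in-service dates and the dictionary-based inventory format are hypothetical, and a real deployment would pull this data from an asset management system.

```python
from datetime import date

# Age threshold taken from the text: over three and a half years old.
AGING_THRESHOLD_YEARS = 3.5


def age_in_years(in_service, today):
    """Approximate age in years between two dates."""
    return (today - in_service).days / 365.25


def aging_servers(inventory, today):
    """Return the names of servers older than the threshold, sorted."""
    return sorted(
        name
        for name, in_service in inventory.items()
        if age_in_years(in_service, today) > AGING_THRESHOLD_YEARS
    )


if __name__ == "__main__":
    # Hypothetical inventory: server name -> date placed in service.
    inventory = {
        "db01": date(2014, 1, 15),
        "web01": date(2017, 6, 1),
    }
    print(aging_servers(inventory, date(2018, 9, 1)))
```

Servers on the resulting list are candidates for closer capacity monitoring or for an upgrade before workloads grow further.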
Unplanned downtime can also occur when IT administrators update drivers, firmware and applications—especially if the new software or drivers contain myriad new features. It’s recommended that you fully test and debug new features and drivers in a pilot network before putting them into full production.
Natural and Man-made Disasters
The United States has experienced 233 weather and climate disasters (e.g., droughts, wildfires, hurricanes and floods) since 1980 in which overall damages/costs reached or exceeded $1 billion. This is according to statistics compiled by the National Oceanic and Atmospheric Administration’s (NOAA) National Centers for Environmental Information (NCEI), which tracks U.S. weather and climate events. NOAA’s NCEI Report, “U.S. Billion-Dollar Weather & Climate Disasters 1980-2018,” says that the total cost of these 233 events exceeds $1.5 trillion (bit.ly/2nXan9d).
NOAA’s NCEI statistics show that from 1980 through 2017 the U.S. averaged six weather- and climate-related disasters costing $1 billion or more per year. However, the frequency and cost of these natural disasters are climbing. From 2013 through 2017, the annual average of U.S. weather-related disasters costing over $1 billion each nearly doubled to 11.6 events, according to NOAA’s NCEI. As of July, six natural disasters have occurred with losses exceeding $1 billion each.
These events have left significant and prolonged data center outages in their wake. During 2017 alone, 16 natural and climate-related disasters occurred. Three of them (Hurricanes Maria, Irma and Harvey) caused a total of $265 billion in damages across six states (Alabama, Florida, Georgia, Louisiana, South Carolina and Texas) and Puerto Rico, according to NOAA and the Federal Emergency Management Agency. These three hurricanes crippled the power grid and downed entire network ecosystems for weeks or months; some locales like Puerto Rico have yet to recover.
Man-made disasters can also cause unplanned downtime. After the Sept. 11, 2001, terror attack, many businesses in lower Manhattan experienced systems outages ranging from a few hours to several days, depending on backup and DR preparedness, as well as physical proximity to ground zero. A 2002 report jointly prepared by The Federal Reserve, the New York State Banking Department, the Office of the Comptroller of the Currency and the Securities and Exchange Commission emphasized the need for a comprehensive, proactive approach to business continuity and DR (bit.ly/2BIpnmj). Among the findings:
- Contingency planning at many institutions generally focused on problems with a single building or system. Some firms installed backup facilities in nearby buildings, failing to consider that damage could disrupt an entire business district, city or region.
- Many firms believed they had achieved redundancy in their communications systems by contracting with multiple telecommunications providers and diversifying routing, only to discover that all of the lines traveled through several single points of failure.
- The Sept. 11 terror attack underscored the interdependence among financial system participants, irrespective of geographic location. The report stated, “While organizations located outside the New York City area were affected to a much lesser degree than were those within it, many felt the effects of the disaster. Most lost connectivity to banks, broker-dealers and other organizations in lower Manhattan, which impeded their ability to conduct business and determine whether transactions had been completed as expected. Additionally, some customers were affected by actions of institutions with which they did not even do business, when funds or securities could not be delivered due to operational problems at other institutions.”
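The second finding, redundant lines that turn out to share physical infrastructure, can be checked for once each provider's route is documented. The sketch below is a minimal illustration, assuming each route is recorded as an ordered list of hops from the office to the carrier; the provider names, hop names and route format are hypothetical and do not come from the report.

```python
def shared_intermediate_hops(routes):
    """Return the intermediate hops that appear on every route, sorted.

    Each route is a list of hops; the first and last entries are treated
    as endpoints and ignored. Any hop returned is a potential single
    point of failure for all of the supposedly redundant lines.
    """
    hop_sets = [set(route[1:-1]) for route in routes.values()]
    if not hop_sets:
        return []
    return sorted(set.intersection(*hop_sets))


if __name__ == "__main__":
    # Hypothetical routes from two providers that share a conduit
    # and a central office.
    routes = {
        "provider_a": ["office", "conduit-7", "central-office-1", "carrier-a"],
        "provider_b": ["office", "conduit-7", "central-office-1", "carrier-b"],
    }
    print(shared_intermediate_hops(routes))
```

An empty result suggests the routes are physically diverse, at least as far as the documented hops go; any hop that is listed would be worth raising with the providers.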
“You can’t always anticipate natural and man-made disasters but you have to be prepared for them because they can strike at any time,” notes Steve Sommer, CIO at Stromberg & Forbes LLC, a financial services company with offices in New York and Florida. “In the wake of Sept. 11, we made backup/DR and business continuity testing a priority.”