In recent years, data center infrastructure has become significantly more reliable and management practices have improved, so it would be fair to expect that the number of reported downtime incidents is decreasing. But this isn’t the case.
According to a 2018 survey by Uptime Institute, 31% of respondents experienced a downtime incident or severe degradation in the last year and 48% reported at least one outage at their site or at a service provider in the last three years.
Downtime is expensive. It costs both time and money and can have grave consequences for organizations that are not sufficiently prepared. According to Gartner, downtime costs $5,600 per minute on average. This translates to between $140,000 and $540,000 per hour, depending on the organization. Some factors that contribute to the costs associated with downtime include:
- Lost sales. For organizations that do business online, downtime directly results in customers being unable to make purchases, losing potential revenue. If the business is dependent on network availability to deliver a service, downtime makes it impossible to communicate with users.
- Brand reputation. If customers frequently have to deal with outages that prevent them from easily making purchases or using a service, they may stop being customers and share their bad experiences, scaring away potential customers.
- Reduced productivity. Modern businesses are heavily dependent on online communications and services. Without network access, productivity often grinds to a halt as employees lose the ability to get the majority of their work done, production lines shut down, or other aspects of the business are stunted.
- Payouts. Some companies include language in SLA uptime contracts that defines compensation owed in the event of unplanned downtime.
- Lost data. During outages, data can be corrupted and opportunities can be created for cyberattacks that damage data. Data is typically backed up, but the outage can scare customers and shatter their confidence.
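To make the per-minute figure quoted earlier concrete, the arithmetic can be sketched as follows. This is a minimal illustration: the function name and the 45-minute outage are assumptions chosen for the example, and only the $5,600 per minute average comes from the figure cited above.

```python
# Average cost per minute of downtime, taken from the Gartner figure cited above.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage lasting `minutes` minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


hourly = downtime_cost(60)   # one hour at the average rate
example = downtime_cost(45)  # hypothetical 45-minute outage

print(f"Per hour: ${hourly:,}")
print(f"45-minute outage: ${example:,}")
```

At the average rate, an hour of downtime comes to $336,000, which falls inside the hourly range quoted above. Actual costs vary widely by organization, so treat the result as a rough estimate rather than a forecast.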
The number one cause of data center failure is human error. Other common causes are network failure, power outages, UPS system failure, natural disasters, and cyber crimes. Fortunately, there is a solution that helps prevent downtime.
Data Center Infrastructure Management (DCIM) software allows data center managers to avoid unplanned downtime that can cost hundreds of thousands of dollars per outage and wreak havoc on the business. Some of the ways to prevent human error and maximize uptime with DCIM are:
- Manage inlet air temperature and humidity. The temperature and humidity of air at the inlet of cabinets is important because this is the air that flows through the cabinet to remove heat from the equipment. If the inlet air is too warm, the cabinet won’t cool properly. If the air is too humid, there is a risk of corrosion and damaged equipment. And if the air is too dry, there could be a static electrical discharge. All of these can cause costly downtime. DCIM software collects data from environmental sensors in the data center and displays the information in business intelligence dashboards and 3D floor map visualizations to help you monitor your data center environment and identify hot spots.
- Safely increase temperature. Increasing temperatures in the data center can improve energy efficiency, but it comes with the risk of overheating and damaging equipment, resulting in downtime. With DCIM, you can set temperature thresholds and receive alerts when temperatures are outside of your desired range. Similarly, DCIM will help you avoid overcooling to optimize efficiency and reduce energy costs.
- Ensure power redundancy. Due to the increasing demand for computing capacity, data center cabinets are now packed more densely with power-hungry IT equipment. And since data center teams are often focused on fully utilizing existing resources and delaying capital expenses, they may not be aware that a cabinet is overloaded until it’s too late. This makes power redundancy in the event of equipment failure a critical component of any strategy to maximize uptime. DCIM software allows you to run a failover simulation report and identify which cabinets are at risk and which equipment can continue functioning safely if a PDU goes down. Data center managers can use this information to make necessary changes to the loads before there is a real failure.
- Health polling. Ensuring that intelligent PDUs and other devices are operating properly and accessible via your network is important to maintaining uptime. It’s not impossible for equipment to go down without anyone noticing. A technician or engineer may place a PDU into maintenance mode accidentally, neglect to power on new resources, or connect equipment to the wrong ports or with the wrong cables. With DCIM software, you limit the possibility of outages caused by malfunctioning equipment by polling intelligent PDUs and other equipment at user-configurable intervals to ensure that they are accessible. If a device is not reachable, the software alerts you immediately so you are aware of the issue before there is a crisis.
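The failover simulation described in the list above can be illustrated with a small sketch. It assumes each cabinet is fed by two PDUs (A and B) of equal rating, that the full load shifts to the surviving PDU when one fails, and that a PDU should carry no more than 80% of its rating. The `Cabinet` class, the function name, the derating factor, and the sample loads are all hypothetical and do not represent any particular DCIM product's report.

```python
from dataclasses import dataclass


@dataclass
class Cabinet:
    name: str
    load_a_amps: float      # current drawn on feed A
    load_b_amps: float      # current drawn on feed B
    pdu_rating_amps: float  # rated capacity of each PDU


def at_risk_on_failover(cabinets, derating=0.8):
    """Return the names of cabinets whose surviving PDU would be overloaded
    if the other PDU failed and the entire load shifted to it."""
    risky = []
    for cab in cabinets:
        total_load = cab.load_a_amps + cab.load_b_amps
        usable_capacity = cab.pdu_rating_amps * derating  # assumed 80% limit
        if total_load > usable_capacity:
            risky.append(cab.name)
    return risky


# Hypothetical sample data for illustration only.
cabinets = [
    Cabinet("CAB-01", 8.0, 7.5, 30.0),    # 15.5 A total vs. 24 A usable
    Cabinet("CAB-02", 14.0, 13.0, 30.0),  # 27 A total vs. 24 A usable
]

print(at_risk_on_failover(cabinets))
```

In this example, CAB-02 looks healthy while both feeds are up, because each PDU carries less than half its rating, but it would exceed the assumed limit after a failure. That is the kind of hidden risk a failover report is meant to surface before a real outage.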
With DCIM, you can simulate failover and test what-if scenarios with reports that identify available capacity to ensure coverage in case of failure. You can visualize data center and facility health status with a red-yellow-green color-coded health map that provides an at-a-glance view of rack load levels, line currents, and environmental conditions. You can also be alerted to threshold violations with automated emails that enable the quick identification of hot spots and potential trouble spots. With these capabilities, DCIM will help protect your infrastructure in the event of a data center disaster.
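As a rough illustration of how threshold checks can drive a red-yellow-green status and an alert list, consider the sketch below. The function name, the temperature thresholds, and the sample readings are assumptions made for this example; real thresholds should come from your own operating policy and equipment specifications.

```python
def inlet_status(temp_c, warn_low=18.0, warn_high=27.0,
                 crit_low=15.0, crit_high=32.0):
    """Classify an inlet temperature reading (in Celsius) as green, yellow, or red.

    The default thresholds are illustrative assumptions, not vendor values.
    """
    if temp_c < crit_low or temp_c > crit_high:
        return "red"
    if temp_c < warn_low or temp_c > warn_high:
        return "yellow"
    return "green"


# Hypothetical sensor readings keyed by cabinet name.
readings = {"CAB-01": 22.5, "CAB-02": 28.4, "CAB-03": 33.1}

statuses = {cab: inlet_status(temp) for cab, temp in readings.items()}
alerts = [cab for cab, status in statuses.items() if status != "green"]

print(statuses)
print(alerts)
```

Here CAB-01 is within range, CAB-02 crosses the warning threshold, and CAB-03 crosses the critical threshold, so the last two would trigger alerts. The same pattern applies to humidity, rack load, or line current readings by swapping in the relevant thresholds.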