In today’s “always connected” world, the level of acceptable downtime for enterprise software applications is rapidly approaching zero. No longer is “99.99% available” an acceptable expectation. For many corporations, enterprise applications are mission-critical assets that must be available 24×7, 365 days a year.
However, according to a recent survey, more than 90% of responding data center operators reported that their data center had at least one unplanned outage in the past two years. The study, sponsored by Emerson Network Power and conducted by the Ponemon Institute, surveyed 584 individuals responsible for data center operations in the US. Respondents represented organizations in a wide range of industries, and their facilities were evenly distributed geographically.
To be sure, there are many possible causes of application downtime. But failures caused by the equipment itself have been nearly eliminated by the virtualization of the physical IT infrastructure. Because single points of failure in servers, storage and networking systems have been minimized or removed, hardware failures rarely cause application downtime today. Software has seen similar improvements in both stability and self-recovery capabilities. The effort to abstract applications from the underlying IT equipment has culminated in the “Software Defined Data Center (SDDC)”, which has indeed delivered tremendous improvements in software reliability by virtualizing all compute, storage and network resources and delivering them as a service.
In addition, Data Center Infrastructure Management (DCIM), also known as “Software Defined Infrastructure”, has emerged to provide real-time visualization and dynamic cooling adjustments. These capabilities help minimize outages, reduce the inefficiencies inherent in any power distribution system, and improve the Power Usage Effectiveness (PUE) of the data center facility.
So, now that we have SDDC and DCIM, what is causing all these failures? While IT infrastructure reliability has increased significantly, external and internal power problems appear to be responsible for more than 50% of application failures. In the Emerson/Ponemon survey, the vast majority of respondents (85%) reported that they had lost primary utility power to their facility at least once during the past two years. Almost half reported only one or two power interruptions, but virtually the same percentage reported three to ten outages, and total data center shutdown was not uncommon: 73% of those who reported an unplanned outage had one or two complete facility shutdowns in the preceding 24 months, and the remaining 27% had even more, with three to five complete outages.
Utility-supplied power has never been 100% reliable. According to data tracked and analyzed by Lawrence Berkeley National Laboratory (LBNL), the US electrical grid has averaged about one major blackout per decade for the past five decades. In 2012, LBNL conducted a more detailed examination of readily available data from approximately half of all US utility providers to analyze reliability over the preceding decade. Among that sample, the average duration and frequency of power interruptions had been increasing at a rate of approximately 2% annually. There are probably many reasons for this increase, but the rapidly widening deployment of intermittent renewable energy resources and the dramatic increase in weather-related power outages in recent years (according to the US DOE) may be contributing factors.
So while the “software defined data center” and “software defined infrastructure” provide important mitigation against problems within a data center, they cannot reach the desired goal of 100% uptime. Most companies and organizations maintain multiple, geographically dispersed data centers to protect against problems at any single site, assuming that if something goes wrong in one, another will take over instantaneously. But failover procedures require error-prone manual intervention 80 percent of the time, and a study by Symantec found that, even before manual intervention comes into play, 25 percent of disaster recovery failover tests fail completely at the outset. In addition, while 99.999% uptime allows the facility only about five minutes of downtime per year, the consequences of such an outage for the application can be far larger. Recovering the facility from a power outage can go quickly, but getting all the applications back up can be very time consuming and can require many manual steps, and it is the availability of the application that really counts.
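The arithmetic behind these “nines” is simple but worth spelling out, because it shows how small the facility’s downtime budget really is compared with typical application recovery times. The short sketch below is purely illustrative (it is not drawn from the survey data); it just converts an availability target into the downtime it allows per year.

```python
# Illustrative only: convert an availability target into the downtime
# budget it allows over one year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Maximum minutes of downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for availability in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{availability:.3%} uptime -> "
          f"{downtime_budget_minutes(availability):7.1f} min/year")

# 99.999% works out to roughly 5.3 minutes per year for the facility;
# restoring every application after an outage can easily take far longer.
```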
SDDC and DCIM solutions are necessary for maximizing application reliability in an energy-efficient data center. But they are not the end-state, as neither can abstract the applications from power. It should come as no surprise, then, that power is the next (and last) resource in the reliability chain, waiting to become “software-defined”.
Software Defined Power is an emerging answer to application-level reliability issues caused by power problems. Like the broader SDDC, it is about creating a layer of abstraction that isolates the application from local power dependencies. Because power cannot be moved the way virtual servers, storage and networking can, Software Defined Power instead moves the application load, shifting the associated power consumption with it and thereby increasing application reliability. It effectively completes the SDDC, further reducing the risk of outages by matching the IT application load to the most reliable and cost-effective source of power at any given time, and it does so not just in a disaster situation but continuously and dynamically, in accordance with application service levels.
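To make the placement decision concrete, here is a minimal sketch of how such a layer might choose where an application load should run. The site names, scoring weight and the best_site function are hypothetical, assumed purely for illustration; in a real deployment the inputs would come from DCIM, power-quality telemetry and energy-market data rather than hard-coded values.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    power_reliability: float  # 0..1, e.g. derived from utility and UPS telemetry
    energy_price: float       # $/kWh for the current hour
    spare_capacity_kw: float  # headroom available for migrated workloads

def best_site(sites: list[Site], required_kw: float, price_weight: float = 0.3) -> Site:
    """Pick the site with the most reliable, cost-effective power that can
    still absorb the workload. The scoring weight is illustrative only."""
    candidates = [s for s in sites if s.spare_capacity_kw >= required_kw]
    if not candidates:
        raise RuntimeError("no site has enough spare capacity for this workload")
    # Higher reliability is better; a higher energy price counts against a site.
    return max(candidates,
               key=lambda s: s.power_reliability - price_weight * s.energy_price)

sites = [
    Site("west", power_reliability=0.97, energy_price=0.11, spare_capacity_kw=800),
    Site("east", power_reliability=0.99, energy_price=0.14, spare_capacity_kw=500),
]
print(best_site(sites, required_kw=400).name)  # -> "east" in this example
```

Run continuously against live inputs, a decision like this is what lets the load follow reliable, inexpensive power rather than waiting for a disaster to force a failover.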
In addition to increasing availability by affording greater immunity from unplanned downtime caused by volatile power sources, shifting application workloads across data centers also makes it easier to schedule the planned downtime needed for routine maintenance and upgrades within each data center. Together, these improvements maximize application uptime without adversely affecting service levels, performance or quality of service.
Obviously this cannot be done in isolation; any solution must integrate seamlessly with all the other software defined components throughout the data center, and it must be fully automated to provide maximum reliability. A Software Defined Power implementation must bring together application monitoring, IT management, DCIM and power measurement, adding enterprise-scale automation, analytics and energy market intelligence to produce a truly integrated solution.
While the cost savings that result from avoiding application downtime are real and substantial, they are difficult to quantify. However, the ROI on an investment in Software Defined Power can be measured in other ways; the energy savings alone that a deployment enables make a very compelling case. Typically these savings come from temporarily turning off spare or underutilized equipment, as part of the same routine automation that shifts applications from one data center to another.
Automatically shifting some or all of an application workload to a distant data center makes it possible to shed that load locally, and this is a source of significant savings: powering down some or all of the local servers until they are needed again can reduce the energy required, including for cooling, by up to 50 percent. Additional savings can be achieved at the “active” data center(s) during non-peak periods by continuously matching server capacity to the actual application workload, provided some spare capacity remains available to handle unexpected spikes in demand.
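A sketch of that capacity-matching idea, with entirely assumed numbers for server throughput and the headroom margin, might look like the following: size the active pool to the measured workload plus a buffer for spikes, and treat everything else as a candidate for powering down.

```python
import math

def servers_to_keep_on(workload_rps: float, per_server_rps: float,
                       headroom: float = 0.25, minimum: int = 2) -> int:
    """Servers to keep powered on: measured demand plus a spare-capacity
    margin for unexpected spikes. All figures here are illustrative."""
    required = workload_rps * (1.0 + headroom) / per_server_rps
    return max(minimum, math.ceil(required))

total_servers = 40
active = servers_to_keep_on(workload_rps=9000, per_server_rps=400)
print(f"keep {active} servers on, power down {total_servers - active}")
# -> keep 29 servers on, power down 11 (until demand rises again)
```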
The ability to seamlessly and automatically move application workloads from one data center to another provides one additional way to save, or more accurately to earn, money: by participating in lucrative Demand Response (DR) and ancillary service programs. Under these programs, electric utilities make substantial cash payments to organizations willing and able to reduce power consumption during periods of peak demand on the grid. During a DR event, which usually occurs late in the afternoon on hot summer days, a Software Defined Power solution can shift the application workload to a data center in a different time zone, temporarily reducing local power consumption. On top of the savings from the power not consumed, the utility pays extra for doing so.
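The economics of a single event can be approximated with a back-of-the-envelope calculation like the one below. The curtailed load, event length, tariff and DR payment rate are all hypothetical example values (actual program terms vary widely by utility and market), and the sketch ignores the cost of the power consumed at the receiving data center, which is typically cheaper off-peak.

```python
def dr_event_value(curtailed_kw: float, event_hours: float,
                   local_tariff_per_kwh: float, dr_payment_per_kwh: float) -> float:
    """Rough value of one demand-response event: avoided local energy cost
    plus the program payment. All inputs are assumed example values."""
    curtailed_kwh = curtailed_kw * event_hours
    return curtailed_kwh * (local_tariff_per_kwh + dr_payment_per_kwh)

# Example: shed 500 kW for 4 hours by shifting the workload to another site.
value = dr_event_value(curtailed_kw=500, event_hours=4,
                       local_tariff_per_kwh=0.15, dr_payment_per_kwh=0.50)
print(f"${value:,.0f} for the event")  # -> $1,300 for the event
```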
The final area of cost savings derives from the fact that when power is most available and dependable, it is also most affordable. Because this mostly occurs at night, shifting application workloads to “follow the moon” means an organization is always paying the lowest rate for electricity, and is also using less of it, because the facility can be cooled, in whole or in part, with outside air.
The total cost savings make a compelling case. It is often possible to recover the investment needed to implement a Software Defined Power solution within one or two years, while achieving all the benefits of isolating the application from power-related problems, an important step toward ultimate application reliability and 100% availability.
Clemens Pfeiffer is the CTO of Power Assure and is a 25-year veteran of the software industry, where he has held leadership roles in process modeling and automation, software architecture and database design, and data center management and optimization technologies.