Mention “IT outage” and thoughts turn to super storms — hurricanes, tornadoes or some other natural disaster causing widespread havoc with your critical data. Typically, though, what causes a data disruption is much more mundane: a power failure, say, or a hardware glitch. Or (you can’t make this up) a squirrel, a dropped anchor, or a burning cigarette butt left in the wrong place. They’re unusual, to be sure, but they still disrupt IT operations.
Recognizing this, chief information officers at many businesses, especially enterprises, now treat their disaster recovery programs as more than just IT insurance. They often spend heavily to establish duplicate data-storage systems, frequently located far from the main IT databank, so they can recover their invaluable data swiftly should calamity strike. Yet they also realize that few IT outages (14 percent, by one estimate in the DR Benchmark study) are weather-related.
Consequently, more CIOs recognize they must be resilient when any outage occurs and ensure their organizations can access the applications and services to run their business at peak performance. These CIOs, especially at enterprises, are focusing much more heavily on automating their IT processes and infrastructure, especially as they move more and more of their data to the cloud and virtualized environments.
Cloud computing and virtualization aren’t interchangeable, although the technologies are related. Virtualization is software that makes it possible to run multiple operating systems and applications on the same server at the same time; it is the fundamental technology powering cloud computing. Cloud computing, by contrast, is the service that results from that manipulation of hardware, delivering shared computing resources, software or data.
IT interruptions and failures do occur in the cloud, dispelling the myth that the cloud is invincible. And when they occur, they can prove very serious because cloud services often serve far more people than locally run operations. So they attract great attention.
In 2014, for instance, notable IT outages in the cloud disrupted operations at Dropbox twice in two months, Google three different times, Samsung’s Smart TV for 4½ hours, Adobe for about 28 hours and Microsoft twice in two days, among others. And none of these incidents were sparked by a natural disaster such as the devastating Hurricane Sandy that disrupted hundreds of IT operations along the northern Atlantic Coast region in 2012.
Most Common Causes of IT Outages
What are the most common causes of IT outages, other than weather-related and other natural disasters? In 2013, the Ponemon Institute and Emerson Network Power updated a previous study of data center outages, including their root causes. The most frequently cited root causes include battery or other equipment failure at, or exceeded capacity of, the uninterruptible power supply (UPS); human error; cyberattack; IT equipment failure; flooding or other water-related problems; heat-related computer room air conditioning (CRAC) failure; and power distribution unit or circuit breaker failure.
Of the 450-plus U.S. data centers surveyed, 83 percent said they knew the root cause of their unplanned outages, and 52 percent believed all or most of the unplanned outages could have been prevented.
Such outages can prove costly. In their study, the Ponemon Institute and Emerson Network Power found these average downtime costs of an incident:
- IT equipment failure: $959,000
- Cybercrime: $882,000
- UPS system failure: $678,000
- Water, heat or CRAC failure: $517,000
- Generator failure: $501,000
- Weather incursion: $436,000
- Accidental/human error: $380,000
Automating DR Strategy in the Cloud
The prevalence of commonplace causes of IT outages, as well as the hefty expense outages can trigger, underscores why more CIOs are automating their disaster recovery systems and infrastructure. They seek to make data recovery agnostic to the technology used in their recovery strategy. They achieve this by determining the best solution for each situation (cloud, virtualization or otherwise) and writing procedures that can be executed manually or automatically when disaster strikes.
This is critical because not all data must be recovered immediately. So-called Tier 1 applications, the most vital data apps, may need to be up and running in four hours or less, while eight hours or even longer may suffice for Tier 2 applications. It’s analogous to assembling furniture: each piece is unique, so you must refer to the instructions to follow the correct assembly procedure.
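The tiering idea above can be sketched in a few lines of code. This is a minimal, hypothetical example; the application names, tiers and RTO values are invented for illustration and the ordering logic is simply "most critical tier first, tightest recovery time objective first within a tier."

```python
# Hypothetical sketch: ordering application recovery by tier and RTO.
# Names, tiers and RTO hours are illustrative, not from any real plan.
from dataclasses import dataclass

@dataclass
class App:
    name: str
    tier: int        # 1 = most critical
    rto_hours: int   # recovery time objective

apps = [
    App("reporting", 2, 8),
    App("orders-db", 1, 4),
    App("intranet", 2, 8),
    App("payments", 1, 2),
]

# Recover Tier 1 first; within a tier, tightest RTO first.
recovery_order = sorted(apps, key=lambda a: (a.tier, a.rto_hours))
for app in recovery_order:
    print(f"Tier {app.tier}: recover {app.name} within {app.rto_hours}h")
```

In practice the tier assignments would come from a business-impact analysis rather than a hard-coded list, but the principle is the same: recovery effort is scheduled by criticality, not treated uniformly.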
This explains why a strong asset-management system is the first step to good disaster recovery. Reflecting this, more enterprises are using an automated discovery system and dependency mapping tools to chart an organization’s entire IT infrastructure. Such a configuration management database, or CMDB, contains all relevant information about the information system’s components and their relationships. It serves as a powerful blueprint.
After taking this initial step, an enterprise or its outside vendor gives this blueprint to a solution architect, who develops a fully automated procedure-development plan that leverages the CMDB. In essence, this becomes the 3D printer of IT by creating an application map: it reads the model provided by the CMDB and creates an execution workflow that is readable in XML format. This approach saves considerable time; infrastructure build times decline by as much as 90 percent.
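To make the CMDB-to-workflow step concrete, here is a minimal sketch under stated assumptions: the CMDB fragment, component names and XML schema below are all invented for illustration, and real discovery and orchestration tools define their own formats. The key idea it demonstrates is that once dependencies are captured, a valid recovery ordering and its workflow document can be generated rather than hand-written.

```python
# Hypothetical sketch: turning a CMDB-style dependency model into an
# XML execution workflow. Component names and schema are illustrative.
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter

# Minimal CMDB fragment: each component lists what it depends on.
cmdb = {
    "web-frontend": ["app-server"],
    "app-server": ["database"],
    "database": [],
}

# Dependencies must be restored before their dependents.
order = list(TopologicalSorter(cmdb).static_order())

workflow = ET.Element("workflow", name="dr-recovery")
for i, component in enumerate(order, start=1):
    ET.SubElement(workflow, "step", seq=str(i), target=component)

print(ET.tostring(workflow, encoding="unicode"))
```

Because the workflow is derived from the dependency model, updating the CMDB automatically updates the recovery procedure, which is where most of the claimed time savings would come from.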
For virtualized environments, the benefits range from cost efficiency to resilient data centers with unparalleled uptime and recovery periods. Shaklee Corporation has seen these benefits firsthand from its vendor, Sungard Availability Services.
“In addition to reducing our recovery time objective, working with an IT managed services provider allows us immense improvement in our disaster recovery coverage,” said Doug Brown, IT quality assurance manager at the dietary-supplements maker. “We used to operate with 77 critical boxes and depended on tapes to be sent to the recovery site. This would take us more than 72 hours. After adopting a virtualized environment, we are able to begin the recovery process instantly and be up and running faster.”
Another customer, Avery Dennison Corp., a global packaging leader, relies on a third-party provider to streamline its IT department, allowing the staff to focus on business objectives beyond technology support.
“Building a virtualized environment enables us to meet business requirements more effectively,” said Tom Webster, Avery Dennison’s Corporate Disaster Recovery Program manager. “Through deployment, we reduced 556 servers to 139. Additionally, a virtual environment streamlined the design for our DR capability and proved a key component in meeting DR requirements.”
Structured Change Process Key to Protection
What’s the best internal approach to managing IT change, especially in a virtual environment? A centralized operations team, including the IT leadership team, should manage the process. The group should assess the risks inside the environment using a risk-and-impact model for planning. That model covers situations that have a low risk of failure but a huge impact if it occurs, as well as situations with a high risk of an outage but a lower impact.
For instance, a routing change rarely triggers an outage, but if something does go wrong with the change, the impact is significant. On the other hand, a major software rebuild on an individual server can cause lots of problems, but the impact on the environment is low because it involves only one leg of the entire IT network.
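A risk-and-impact model like the one described can be sketched as a simple classifier. This is a hypothetical illustration: the 1–5 scales, thresholds and category labels are invented, not drawn from any standard, but the two examples match the article’s routing-change and server-rebuild cases.

```python
# Hypothetical sketch of a risk-and-impact model for change planning.
# Scales (1-5), thresholds and labels are illustrative only.

def classify_change(failure_risk: int, impact: int) -> str:
    """Classify a change by its risk of failure and its blast radius."""
    if failure_risk >= 4 and impact >= 4:
        return "emergency-only window"
    if impact >= 4:
        return "low risk, high impact: full rollback plan"
    if failure_risk >= 4:
        return "high risk, low impact: isolate and test"
    return "standard change"

# A routing change: rarely fails, but touches the whole network.
print(classify_change(failure_risk=1, impact=5))
# A server rebuild: often problematic, but affects one leg only.
print(classify_change(failure_risk=4, impact=2))
```

The point of the model is that neither axis alone decides the process: a safe-looking change with a wide blast radius gets as much scrutiny as a risky change with a narrow one.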
If an issue emerges, an organization should possess a structured set of response criteria of what to do about it – an incident-management system – and understand those responses very well. The roles of participants in the response process also should be well-defined.
What to Expect from a Third-Party Cloud/Virtualization Services Provider
Organizations that retain a third-party cloud and/or virtualization services provider should make sure their contracts include various specifics. For instance, buying cloud services doesn’t mean that DR services are included. In addition, the contract should specify what conditions constitute a disaster. Don’t simply let a cloud services provider dictate when a disaster warrants executing its DR plan.
Further, make sure that the service-level agreement that applies to cloud services fits the organization’s specific needs. Generally speaking, an organization should obtain that level of granularity for its cloud virtualization services, clearly determining which DR and business-continuity services are included.
Testing is another critical element to consider when employing a cloud services provider, and most organizations with cloud computing do sign with a third-party vendor. Many cloud service providers emphasize the importance of testing annually, at a minimum. Many companies, especially enterprises, test quarterly, dividing their operations and focusing on certain ones each quarter so that all operations are tested once a year. And be sure the cloud services vendor provides analysis reports of the testing.
So What Triggered Those Strange Outages?
So, how did that squirrel, dropped anchor and burning cigarette butt create outages?
In 2010, the squirrel chewed through critical wires used to transfer communications and took out half of Yahoo’s Santa Clara, Calif., data center. As for the dropped anchor, in 2008, a ship dropped its anchor on one of the undersea cables carrying traffic from continent to continent and caused downtime for some regions.
In Western Australia, a Perth data center was shut down for an hour after its smoke-detection system detected smoke in the center. The smoke came from a smoldering mulch-filled garden bed alongside the facility’s outside wall, and a burning cigarette butt was the likely cause.