Does your company have a Disaster Recovery Plan? In larger companies, arguably even in the smaller ones, there is some sort of process in place should the fit hit the shan. The Planet, one of the larger hosting providers online, had a chance to practice their processes this weekend. On Saturday, the Houston datacenter, held over from the EV1/Planet merger, suffered an explosion and as of this writing (2:00 AM EST 2008/06/02) the power is still out.
Update: (Original story resides on page two of this article)
“As previously committed, I would like to provide an update on where we stand following yesterday's explosion in our H1 data center. First, I would like to extend my sincere thanks for your patience during the past 28 hours. We are acutely aware that uptime is critical to your business, and you have my personal commitment that The Planet team will continue to work around the clock to restore your service,” said Douglas Erwin in a recent update on the outage. (11:00 PM CDT)
“As you have read, we have begun receiving some of the equipment required to start repairs. While no customer servers have been damaged or lost, we have new information that damage to our H1 data center is worse than initially expected. Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed.”
It is cases like this that prove you simply have to have a solid recovery plan in place. It is easy to assume good policy and planning after the fact, but it is likely that they did not foresee the “walls come tumbling down,” so to speak. According to Erwin, however, they are still working. He wrote that they have figured out a way to give power to Phase 2 (upstairs in H1) and that they plan to restore 6,000 servers in about four to five hours.
The bad news, however, is on the first floor, or Phase 1. “Let me next address Phase 1 (first floor) of the data center and the affected 3,000 servers. The news is not as good, and we were not as lucky. The damage there was far more extensive, and we have a bigger challenge that will require a two-step process. For the first step, we have designed a temporary method that we believe will bring power back to those servers sometime tomorrow evening, but the solution will be temporary. We will use a generator to supply power through next weekend when the necessary gear will be delivered to permanently restore normal utility power and our battery backup system. During the upcoming week, we will be working with those customers to resolve issues.”
Erwin knows, as does the rest of the staff at The Planet, that many companies are depending on this issue being resolved quickly. This article previously focused on disaster recovery as it applied to the outage over at The Planet. It is obvious that their disaster recovery plan is working; the problem is that some clients feel it is working too slowly. This shifts the focus to disaster recovery for the clients of a datacenter that suffers a similar fate.
How would you deal with this? Most of the clients in the H1 datacenter are webhosting companies. Because of an outage caused by their provider, those companies now have to deal with their own customers and offer answers as best they can. Is it fair for those hosting companies to blame The Planet for their losses? Some will say yes, and they have a valid point, but there is also the other side to that coin. Why have these hosting companies not created a disaster recovery plan of their own in the event an outage such as this takes place?
Some would argue costs, as most of the companies cannot afford the costs such an infrastructure would incur. Those companies may have one or two servers and host a few hundred clients. If costs were the only factor preventing those companies from protecting their business assets, namely their customers, then they should consider closing up shop. The customers who have been most vocal are companies who have lost sales, clients, and other revenue because they placed all of their eggs in one basket.
Head over to WebhostingTalk.com, where you can see the good and ugly side of the hosting business. “I'm fed up waiting for more information about these terrifying issues. Communication is poor and every minute my site is down is costing me $1000s in ad revenues. This is probably going to cost me a $10,000,000 sponsorship deal with a major brand advertiser,” one WHT member, C0bra, stated. This is a prime example, if his claims are true, of why a hosting company or business that depends on a website should always have a backup plan. Most of the comments after this post point that out, and some take issue with his claims.
“Let's drop a bomb in your living room and then see you use your fire escape ladder to get off of the non-existent second floor. Sometimes backup plans just don't work out - you can't plan for everything but you can plan for the most common occurrences and I would not say that this is a common occurrence,” said MikeDVB about the outage.
There are several positive comments on WHT surrounding the outage. However, the fact an outage took place at all will cause many to complain.
On another note, there have been some questions, both in the comment section and in my email box, centering on the time frame for updates handed out by The Planet. Starting with the first comment to the original article, there was no long gap on the news centering on the outage.
“Today at approximately 5:45 p.m., a transformer in our H1 data center in Houston caught fire…” This is how the post (second in the update thread on the forum) started. It was posted at 7:36 PM. The first post in the update thread, from Tomy Durden says, “The Planet is currently experiencing an outage which is affecting a number of customers' servers. This issue may also be affecting customers' ability to get through to our call center.” This was posted at 6:29 PM.
If the outage started at 5:45 PM, there was communication less than 45 minutes after the fact (about an hour and a half if you go by the 4:55 PM CDT time mentioned later in the update forum). A little over an hour after that first post, there was another post with more information, including the cause for the outage: “…a transformer in our H1 data center in Houston caught fire, thus requiring us to take down all generators as instructed by the fire department. All servers are down.”
This is incredible speed for a service provider of this size. In under two hours, there were two updates, and a reason was offered explaining the service interruption.
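For anyone who wants to check the timing, here is a minimal sketch of the arithmetic in Python. It uses only the times quoted above and assumes the forum post times are in the same time zone as the reported outage times, which the forum excerpts do not confirm. The function and variable names are illustrative only.

```python
from datetime import datetime

# Clock format used for the times quoted in the article, e.g. "6:29 PM".
FMT = "%I:%M %p"

def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Post times quoted from the update thread (time zone assumed to match).
first_post = "6:29 PM"
second_post = "7:36 PM"

# The two outage start times that appear in The Planet's updates.
gaps = {}
for outage_start in ("5:45 PM", "4:55 PM"):
    gaps[outage_start] = (
        minutes_between(outage_start, first_post),
        minutes_between(outage_start, second_post),
    )

between_posts = minutes_between(first_post, second_post)

for outage_start, (to_first, to_second) in gaps.items():
    print(f"Start {outage_start}: first post after {to_first} min, "
          f"second post after {to_second} min")
print(f"Gap between the two posts: {between_posts} min")
```

Under the 5:45 PM start time, both posts land inside two hours; under the 4:55 PM start time, the second post comes a little under three hours in.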
You can track the latest outage news from The Planet here:
The outage affected 9000 servers and 7500 customers. “This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost,” Kevin Hazard said during his update post on the forums over at The Planet. “We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.”
As fast as The Planet could get them there, vendors, all members of the support team, and senior staff were on site. Kevin’s updates on the issue continued all night, and after he was done, Todd Mitchell stepped up and took over the update process. One of the things you need with a Disaster Recovery Plan (DRP) is communication. According to those who are obviously not familiar with IT, mostly dime-a-dozen hosting resellers, The Planet is not known for effective communications. It is obvious they are effective, as the media, clients, and average people know what happened and why, and what was done about it.
The typical comment looked like this yesterday evening, “The DC can run off of the power provided by the backup generators. Why isn't the DC Running off [sic] backup power? I assume the switch was something connected via the street/pole that blew. If it was a "high capacity" transformer/switch why is the entire building in the dark, including every router without an real idea of what's going on 24 hours later?” This comment came from a well-known webhosting forum.
Instead of discussion over IT, policy, and Business Continuity Planning, the forum is mostly complaints about the outage. The fire department ordered power cut; this rendered the instant usage of backup generators moot.
Swa Frantzen over at SANS made an interesting comment on the issue, “I had seen plans for BCP/DRP derail before due to officials stepping in and doing their response to an emergency in their way and not in the way the organization itself had planned it. I think it would be interesting for most of us to actually talk to fire departments and/or police officers on what their normal responses are and take them into account in our plans...”
No one is sure why the fire and explosion happened. Lots of people offer various theories, but those are all well-meaning guesses. The fact is, as soon as disaster struck, The Planet took action.
Here is what they did right:
Communication – It was constant, and timely. They made everyone aware of what they were doing and why.
Assessment – With the help of staff and vendors, The Planet was able to assess their equipment and see what, if anything, was needed to start the process of full recovery. “As you know, we have vendors onsite at the H1 data center. With their help, we’ve created a list of equipment that will be required, and we’re already dealing with those manufacturers to find the gear…”
Priorities – On the forum where updates were posted, The Planet listed four items that were the top priorities for the recovery. It is likely this list was made onsite as the disaster was entering the cleanup phase.
Resource sharing – After the network in the H1 center was assessed, they started a migration process for management controls and other critical functions. They began using resources from their other datacenters to help speed up the process.
If you were to criticize The Planet, it would come under the communications part of their plan. There is no news on their blog or their main page about one of their datacenters being down. While they are using the forums to communicate, the other areas of notice should have some attention as well.
A company disaster plan should start with one simple question. What is it your business needs IT-wise to survive? This is where you start. Then you plan how you will go about recovering data and equipment and getting everything back up quickly. The disaster recovery process should include the entire company and the entire IT team. Each person on the team should have a role in the process. Vendors need to be a part of the planning and implementation process as well. (As witnessed during the outage at the H1 center, vendors were onsite at the same time as the support crew.) Chime in and tell us how you manage your disaster recovery.
The power is still out at The Planet H1 center. The support people on the phone will only say that all hands are working to restore the power infrastructure and that more information will be posted online.