(
Author/Source: 伊路网络
Published: 2014/03/06
)
Good news: this outage was fully resolved as of 13:00 on March 7!
The following is an excerpt from the service provider's email after the fault was repaired:
Good news! The disaster is over. 100% of the affected email and database servers are back online!
We've begun our formal investigation into the cause of the problem, and I'll be publishing the post-issue report on our main blog next week.
Thank you all for your patience and understanding during this crisis. Words simply cannot express how sorry I am for letting you down like this. Please accept this apology on behalf of myself, and our entire team. We never want to disappoint you, and we'll do everything possible to prevent an issue like this from happening again.
Note: Apart from unstable access (occasional slowness or interruptions), this incident did not affect data security.
Thank you all for your understanding and support!
******* Earlier notice below *******
Dear customers,
We regret to inform you that, since March 2, a disk system failure at our US IX hosting provider has affected roughly a dozen customer websites hosted on the US IX servers. Starting the afternoon of March 6, access to these sites became unstable. The affected sites fall within the following four IP address ranges (you can look up your website's IP address at www.ip138.com; see also the sketch after the list):
98.130.*.*
69.6.*.*
50.6.*.*
66.116.*.*
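If you prefer, the same check can be done locally. Below is a minimal Python sketch that resolves a domain and tests it against the four ranges above; the example domain is a placeholder, so replace it with your own.

```python
# Minimal sketch: resolve a domain and test whether its IP falls in one of
# the four affected ranges listed above. The domain below is a placeholder.
import socket

AFFECTED_PREFIXES = ("98.130.", "69.6.", "50.6.", "66.116.")

def is_affected(hostname):
    """Resolve hostname and check its IP against the affected prefixes."""
    ip = socket.gethostbyname(hostname)
    return ip, any(ip.startswith(prefix) for prefix in AFFECTED_PREFIXES)

ip, hit = is_affected("www.example.com")  # replace with your own domain
print(ip, "- affected" if hit else "- not affected")
```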
We will continue to follow up on the provider's progress and will switch to backup hosting space if necessary. If any customer feels it is necessary to change hosting space, we will migrate you free of charge.
* This incident does not affect Godaddy US hosting, our US Florida and Texas hosting, Hong Kong hosting, or other locations. If your website's IP address is not within the four ranges above, or your site is accessible normally, please disregard this notice.
We apologize for any inconvenience this has caused. Thank you!
If you have any questions, please contact us via QQ 20508031 or call 0731-86104666.
伊路网络工作室
March 6, 2014
**************** Appendix: excerpt from the service provider's email ****************
I've been over-simplifying this outage by describing the core issue as a "RAID failure." This is, in part, because my system administrators have been so busy trying to bring everything back online that I did not take their time away from restoring service to talk with them in detail about what went wrong. For example, our Manager of Hosting Infrastructure has put in 52 hours over the last 3 days.
Many of our more technically adept customers have been concerned that a simple RAID failure shouldn't cause this amount of chaos in a company such as ours. That's a correct assessment, and I'd like to address it to avoid false speculation.
In short, this is much more complex than a simple "RAID failure." Here's a more technical explanation:
Our Setup:
The affected cluster of servers uses a SAN, which is made up of 5 storage arrays in a RAID configuration. Each storage array consists of 14 enterprise-grade 15,000 RPM SAS iSCSI-connected hard drives with 2 hot spares. This is a '14+2 RAID 50' storage array.
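To make this layout concrete, here is a minimal Python sketch of the array as described; the split of the 14 active drives into two 7-drive RAID 5 spans and the per-drive capacity are assumptions for illustration, since the email does not spell them out.

```python
# Hypothetical model of the storage array described above: 14 active drives
# plus 2 hot spares. The split into two 7-drive RAID 5 spans (RAID 50) and
# the per-drive capacity are assumptions for illustration only.
ACTIVE_DRIVES = 14
HOT_SPARES = 2
SPANS = 2                            # assumed: two RAID 5 spans striped together
DRIVES_PER_SPAN = ACTIVE_DRIVES // SPANS
DRIVE_SIZE_GB = 600                  # assumed per-drive capacity

# Each RAID 5 span gives up one drive's worth of capacity to parity.
usable_gb = SPANS * (DRIVES_PER_SPAN - 1) * DRIVE_SIZE_GB
raw_gb = ACTIVE_DRIVES * DRIVE_SIZE_GB
print(f"Usable: {usable_gb} GB of {raw_gb} GB raw")

# RAID 5 tolerates one failed drive per span, so at most 2 concurrent
# failures across the whole array, matching the "no more than 2 degraded
# drives" threshold described further down.
print(f"Tolerates {SPANS} failed drives (one per span), {HOT_SPARES} hot spares on standby")
```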
What Went Wrong?
During regular integrity scans, the RAID controller for one of the 5 storage arrays in this SAN recognized degraded service on Drive 6. So, as designed, it automatically activated a hot spare, Drive 10, and began rebuilding the RAID array, selecting Drive 0 as the source drive for the parity data used to restore Drive 10. Almost immediately after this rebuild began, Drive 0 failed. The loss of Drive 0 during the sync corrupted the data being built on Drive 10, so the RAID controller flagged Drive 10 as unreliable and kicked it out of the RAID to protect data integrity. Three unusable drives exceeded the fault tolerance of the storage array. This is the point where the outage began; until the source drive (Drive 0) failed, all servers in this cluster were online and functional.
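The failure timeline can be replayed as a small sketch. This is only an illustration of the sequence described above, not the controller's actual firmware logic, and the threshold constant mirrors the "no more than 2 degraded drives" figure mentioned later in the email.

```python
# Illustrative replay of the failure timeline described above; this is a
# sketch, not the RAID controller's actual firmware logic.
MAX_UNUSABLE = 2                     # the array's fault tolerance threshold

unusable = set()                     # drives the controller can no longer rely on

# 1. A routine integrity scan flags Drive 6 as degraded.
unusable.add("Drive 6")

# 2. Hot spare Drive 10 is activated; the rebuild reads parity data from Drive 0.
rebuild_source, rebuild_target = "Drive 0", "Drive 10"

# 3. Drive 0 fails almost immediately, corrupting the data being written
#    to Drive 10, so the controller kicks the spare out as well.
unusable.add(rebuild_source)
unusable.add(rebuild_target)

if len(unusable) > MAX_UNUSABLE:
    print(f"{len(unusable)} unusable drives exceed the tolerance of {MAX_UNUSABLE}: array offline")
```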
How Did We React?
Working with our hardware vendor's tier 4 engineers, we were able to coax both Drive 0 and Drive 6 back into an active, albeit degraded, state. That still left us with 3 drives the controller could not fully trust:
* Drives 0 and 6, in a degraded state
* Drive 10, unusable because of corrupted data
Fault tolerance thresholds allow no more than 2 degraded drives, so we removed Drive 10, convincing the controller that there were only 2 degraded drives (Drives 0 and 6). That satisfied the fault tolerance threshold and allowed the controller to resume operations. This got the SAN back online, and we were able to start transferring your data from the affected volumes to stable ones.
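Continuing the sketch above, the recovery step amounts to getting the count of drives the controller refuses to trust back within that threshold; again, this is only an illustrative model.

```python
# Continuation of the sketch above: the recovery step described here.
MAX_UNUSABLE = 2

degraded = {"Drive 0", "Drive 6"}    # coaxed back into an active-but-degraded state
removed = {"Drive 10"}               # pulled from the array because of corrupted data

# With Drive 10 removed, only the two degraded drives count against the
# threshold, so the controller can resume operations.
if len(degraded) <= MAX_UNUSABLE:
    print("SAN back online; data evacuation from affected volumes can begin")
```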
What's Taking So Long?
Due to the very fragile nature of the array and its dependency on Drive 0 remaining active (it may fail at any time, causing a loss of all data on the array), our hardware vendor's senior tier 4 engineer stated that the evacuation process must be handled one volume (one server) at a time, in sequence. At this point, additional engineers or hardware cannot influence the speed of the recovery process. Evacuating multiple volumes at once would make the process go faster, but could result in another fault and possibly data loss.
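Here is a minimal sketch of that sequential constraint; the volume names and the evacuate stub are placeholders, the point being that only one copy stream touches the fragile array at a time.

```python
# Illustrative sketch of the sequential-evacuation constraint described above.
# Volume names are placeholders; the point is that only one volume (one
# server) is copied off the fragile array at a time.
volumes_to_evacuate = ["volume-01", "volume-02", "volume-03"]  # placeholders

def evacuate(volume):
    """Copy one volume from the fragile array to a stable one (stub)."""
    print(f"Evacuating {volume} ... done")

# One at a time, in sequence: parallel copies would speed things up but add
# load on Drive 0 and raise the risk of another fault and data loss.
for volume in volumes_to_evacuate:
    evacuate(volume)
```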
What's Next?
Over 68% of the servers initially affected by this issue are already back up and running. However, we're still working around the clock to get the remaining servers back online. I urge you to check the status blog for updated ETAs (linked above). Once we're 100% up and running, we will be conducting an in-depth investigation and publishing a post-issue report on our main blog. This report will include the root cause of these issues, as well as documentation outlining the steps we'll be taking to prevent any future incidents of this nature.
In the meantime, I've been personally reading every single one of your comments on the blogs, and can only offer my utmost apologies for this terrible situation.
I know that waiting for your servers to come back up is aggravating, but please trust me when I tell you we're doing everything we can to get your servers back up and running as quickly as possible.