Short description: The server backups took longer than expected, and as web traffic began to grow, the server became overloaded and crashed.
TIMELINE for Thursday, August 2nd, 2012, 10:15-19:20
- 10:15 The first internal monitoring alert arrived
- 10:17 First customer e-mail arrived
- 10:18 First attempt to log in to the server
- 10:18 Acknowledged the customer e-mail
- 10:25 The first SSH and remote console login attempts were unsuccessful
- 10:38 Since the remote connection was not possible, the server was rebooted
- 10:40 The server did not start up after the reboot due to file system consistency errors
- 10:45 MeloTel technicians launched a file system consistency check and repair tool
- 13:30 The file system consistency repair finished
- 13:35 Database issues noticed on some accounts
- 13:40 Started to fix database files
- 13:47 Identified the corrupt databases
- 14:10 Continued to fix DB files
- 15:45 Noticed that tables using the InnoDB engine were corrupted
- 16:50 Made backups of the corrupted InnoDB files
- 17:10 Started recovery of the InnoDB files
- 18:57 Finished the recovery and started testing the databases and all websites and CRMs
- 19:20 Finished testing and notified our customers
ROOT CAUSE ANALYSIS
The backup process took too long and ran over into business hours. With the backups still running as web traffic grew, the server became overloaded and crashed.
CORRECTIVE/PREVENTATIVE ACTIONS
MeloTel has made major improvements to the backup process.
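
As an illustration only, and not a description of the changes MeloTel actually made, one common safeguard against the failure described in the root cause analysis is to have the backup job check the clock between units of work and defer whatever remains once business hours begin. The sketch below assumes a hypothetical backup window of 01:00 to 08:00 and a backup that can be split into independent chunks; the names BACKUP_WINDOW_START, BUSINESS_HOURS_START, within_backup_window and run_backup are invented for this example.

```python
from datetime import datetime, time

# Hypothetical window: these times are assumptions, not MeloTel's schedule.
BACKUP_WINDOW_START = time(1, 0)
BUSINESS_HOURS_START = time(8, 0)


def within_backup_window(now, start=BACKUP_WINDOW_START, end=BUSINESS_HOURS_START):
    """Return True if `now` falls inside the allowed window [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


def run_backup(chunks, now_fn=lambda: datetime.now().time()):
    """Back up chunks one at a time, stopping once the window closes.

    Returns (completed, deferred) so the caller can reschedule the rest.
    """
    chunks = list(chunks)
    completed = []
    for i, chunk in enumerate(chunks):
        if not within_backup_window(now_fn()):
            # Business hours have started: stop and hand back the remainder.
            return completed, chunks[i:]
        # Placeholder for the real backup work on this chunk.
        completed.append(chunk)
    return completed, []
```

With a guard like this, an overrunning backup leaves some data for the next window instead of competing with daytime traffic. The deferred list should be logged or alerted on so that an incomplete backup does not go unnoticed.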