Post-Mortem Report: US.CA Server Outage
Incident Duration: Approximately 8 hours
Incident Start: 16:08 PST
Resolution Time: 00:27 PST (the following day)
Summary of the Incident
On November 7, 2024, our US.CA server suffered an unexpected outage caused by multiple hardware failures: a RAID array failure, a motherboard malfunction, and a power supply failure of unknown origin. Together, these failures caused a complete service disruption that required installing new hardware and restoring data from both recent backups and the failed hardware.
Incident Timeline
- (16:08 PST)
Status: Identified
Description: The US.CA server experienced a series of hardware failures affecting the RAID array, motherboard, and power supply. Immediate actions were taken to diagnose the root cause and begin the restoration process.
Action Taken: A new server was set up to expedite data recovery, prioritizing recent backups from November 3-4 to bring critical services back online.
- (17:38 PST)
Status: Identified
Description: Data extraction from the failed hardware began, with an estimated transfer time of 6-7 hours under optimal conditions.
Action Taken: Data transfer and recovery tasks were initiated, ensuring all affected services would be restored promptly.
- (19:28 PST)
Status: Identified
Description: New hardware was successfully mounted, and configuration and data upload from the 2024-11-06 backup were underway. Data recovery from the failed hardware continued in parallel to ensure complete data integrity.
ETA: 3 hours for service restoration; 5 hours for full data recovery.
Action Taken: Backup data was actively restored to get services online while continuing efforts to retrieve data from the failed drive.
- (22:45 PST)
Status: Monitoring
Description: Service restoration in progress; websites began to come back online as the data restoration continued.
Action Taken: The team actively monitored and managed the restoration process, bringing each service back online as soon as its data was restored.
- (00:27 PST)
Status: Resolved
Description: Service was restored for all but a few sites, which remained pending because they required specific data restoration from the failed hardware. Owners of these sites were contacted to coordinate further recovery efforts.
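For readers who want to check the timeline arithmetic, the following is a minimal sketch that recomputes the total outage duration and the two ETAs stated at 19:28 from the timestamps above. It assumes the incident began on November 7, 2024, that all times are PST, and that the final 00:27 entry falls on the following calendar day. The names TIMELINE and to_datetimes are illustrative only and are not part of any tooling used during the incident.

```python
from datetime import datetime, timedelta

# Assumption: the incident began on 2024-11-07 and every timestamp is PST.
START_DATE = datetime(2024, 11, 7)

# Timeline entries as (HH:MM, status), copied from the report above.
TIMELINE = [
    ("16:08", "Identified"),
    ("17:38", "Identified"),
    ("19:28", "Identified"),
    ("22:45", "Monitoring"),
    ("00:27", "Resolved"),
]


def to_datetimes(entries, start_date):
    """Attach dates to clock times, rolling to the next day past midnight."""
    result = []
    day = start_date
    previous = None
    for clock, status in entries:
        hours, minutes = map(int, clock.split(":"))
        moment = day.replace(hour=hours, minute=minutes)
        # A clock time earlier than the previous entry means midnight passed.
        if previous is not None and moment < previous:
            day += timedelta(days=1)
            moment = day.replace(hour=hours, minute=minutes)
        result.append((moment, status))
        previous = moment
    return result


events = to_datetimes(TIMELINE, START_DATE)

# Total outage: first "Identified" entry to the "Resolved" entry.
total = events[-1][0] - events[0][0]

# ETAs stated in the 19:28 update: 3 hours for service, 5 hours for full data.
eta_base = events[2][0]
service_eta = eta_base + timedelta(hours=3)
recovery_eta = eta_base + timedelta(hours=5)

print("Total outage:", total)
print("Service ETA:", service_eta.strftime("%Y-%m-%d %H:%M"))
print("Full recovery ETA:", recovery_eta.strftime("%Y-%m-%d %H:%M"))
```

Under these assumptions the total comes to 8 hours 19 minutes, which matches the stated duration of approximately 8 hours. The 3-hour service ETA lands at 22:28, shortly before the 22:45 update in which sites began returning, and the 5-hour recovery ETA lands at 00:28, one minute after the 00:27 resolution.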
Root Cause Analysis
The outage was caused by a cascade of hardware failures:
- RAID Array Failure: A critical failure in the RAID array made the stored data inaccessible.
- Motherboard Malfunction: The malfunction made the server unstable and required immediate hardware replacement.
- Power Supply Failure: The failure compounded the other issues and left the server unresponsive.
The combination of these failures required a complex recovery involving new hardware setup and extensive data restoration.
Resolution and Recovery
- New Hardware Setup: A new server was configured to expedite data restoration.
- Backup Restoration: Initial restoration focused on recent backups from November 3-4 to bring core services online quickly.
- Data Recovery from Failed Hardware: In parallel, we continued to recover any remaining data from the failed hardware to ensure all user data was fully restored.
- Communication with Affected Users: Users with specific data needs were contacted individually to arrange further restoration as needed.
Final Status: Services are operational for all but a small number of sites, which are awaiting final restoration from the failed hardware. The owners of those sites have been notified directly.
We thank all customers for their patience and understanding throughout this incident. Please don’t hesitate to reach out with any further questions or concerns.