US.CA Cluster Hardware issue

Incident Report for Networks Status

Postmortem

Post-Mortem Report: US.CA Server Outage

Incident Duration: Approximately 8 hours 20 minutes
Incident Start: Nov 07, 2024 - 16:08 PST
Resolution Time: Nov 08, 2024 - 00:27 PST

Summary of the Incident

On November 7, 2024, our US.CA server suffered an unexpected outage caused by multiple hardware failures: a RAID array failure, a motherboard malfunction, and a power supply failure of unknown origin. Together, these failures caused a complete service disruption. Recovery required mounting new hardware and restoring data from both recent backups and the failed hardware.

Incident Timeline

(16:08 PST)
Status: Identified
Description: The US.CA server experienced a series of hardware failures, including RAID, motherboard, and power supply issues. We began diagnosing the root cause and started the restoration process.
Action Taken: A new server was set up to expedite data recovery, prioritizing recent backups from November 3-4 to bring critical services back online.

(17:38 PST)
Status: Identified
Description: Data extraction from the failed hardware began, with an estimated transfer time of 6-7 hours under optimal conditions.
Action Taken: Data transfer and recovery tasks were initiated for all affected services.

(19:28 PST)
Status: Identified
Description: New hardware was mounted, and configuration and upload of the 2024-11-06 backup were underway. Data recovery from the failed hardware continued in parallel so that all data would be fully up to date once it completed.
ETA: 3 hours for service restoration; 5 hours for full data recovery.
Action Taken: Backup data was restored to get services online while efforts to retrieve data from the failed drive continued.

(22:45 PST)
Status: Monitoring
Description: Service restoration in progress; websites began to come back online as the data restoration continued.
Action Taken: The team monitored and managed the restoration process, bringing services online as their data was restored.

(00:27 PST, Nov 8)
Status: Resolved
Description: Service was restored, with a few sites still pending because they require data from the failed hardware. Owners of these sites were contacted to coordinate further recovery efforts.
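The timeline crosses midnight, so the elapsed times are easy to misread. The short Python sketch below recomputes the total duration and compares the two ETAs given at 19:28 with what was later reported. It is an illustration only: the variable and helper names are hypothetical, PST is assumed to be a fixed UTC-8 offset, and treating the 22:45 update as the service-restoration point and the 00:27 update as the data-recovery point is an assumption, since a few sites were still pending at resolution.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST modeled as a fixed UTC-8 offset.
PST = timezone(timedelta(hours=-8), "PST")


def at(day, hour, minute):
    """Build a November 2024 timestamp in PST (hypothetical helper)."""
    return datetime(2024, 11, day, hour, minute, tzinfo=PST)


def minutes(delta):
    """Express a timedelta in whole minutes (negative means early)."""
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline above.
start = at(7, 16, 8)        # outage identified
eta_posted = at(7, 19, 28)  # update that stated the 3 h and 5 h ETAs
monitoring = at(7, 22, 45)  # sites began coming back online
resolved = at(8, 0, 27)     # incident marked resolved (next day)

duration = resolved - start
service_eta = eta_posted + timedelta(hours=3)
recovery_eta = eta_posted + timedelta(hours=5)

print("Total duration:", duration)
print("Service ETA offset (min):", minutes(monitoring - service_eta))
print("Recovery ETA offset (min):", minutes(resolved - recovery_eta))
```

Under these assumptions the outage lasted 8 hours 19 minutes, sites began returning 17 minutes after the 3-hour ETA, and the incident was marked resolved 1 minute before the 5-hour ETA.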

Root Cause Analysis

The outage was caused by a cascade of hardware failures:

RAID Array Failure: A critical failure in the RAID array made the stored data inaccessible.
Motherboard Malfunction: The malfunction made the server unstable and required immediate hardware replacement.
Power Supply Failure: The power supply failure compounded the problem and left the server unresponsive.

The combination of these failures required a complex recovery involving new hardware setup and extensive data restoration.

Resolution and Recovery

New Hardware Setup: A new server was configured to expedite data restoration.
Backup Restoration: Initial restoration focused on recent backups from November 3-4 to bring core services online quickly.
Data Recovery from Failed Hardware: In parallel, we continued to recover any remaining data from the failed hardware to ensure all user data was fully restored.
Communication with Affected Users: Users with specific data needs were contacted individually to arrange further restoration as needed.

Final Status: Services are back online, with the exception of a small number of sites that are awaiting final restoration from the failed hardware. The affected owners have been notified directly.

We thank all customers for their patience and understanding throughout this incident. Please don’t hesitate to reach out with any further questions or concerns.

Posted Nov 08, 2024 - 04:49 PST

Resolved

Service Update: Issue Resolved

We’re pleased to inform you that the issue has been resolved, and services are now fully online.

A small number of sites may still be missing; we will reach out directly to the respective owners to arrange restoration. Rest assured, the data is intact, but it requires additional time to be fully restored from the failed hardware.

Thank you for your patience and support during this process.

Posted Nov 08, 2024 - 00:27 PST

Monitoring

Service Restoration in Progress

The restoration process has begun, and websites are starting to come back online as we speak. Our team is actively working to ensure that all services are fully operational as soon as possible.

ETA: Sites will continue coming online over the next few hours as restoration progresses.

We appreciate your patience and understanding during this process, and we’ll provide further updates here as needed.

Thank you for your continued support.

Posted Nov 07, 2024 - 22:45 PST

Update

We have successfully mounted new hardware and are currently configuring the system and uploading backup data from 2024-11-06. Our team is working to restore access, and we anticipate that services should be back online within a few hours.

ETA: Approximately 3 hours

Data Recovery in Progress:
We are also continuing data recovery from the previous hardware. Once this process is complete, all systems and data will be fully up to date.

ETA: Approximately 5 hours

We appreciate your patience and understanding during this restoration process. Thank you for your continued support.

Next Update: Approximately 3 hours

Posted Nov 07, 2024 - 19:28 PST

Update

We are currently extracting data from the failed hardware.
Our team is working to restore all affected services as quickly as possible.

Estimated Data Transfer Time: Approximately 6 to 7 hours, assuming ideal conditions.

We appreciate your patience and understanding as we work through this recovery process.
Further updates will be provided here as we make progress.

Thank you for your continued support.

Posted Nov 07, 2024 - 17:38 PST

Identified

Service Outage Update: US.CA Server

We are currently addressing an unexpected outage on our US.CA server caused by a series of hardware issues, including a RAID failure, motherboard malfunction, and power supply failure due to an unknown cause. Our team is fully engaged in resolving this complex situation to restore services as swiftly as possible.

What We’re Doing
To expedite recovery, we are setting up a new machine and have begun the data restoration process. The initial restoration will use recent backups from November 3-4, aiming to get core services up and running quickly. Additional data will be restored gradually over the next few days to ensure complete access.

What to Expect
We anticipate the first phase of restoration to be completed within a few hours. Full data recovery will take longer, but we’ll continue providing updates throughout the day as progress is made.

Thank you for your patience and understanding as we work to resolve these issues. Further updates will follow; please reach out if you have any questions.

Posted Nov 07, 2024 - 16:08 PST

This incident affected: Canada.