Netherlands Datacenter Outage (Bulletproof Location) — Multiple Racks Impacted

Incident Report for Networks Status

Postmortem

Incident ID: AMS1-6f44sjkbj8hx
Date: December 17, 2025
Duration: 5 hours 55 minutes (06:27 – 12:22 Amsterdam Time)

EXECUTIVE SUMMARY
On December 17, 2025, starting at 06:27 Amsterdam Time, UnderHost experienced a significant network outage affecting multiple racks in our Amsterdam (AMS-EQ01) datacenter. The incident involved multiple hardware failures across different infrastructure layers, requiring emergency hardware replacement and configuration migration. All services were fully restored by 12:22 Amsterdam Time.

WHAT CUSTOMERS EXPERIENCED
Customers with services located in the affected racks experienced partial connectivity loss, including:

  • Some servers remained accessible via private VLANs but lost public/WAN connectivity
  • Other servers became completely unreachable
  • Service restoration occurred in stages through a controlled, row-by-row recovery process
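
For customers who want to distinguish these two impact modes during a similar event, the sketch below is a minimal illustration (Python 3, standard library only) that probes a server's public address and its private VLAN address over TCP. It is not UnderHost tooling; the addresses and port are placeholders, and it should be run from a host with access to the relevant private VLAN, against a port known to be open (e.g., SSH).

  import socket

  def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
      # True if a TCP connection to host:port succeeds within the timeout.
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def classify(public_ip: str, private_ip: str) -> str:
      # Classify impact using the two paths described above (placeholder IPs).
      if tcp_reachable(public_ip):
          return "unaffected: public/WAN path reachable"
      if tcp_reachable(private_ip):
          return "partial: private VLAN reachable, public/WAN connectivity lost"
      return "full outage: unreachable on both paths"

  # Example with documentation-prefix placeholder addresses:
  # print(classify("203.0.113.10", "10.20.30.10"))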

TIMELINE OF EVENTS
06:27 – Initial outage detected; multiple racks reported connectivity issues
06:45 – Engineering team engaged; initial diagnosis pointed to network infrastructure
07:15 – Root cause identified: faulty PSU in a shared top-of-rack switch chassis
07:30 – Emergency hardware replacement initiated; affected PSU and management module replaced
08:00 – During recovery validation, an additional issue was discovered in the edge router infrastructure
08:30 – Decision made to accelerate the planned Juniper MX10K migration
09:15 – Configuration adaptation and validation for the new Juniper platform began
10:45 – Final row-by-row validation completed
12:22 – All services fully restored

ROOT CAUSE ANALYSIS
This incident involved two distinct but related hardware failures:

  1. Primary Failure (Immediate Impact)
    • Component: Faulty power supply unit (PSU) in an Arista top-of-rack aggregation switch chassis
    • Impact: Caused instability on uplink paths affecting multiple network rows
    • Resolution: PSU and associated management module were physically replaced
  2. Secondary Discovery (Extended Recovery)
    • Component: Defective backplane in an Arista edge router (static routing layer)
    • Complexity: Although the router was deployed in a redundant configuration, the backplane failure mode bypassed that redundancy
    • Resolution: Emergency migration to the Juniper MX10K platform, accelerated from the roadmap
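
To give a sense of the configuration adaptation this migration required, the sketch below is an illustration only (not our migration tooling): assuming the static routing layer can be expressed as (prefix, next-hop) pairs, it renders the same route in Arista EOS syntax and in Juniper Junos "set" syntax. The prefixes and next hop are documentation placeholders, not production data.

  # Illustrative only: render static routes in EOS syntax and Junos set-syntax.
  ROUTES = [
      ("203.0.113.0/24", "192.0.2.1"),
      ("198.51.100.0/24", "192.0.2.1"),
  ]

  def eos_static_route(prefix: str, next_hop: str) -> str:
      # Arista EOS form: "ip route <prefix> <next-hop>"
      return f"ip route {prefix} {next_hop}"

  def junos_static_route(prefix: str, next_hop: str) -> str:
      # Junos form: "set routing-options static route <prefix> next-hop <next-hop>"
      return f"set routing-options static route {prefix} next-hop {next_hop}"

  for prefix, nh in ROUTES:
      print(f"{eos_static_route(prefix, nh)}  ->  {junos_static_route(prefix, nh)}")

The syntax translation itself is the easy part; validating the adapted configuration against the live network under time pressure is what extended the recovery, as described in the next section.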

WHY RESTORATION TOOK 5 HOURS 55 MINUTES
Several factors contributed to the extended recovery time:

  1. Controlled Recovery Protocol
    • Row-by-row restoration was required to prevent cascading network issues
    • Each row required uplink, VLAN, ARP table, and routing validation (a simplified checklist sketch follows this list)
  2. Platform Migration Complexity
    • Emergency migration from Arista hardware to the Juniper MX10K platform
    • Configuration adaptation and validation under time-critical conditions
    • Ensuring compatibility with the existing network architecture
  3. Diagnostic Complexity
    • Multiple hardware failures across different network layers
    • Required systematic isolation to identify and resolve all affected components
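
The per-row validation referenced in item 1 can be pictured as a checklist runner. The sketch below is illustrative only (not our recovery tooling): each check is a callable returning True or False, and a row is only re-enabled when every check passes. In practice the checks query switch and router state for uplinks, VLAN membership, ARP entries, and routes; the example wiring uses placeholder checks.

  from typing import Callable, Dict, List

  Check = Callable[[], bool]

  def validate_row(row: str, checks: Dict[str, Check]) -> bool:
      # Run every check for one row; the row is only brought back if all pass.
      ok = True
      for name, check in checks.items():
          result = check()
          print(f"[{row}] {name}: {'PASS' if result else 'FAIL'}")
          ok = ok and result
      return ok

  def restore_rows(rows: List[str], build_checks: Callable[[str], Dict[str, Check]]) -> None:
      # Row-by-row restoration: halt before enabling the next row if validation fails.
      for row in rows:
          if not validate_row(row, build_checks(row)):
              print(f"[{row}] validation failed -- halting controlled recovery")
              return
          print(f"[{row}] validated -- enabling customer traffic")

  # Example wiring with placeholder checks (real checks query device state):
  # restore_rows(["row-A", "row-B"], lambda row: {
  #     "uplink":  lambda: True,   # e.g. both uplinks up/up
  #     "vlan":    lambda: True,   # e.g. expected VLANs present on the trunk
  #     "arp":     lambda: True,   # e.g. gateway MACs resolved
  #     "routing": lambda: True,   # e.g. default route via the new edge router
  # })

Restoring rows strictly in sequence is what keeps a half-validated row from introducing the cascading issues this protocol is designed to prevent.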

CORRECTIVE ACTIONS IMPLEMENTED

Immediate (Completed):
✓ Replacement of the faulty Arista ToR switch PSU and management module
✓ Emergency migration to the Juniper MX10K edge router platform
✓ Full validation of all network paths and routing tables
✓ Enhanced monitoring alerts for power supply and backplane health metrics
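
As one illustration of what the enhanced power-supply monitoring can look like (a sketch only, not our production monitoring stack), the snippet below polls an Arista switch over eAPI (JSON-RPC over HTTPS) for power-supply status and flags anything not reporting "ok". It requires the third-party requests package; the hostname, credentials, and the exact JSON field names ("powerSupplies", "state") are assumptions and may vary by EOS version.

  import requests

  EAPI_URL = "https://switch01.example.net/command-api"   # placeholder hostname
  AUTH = ("monitor", "********")                           # placeholder credentials

  def check_power_supplies() -> None:
      payload = {
          "jsonrpc": "2.0",
          "method": "runCmds",
          "params": {"version": 1, "cmds": ["show environment power"], "format": "json"},
          "id": "psu-health",
      }
      resp = requests.post(EAPI_URL, json=payload, auth=AUTH, timeout=10, verify=True)
      resp.raise_for_status()
      result = resp.json()["result"][0]
      # Field names below are assumed; adjust to the JSON your EOS version returns.
      for slot, psu in result.get("powerSupplies", {}).items():
          state = psu.get("state", "unknown")
          if state != "ok":
              print(f"ALERT: PSU {slot} state={state}")   # hook into alerting here
          else:
              print(f"PSU {slot} ok")

  if __name__ == "__main__":
      check_power_supplies()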

Short-Term (Next 30 Days):

  • Comprehensive review of all edge router hardware health
  • Implementation of additional redundancy verification checks
  • Updates to emergency migration procedures for cross-platform scenarios
  • Expansion of spare parts inventory at the datacenter

Long-Term (Network Roadmap):

  • Continued evolution toward a more resilient, multi-vendor network architecture
  • Implementation of automated failover testing for critical components
  • Enhanced monitoring for early detection of similar failure modes
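
For the automated failover testing item, the structure of such a drill might look like the sketch below. It is illustrative only: disable_link and enable_link are hypothetical hooks into the device API, the probe is a plain TCP check, and the settle time is an assumed reconvergence window; the production drills will be defined as part of the roadmap work.

  import socket
  import time

  def reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def failover_drill(target: str, disable_link, enable_link, settle_s: float = 30.0) -> bool:
      # Take one redundant path down, wait for reconvergence, confirm the target
      # stays reachable via the surviving path, then restore the link.
      if not reachable(target):
          raise RuntimeError("target unreachable before the drill; aborting")
      disable_link()                      # hypothetical hook into the device API
      try:
          time.sleep(settle_s)            # allow routing/LACP to reconverge
          survived = reachable(target)
      finally:
          enable_link()                   # always restore the link
      return survived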

COMMUNICATION REVIEW
We acknowledge that initial communications attributed the disruption to a shared power-distribution path and did not fully reflect the top-of-rack switch failure or the broader edge router migration that was in progress. Going forward, we will:

  • Provide more frequent updates during extended recovery periods
  • Clearly differentiate between immediate fixes and broader infrastructure changes
  • Maintain transparency when multi-layer issues are involved

LESSONS LEARNED

  1. Multi-Layer Redundancy: Redundancy at one layer does not guarantee protection against failures at another
  2. Platform Diversity: Predefined migration procedures for multiple hardware platforms are essential
  3. Communication Clarity: Multi-stage recoveries require clear, phased communication
  4. Proactive Health Checks: Additional monitoring is required for PSU and backplane health
  5. Emergency Acceleration: Critical incidents can accelerate planned improvements with long-term benefits

APPRECIATION
We sincerely apologize for the disruption this incident caused to your operations. Our engineering team worked continuously and with the highest priority to restore services as quickly and safely as possible. We appreciate your patience and understanding during this event.

CONTACT
If you have any questions regarding this incident or observe any lingering issues, please contact our support team.

Sincerely,
The UnderHost Engineering & Operations Team

Posted Dec 17, 2025 - 07:01 PST

Resolved

This incident has been fully resolved. Services in our Netherlands datacenter are operating normally, and monitoring confirms sustained stability across all previously affected racks.

A detailed post-incident report, including root cause analysis and corrective actions, will be published for full transparency.

If you experience any remaining issues, please contact support with your service ID and affected IPs for immediate assistance.
Posted Dec 17, 2025 - 03:38 PST

Monitoring

Service has been restored across the affected racks in our Netherlands datacenter. Power and network stability have been re-established, and all impacted systems are now reachable.

We are actively monitoring the environment to ensure continued stability and to validate full service recovery. While operations have returned to normal, we will continue close observation before marking this incident as fully resolved.

A full incident report and root cause disclosure will be published once the datacenter investigation is complete.
Posted Dec 17, 2025 - 02:46 PST

Identified

The investigation is ongoing and the issue has been isolated to a power distribution segment within the datacenter affecting multiple racks. Initial diagnostics indicate instability on a shared PDU/UPS path, which triggered protective shutdowns on several nodes to prevent hardware damage.

Datacenter engineers are currently validating power stability and coordinating a controlled bring-up of affected racks. Network connectivity and host availability may fluctuate during this process.

At this time, there is no confirmed ETA for full restoration. Further updates will be provided as soon as additional information is confirmed.
Posted Dec 16, 2025 - 23:41 PST

Update

Partial or full service disruption is affecting a subset of servers (multiple racks). Customers may see downtime, packet loss, or unreachable services.
Posted Dec 16, 2025 - 22:11 PST

Investigating

We are currently investigating an outage at our Netherlands datacenter impacting multiple racks. Our engineers and the facility NOC are actively working to identify the root cause and restore service. Further updates will be posted as we receive confirmation from the datacenter.
Posted Dec 16, 2025 - 22:10 PST
This incident affected: Netherlands.