Microsoft 365 Outage: The Full Story (PIR Follow-up)

A few days ago, I posted about the major Microsoft 365 outage that hit on January 22-23, affecting Exchange Online, Teams, SharePoint, OneDrive, and multiple admin portals. At the time, services were just coming back online, and we were waiting for the full Post-Incident Report.

Well, Microsoft has now published their preliminary PIR (Incident ID: MO1221364), and there's a lot to unpack... Let me walk you through exactly what happened, the technical details behind the outage, and what Microsoft is doing to prevent this from happening again.

Quick recap of what we knew then:

  • Outage lasted from 8:33 PM to 6:00 AM (GMT+1)
  • Root cause: Elevated service load during planned maintenance
  • Impact: Global, affecting millions of users
  • Status: All services restored

What the PIR reveals:

What Went Wrong?

On Thursday, January 22, 2026, starting around 5:45 PM UTC, users began experiencing serious issues with Microsoft 365 services. The primary impacts? Exchange Online and Microsoft Teams were hit the hardest.

Here's what users were dealing with:

Exchange Online Issues

  • Email delivery completely borked: Sending and receiving external emails just... stopped working
  • Error messages like 451 4.3.2 temporary server issue started popping up
  • Later on, even worse errors showed up: 5xx Host Unknown - Name server: Host not found, Domain Not Found (see the quick mail-flow check sketched below)
  • Message trace collection (you know, the thing admins use to troubleshoot email issues) also failed
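
If you were an admin trying to work out whether it was "you or them", those error codes were the giveaway. For future incidents, here's a rough sketch of a sanity check you can run from your own side: resolve a recipient domain's MX records and see whether the top host still answers on port 25. It assumes the dnspython package and a placeholder domain - it's a generic diagnostic, not anything from the PIR:

    # A rough way to sanity-check external mail flow: resolve a recipient
    # domain's MX records, then see whether the highest-priority host answers
    # on port 25. Assumes the dnspython package (pip install dnspython); note
    # that many networks block outbound port 25, so run this from a host that
    # is actually allowed to send mail.
    import smtplib
    import dns.resolver

    def check_mail_path(domain: str) -> None:
        # This resolution step is what broke during the outage and produced
        # the "Host not found" / "Domain Not Found" errors quoted above.
        answers = dns.resolver.resolve(domain, "MX")
        mx_hosts = sorted((r.preference, str(r.exchange).rstrip(".")) for r in answers)
        print(f"MX records for {domain}: {mx_hosts}")

        # Connect to the best (lowest-preference) MX host and read its banner.
        _, host = mx_hosts[0]
        with smtplib.SMTP(host, 25, timeout=10) as smtp:
            code, _ = smtp.ehlo()
            print(f"{host} answered EHLO with code {code}")

    if __name__ == "__main__":
        check_mail_path("example.com")  # placeholder - use a domain you actually mail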

Microsoft Teams Problems

  • Can't create new chats, meetings, teams, or channels
  • Adding members to teams? = Nope.
  • Breakout rooms in meetings? = Not happening.
  • Teams live events? = Forget about it.
  • Search functionality was down
  • Presence indicators (the little green/yellow/red dots) weren't working

Microsoft Defender

Users couldn't access admin portals, the quarantine portal was unreachable, and Safe Links protection was compromised. Not great when you're trying to keep your organization secure...

Other Affected Services

The ripple effect hit a bunch of other services too:

  • Microsoft 365 admin center
  • Defender for Cloud Apps
  • Microsoft Fabric
  • Power Automate
  • Microsoft Purview
  • OneDrive for Business
  • SharePoint Online
  • Universal Print
  • And more...

The Root Cause: Deeper Than "Elevated Service Load"

In my initial post, I mentioned the root cause was "elevated service load during planned maintenance on North American infrastructure." Now we have the full technical story, and it's a textbook example of how cascading failures work in distributed systems.

Here's what actually happened:

  1. The Cheyenne datacenter went offline at 5:45 PM UTC (planned maintenance)
  2. Traffic got redirected to other Global Location Service (GLS) load balancers in the region
  3. The load balancers couldn't handle the sudden influx - they essentially choked and went into an unhealthy state
  4. Retry storms kicked in - when services couldn't connect to GLS, they kept retrying, amplifying the problem exponentially (see the sketch below)
  5. DNS resolution failed - GLS provides DNS services, so when it went down, email delivery broke
  6. Cascading failure across services - Exchange Online, Teams, and dependent services all started failing

It's like pulling one card from a house of cards, except the card you pulled was actually holding up the entire structure.
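
To make step 4 concrete, here's a tiny back-of-the-envelope simulation comparing naive once-per-second retries with capped exponential backoff plus jitter, the standard mitigation for retry storms. The numbers are made up for illustration, and this is obviously not Microsoft's code:

    # Retry-storm illustration: 1,000 clients all lose a dependency at t=0.
    # Naive once-per-second retries hammer it for the whole outage; capped
    # exponential backoff with "full jitter" spreads the retries out.
    import random

    OUTAGE_SECONDS = 60
    CLIENTS = 1_000

    def naive_retry_load() -> int:
        # Every client retries once per second until the dependency recovers.
        return CLIENTS * OUTAGE_SECONDS

    def backoff_retry_load(base: float = 1.0, cap: float = 30.0) -> int:
        # Each client waits up to base * 2^attempt (capped) between retries,
        # with jitter so the retries don't all line up.
        total = 0
        for _ in range(CLIENTS):
            elapsed, attempt = 0.0, 0
            while elapsed < OUTAGE_SECONDS:
                total += 1
                delay = min(cap, base * 2 ** attempt)
                elapsed += random.uniform(0, delay)
                attempt += 1
        return total

    print("requests during outage, naive retries :", naive_retry_load())
    print("requests during outage, backoff+jitter:", backoff_retry_load())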

The Timeline Confirmed

The PIR pins down the outage window: 5:45 PM UTC on January 22 to 5:00 AM UTC on January 23 (6:45 PM to 6:00 AM GMT+1) - so the impact actually began a bit earlier than the 8:33 PM I quoted in the original post.

Here's the condensed version of Microsoft's 11-hour recovery effort:

Initial Impact (5:45 PM - 7:30 PM UTC)

  • Cheyenne datacenter goes offline as planned
  • Within 13 minutes, email delivery starts failing
  • Microsoft confirms the issue and identifies GLS service problems
  • DNS resolution issues detected at the GLS level

Mitigation Phase (7:30 PM - 11:30 PM UTC)

  • Azure Traffic Manager (ATM) routing changes implemented to shift traffic
  • Hotfix deployed to address CPU overload and DNS issues
  • Traffic profiles reshaped to reduce load
  • F5 load balancer forced to standby, shifting traffic to passive device (this was the breakthrough)

Recovery Phase (12:00 AM - 5:00 AM UTC)

  • Endpoints brought back online with minimal traffic
  • Additional routing changes to absorb excess demand
  • Separate DNS profile established for better control
  • Traffic reintroduced datacenter by datacenter (see the sketch below)
  • Full service restoration confirmed at 5:00 AM UTC

Total incident duration: ~11 hours and 15 minutes
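
That datacenter-by-datacenter reintroduction is the textbook way to bring load back without re-triggering the overload: ramp the traffic weight up in steps and gate each step on a health check. Here's a bare-bones sketch of the pattern, with placeholder weights and a stubbed health check rather than anything from Microsoft's tooling:

    # Staged traffic reintroduction with a health gate at each step.
    # Weights, soak times, and the health check are all placeholders.
    import time

    RAMP_STEPS = [1, 5, 25, 50, 100]          # percent of normal traffic

    def healthy(datacenter: str, weight: int) -> bool:
        # Placeholder: in practice this would look at error rates, load
        # balancer CPU, DNS success rates, etc. at the current weight.
        return True

    def reintroduce(datacenter: str) -> None:
        for weight in RAMP_STEPS:
            print(f"{datacenter}: routing {weight}% of traffic")
            time.sleep(1)                     # soak time, shortened for the sketch
            if not healthy(datacenter, weight):
                print(f"{datacenter}: unhealthy at {weight}%, backing off to 0%")
                return
        print(f"{datacenter}: fully back in rotation")

    for dc in ["datacenter-A", "datacenter-B", "datacenter-C"]:
        reintroduce(dc)                       # one datacenter at a time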

Microsoft's Response Plan

Microsoft has outlined concrete action items with committed timelines:

Already in Progress:

  • Updating Standard Operating Procedures for Azure regional failures to improve response times
  • Adding traffic isolation safeguards with more granular analysis to prevent retry storms
  • Implementing a caching layer to reduce GLS load and provide redundancy (sketched below)
  • Automating the manual traffic redistribution processes used during this incident
  • Improving internal communication workflows to identify impacted services faster
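
The caching-layer item is the one I find most interesting, because it attacks both problems at once: a local cache in front of GLS can answer repeat lookups without touching the backend (less load) and can serve a stale answer when the backend is unreachable (redundancy). Here's a minimal, deliberately simplified sketch of that pattern - hypothetical, not Microsoft's design:

    # A lookup cache in front of a directory-style service: repeat lookups are
    # answered locally, and a stale entry is served if the backend is down.
    import time

    class LookupCache:
        def __init__(self, backend, ttl_seconds: float = 300.0):
            self.backend = backend            # callable: key -> value, may raise
            self.ttl = ttl_seconds
            self.entries = {}                 # key -> (value, fetched_at)

        def get(self, key: str):
            entry = self.entries.get(key)
            if entry and time.monotonic() - entry[1] < self.ttl:
                return entry[0]               # fresh hit: the backend never sees it
            try:
                value = self.backend(key)
                self.entries[key] = (value, time.monotonic())
                return value
            except Exception:
                if entry:
                    return entry[0]           # backend down: serve the stale value
                raise                         # nothing cached, nothing we can do

    # Toy usage: the "backend" here just fabricates an answer.
    cache = LookupCache(backend=lambda key: f"endpoint-for-{key}")
    print(cache.get("contoso.com"))           # miss: hits the backend
    print(cache.get("contoso.com"))           # hit: served from cache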

Targeted for March 2026:

  • Adjusting service timeout logic to reduce load during high-traffic events (see the sketch below)
  • Adding infrastructure capacity to handle similar regional failures
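
On the timeout-logic item: one common approach is a shared deadline (a "timeout budget") across all retries of a request, so callers eventually give up instead of piling fresh timeouts onto a struggling dependency. A rough, purely illustrative sketch:

    # A shared deadline across retries: each attempt only gets whatever budget
    # is left, and once the budget is gone the caller stops adding load.
    import random
    import time

    def call_with_deadline(operation, total_budget_s: float = 5.0):
        deadline = time.monotonic() + total_budget_s
        attempt = 0
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("budget exhausted - give up instead of retrying")
            try:
                # Each attempt gets at most the remaining budget, never a fresh one.
                return operation(timeout=min(1.0, remaining))
            except TimeoutError:
                attempt += 1
                time.sleep(min(remaining, 0.1 * 2 ** attempt))  # backoff inside budget

    def flaky_backend(timeout: float) -> str:
        # Simulated overloaded dependency; the timeout argument is ignored here.
        if random.random() < 0.7:
            raise TimeoutError
        return "ok"

    try:
        print(call_with_deadline(flaky_backend))
    except TimeoutError as exc:
        print("request failed:", exc)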

Key Takeaways

This incident demonstrates how complex distributed systems can fail:

Cascading failures are real - One component failure triggered a domino effect across multiple services

Retry storms amplify problems - When services automatically retry failed requests, they can make things exponentially worse

Load balancers have limits - Total capacity isn't enough if load balancers can't handle sudden traffic shifts

DNS is critical infrastructure - When DNS resolution breaks, everything downstream breaks

For IT Admins: If you experienced residual issues, Microsoft recommended clearing local DNS caches or temporarily lowering DNS TTL values. This is also a good reminder to have monitoring for external service dependencies and document your incident response procedures.
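
If you want to check whether a flush is even worth it, the sketch below asks your resolver for the TTLs it's currently handing out for a couple of well-known Microsoft 365 endpoints (it assumes dnspython); the per-OS flush commands are noted in the comments:

    # Quick check before flushing anything: what TTLs is your resolver handing
    # out right now? Assumes the dnspython package (pip install dnspython).
    import dns.resolver

    ENDPOINTS = ["outlook.office365.com", "teams.microsoft.com"]

    for name in ENDPOINTS:
        answer = dns.resolver.resolve(name, "A")
        addresses = ", ".join(r.address for r in answer)
        print(f"{name}: TTL {answer.rrset.ttl}s -> {addresses}")

    # The flush itself is per-OS, for example:
    #   Windows: ipconfig /flushdns
    #   macOS:   sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
    #   Linux:   resolvectl flush-caches   (systemd-resolved)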

The Bottom Line

This outage impacted organizations globally, not just in North America where the infrastructure issue originated.

What matters with incidents like this:

  • Response speed - Microsoft had engineers on it within minutes
  • Transparency - This PIR gives us the full technical story
  • Prevention - The action items look solid, but execution is what counts

Microsoft identified the issue within an hour, worked through the night to fix it, and published a detailed PIR within days. The response was solid overall.

The real test? Following through on those March 2026 deadlines for capacity and timeout improvements.

Hope everyone made it through alright! 😃


Based on Microsoft's Preliminary Post Incident Report for MO1221364. A final PIR will be published within five business days.
