AWS Outage: Timeline, Cause & How to Stay Online

A single AWS service failure or severe degradation can take down your website, your APIs, and everything that depends on them. The October 2025 incident, which originated in the US-East-1 region and involved a DNS/internal network issue, showed how even the largest services can lose connectivity. But here's the good news: you don't have to sit idle next time. In this article, I'll show you how to check real-time status, what actually went wrong, and a response plan you can replicate.


2) Quick Status & Where to Check Live Issues

When you ask yourself, "Is AWS down?", the quickest way to check is the AWS Health Dashboard (the official global view plus an account-level view). According to AWS:

The AWS Health Dashboard provides the latest information on the availability of AWS services around the world, and it shows service history for the past 12 months.
Here's how to use it, step by step:

  1. Open the global AWS Health Dashboard, or sign in to the AWS console and navigate to the Health section for account-specific events.
  2. Look for notices such as "Increased error rates," "Elevated latency," or "Service disruption." For example, the October 2025 event reported "increased error rates and latencies for multiple AWS services in the US-East-1 Region."
  3. Compare with third-party monitors (such as Downdetector), which aggregate user-side reports.
  4. Subscribe to alerts: route AWS Health events through EventBridge to SNS so your account is notified as soon as AWS posts a regional event (a sample rule is sketched in Section 8).
    Now, a quick hands-on checklist:
  • Open the AWS Health Dashboard → review "Service history" and any open issues, and check your account-specific health events (see the sketch just after this list).
  • Optionally, watch external outage-tracking sites to see whether user reports are spiking.
  • Confirm whether the incident affects your region, your specific endpoints, or is global.
    What's the difference between these sources? Third-party sites show the user-side view; the AWS dashboard shows the infrastructure and control-plane view. Together they give you a clear picture.
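
The following is a minimal sketch of that account-level check using boto3 and the AWS Health API. Two assumptions to note: credentials are already configured, and programmatic access to the AWS Health API generally requires a Business, Enterprise On-Ramp, or Enterprise support plan.

    # Minimal sketch: list open AWS Health events for us-east-1 via boto3.
    # Assumes configured credentials and a support plan that includes
    # programmatic access to the AWS Health API.
    import boto3

    health = boto3.client("health", region_name="us-east-1")  # Health API endpoint

    response = health.describe_events(
        filter={
            "regions": ["us-east-1"],        # the region you care about
            "eventStatusCodes": ["open"],    # only events AWS has not closed yet
        }
    )

    for event in response["events"]:
        print(event["service"], event["eventTypeCode"], event["startTime"])

If this prints nothing while your own monitoring is on fire, the problem is more likely in your account or architecture than in AWS itself.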

3) The Latest Big AWS Outage — Fast Timeline (What we know)

Here is a summary of the October 20, 2025 outage.

  • ~07:11 UTC: Reports of elevated error rates and latency begin to appear for AWS services in the US-East-1 region.
  • ~07:55 UTC: Monitoring providers (e.g., ThousandEyes) observe widespread service degradation tied to the US-East-1 region.
  • AWS identifies a "DNS resolution issue with the regional DynamoDB service endpoint" in US-East-1.
  • Secondary impact: EC2 instance launches are impaired, and other services that depend on these subsystems degrade as well.
  • Roughly 15 hours later (mid-to-late in the US day), AWS reports full service recovery. Affected services included apps built on AWS such as Snapchat, Fortnite, and Duolingo.
    In short: a DNS resolution failure inside US-East-1 cascaded into dependent services and took hours to fully resolve.

4) Why AWS Outages Happen (Without the Hype)

Let's get straight to the point (but I'll keep it easy to read): why do cloud providers like AWS fail at all, and what do "control plane" and "data plane" actually mean?

Control plane vs. data plane

  • Data plane = the actual work: your compute, your database queries, your API requests, your application traffic.
  • Control plane = the setup and coordination layer: management APIs, health checks, infrastructure configuration, provisioning, and DNS resolution.

When the control plane fails, for example DNS resolution for a database endpoint, the data plane can stop working even though your own servers and credentials are fine. That is exactly what happened here: AWS said the issue was caused by a DNS resolution failure for the DynamoDB API endpoint in the US-East-1 region.
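
You can see that dependency for yourself with a few lines of Python (standard library only). The endpoint below is the public regional DynamoDB hostname; everything else is just illustration.

    # Minimal sketch: every DynamoDB call depends on DNS resolution first.
    # If this lookup fails, the data plane is unreachable even though the
    # tables themselves may be perfectly healthy.
    import socket

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # public regional endpoint

    try:
        addresses = {info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)}
        print(f"{ENDPOINT} resolves to: {sorted(addresses)}")
    except socket.gaierror as exc:
        # This is the failure mode from the October 2025 incident: the
        # control-plane dependency (DNS) breaks before any API call is made.
        print(f"DNS resolution failed for {ENDPOINT}: {exc}")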

The hidden-dependency trap

Suppose your application depends on "a DynamoDB table in the US-East-1 region." In reality, that single dependency hides several more:

  • The SDK uses DNS to find the table's regional endpoint.
  • That DNS name is resolved via regional control-plane services.
  • Load balancers depend on traffic routing and health checks.
  • Many services assume that multiple Availability Zones within the same region provide enough protection.
    When one component fails (DNS, in this case), everything that depends on it starts to fail.

As Wired observed, operational resilience can only be sacrificed for cost savings up to a point.

Why does US-East-1 keep blowing up?

  • It is the oldest and largest AWS region (Northern Virginia, USA), and many default and global service endpoints live there.
  • Many customers use it as their only region (or leave it as the default), often without realizing it, for speed and cost reasons.
  • Outages there have an outsized impact because so much of the world's traffic depends on it.
    For example, although this outage was centered entirely in US-East-1, the consequences were global ([The Guardian][9]).

Key takeaway:

Cloud services are highly resilient, but they are not fail-safe, and your architecture needs to account for failure, especially in control-plane subsystems. This is where "don't put all your eggs in one basket" really applies.


5) Business Impact: Real-World Examples & What It Costs

When I say "your business may go offline," I mean it. Putting this in a real-world context shows how serious it can get.

  • The October 20, 2025 AWS outage affected major apps such as Snapchat, Fortnite, Reddit, and more; reports indicate that over 1,000 services were disrupted worldwide.
    It wasn't just entertainment apps: financial services (the Bank of England was reportedly affected) and airline check-in systems were hit too.
    Why does this matter to you? Every minute your service is offline, you lose revenue and customer trust. Imagine a checkout flow your customers can't complete, or a SaaS your users can't log in to. That's more than a financial cost; it's your company's reputation. For a major cloud outage, experts have estimated total losses to affected businesses in the hundreds of billions of dollars. And as you recover, the backlog grows: more queued work, more support calls, and a higher risk of customer churn. Especially for cloud-native services, a single-region architecture that assumes "everything is fine" fails fast.
    For example, many teams had standardized on US-East-1 alone, turning their resilience plan into a single point of failure. For some of them, the outage lasted as long as 15 hours.

Bottom line: if your application is hosted on AWS and you haven't prepared, the question isn't whether an outage will affect you; it's how badly.

6) Immediate Response Playbook (Copy-and-Run Checklist)

Here's the part I really hope you copy and keep handy. When things go south, you want to move fast, stay calm, and stay coordinated. Below is a how-to style playbook you can tailor.

HowTo: AWS Outage Incident Response Checklist

Materials needed: your service architecture map, status-page credentials, major stakeholder logins, and comms templates.

  1. Incident declared
  • Person: Incident Commander (IC) designated (can be you or senior DevOps lead).
  • Set channel: Slack/Teams + standby conference bridge.
  • Log start time, known symptoms (error spikes, 5xx, latency, region US-East-1).
  2. Triage & immediate actions
  • Use the AWS Health Dashboard → check for open issues in US-East-1 or whichever region you operate in.
  • Check your internal monitoring: Are your Availability Zones failing? Are error rates spiking? Are 5xx errors increasing?
  • Activate feature flags or circuit breakers if applicable, so that non-critical features turn off or degrade gracefully (a minimal circuit-breaker sketch follows this checklist).
  • Traffic shedding: if you have global edge services, apply rate limits to reduce load on the affected region.
  3. Communication plan
  • Internal: Brief exec/team—what’s known, where we’re impacted, mitigation in progress.
  • External (customers/users): Publish status update—“We are aware of degraded service; root cause region US-East-1; we are working on mitigation and will provide updates every xx minutes.”
  • Use your status page + social media + email, depending on service size.
  4. Recovery monitoring
  • Once mitigation begins, monitor for improvement: reduced error rates, successful new instances or endpoint resolutions, and traffic recovery.
  • AWS may be throttling new launches/operations—check for backlog. For example, AWS said EC2 instance launches were impaired post-DNS fix.
  5. Post-incident wrap-up
  • Log timeline: when symptoms began, when the root cause was identified, when service was restored, and when full recovery was done.
  • Capture root‐cause summary: e.g., “DNS resolution for DynamoDB endpoint in US-East-1 failed → EC2 launches impacted → NLB health checks degraded”.
  • Identify customer impact: which features/users were offline, how long, and what fallback workflows worked/failed.
  • Publish or archive your post-mortem: what went well, what didn’t, and action items. Bonus: if you publish a summary externally (transparency counts).
  6. Actions/remediations
  • Immediately review your architecture: Are you region-locked? Do you have a fallback region?
  • Review your monitoring/alerting thresholds: Did you detect the issue early? Was your SLA/RTO met?
  • Update playbooks: Add regions, add fallback endpoints, add comms template.
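
To make the circuit-breaker idea in step 2 concrete, here is a minimal, framework-free sketch in Python. The class, thresholds, and the hypothetical fetch_recommendations call are illustrative assumptions, not part of any AWS SDK.

    # Minimal circuit-breaker sketch for step 2 of the playbook.
    # All names and thresholds are illustrative assumptions.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold  # failures before opening
            self.reset_timeout = reset_timeout          # seconds before retrying
            self.failures = 0
            self.opened_at = None

        def call(self, func, fallback):
            # While the breaker is open, serve the degraded fallback immediately.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    return fallback()
                self.opened_at = None  # half-open: allow one trial call
                self.failures = 0
            try:
                result = func()
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # open the breaker
                return fallback()

    # Usage: wrap a non-critical dependency (e.g., recommendations) so an
    # outage in that dependency degrades gracefully instead of failing requests.
    breaker = CircuitBreaker()
    # result = breaker.call(fetch_recommendations, lambda: [])  # hypothetical call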

7) Hardening Guide: Architecting to Survive the Next AWS Outage

Yes, this is the big one. If you want your application to survive the next incident (and there will be one), this is where you lay the foundation.

Multi-AZ vs. multi-region: when each is useful

  • Multi-AZ (within a single region): protects availability if one Availability Zone fails, but not if the regional control plane (or underlying services such as DNS or NLBs) fails.
  • Multi-region (two or more AWS regions): keeps you online even if an entire region such as US-East-1 fails. Yes, it is more complex (data replication, latency, cost), but for business-critical applications it is worth it.

Route 53 failover, health checks, and active-active vs. active-passive

  • With Amazon Route 53, you configure health checks against your endpoints; if Region A fails its checks, Route 53 fails traffic over to Region B (a boto3 sketch follows the architecture box below).
  • Active-active: both regions handle traffic at the same time. DNS splits traffic (by geography, latency, or weight) and automatically shifts it away from an unhealthy region.
  • Active-passive: Region A handles traffic while Region B stands by. If A fails, you switch to B (manually or automatically). Simpler, but slower to fail over.
  • Example architecture box:
    User → CloudFront edge → Route53 weighted/latency DNS  
       Region A (US-East-1): ALB → EC2 + Lambda + DynamoDB global table  
       Region B (US-West-2): ALB → EC2 + Lambda + replica DB  
       Data replication: DynamoDB global table or S3 CRR  
       Session state: External store (Redis/Memcached) globally via ElastiCache + Global Datastore  
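
Here is a minimal boto3 sketch of the active-passive pattern: one health check on the primary region plus PRIMARY/SECONDARY failover records. The hosted zone ID, domain names, and regional endpoints are placeholders you would replace with your own.

    # Minimal sketch: Route 53 health check + active-passive failover records.
    # Hosted zone ID, domain names, and endpoints below are placeholders.
    import boto3

    route53 = boto3.client("route53")

    # 1) Health check against the primary region's endpoint.
    hc = route53.create_health_check(
        CallerReference="primary-hc-001",  # must be unique per request
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "app.us-east-1.example.com",
            "Port": 443,
            "ResourcePath": "/health",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # 2) PRIMARY record (us-east-1) and SECONDARY record (us-west-2).
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "TTL": 60,
                        "SetIdentifier": "primary-us-east-1",
                        "Failover": "PRIMARY",
                        "HealthCheckId": hc["HealthCheck"]["Id"],
                        "ResourceRecords": [{"Value": "app.us-east-1.example.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "TTL": 60,
                        "SetIdentifier": "secondary-us-west-2",
                        "Failover": "SECONDARY",
                        "ResourceRecords": [{"Value": "app.us-west-2.example.com"}],
                    },
                },
            ]
        },
    )

When the health check fails, Route 53 starts answering queries with the secondary value; the low TTL keeps client caches from pinning users to the failed region for long.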

Data resilience: DynamoDB global tables, S3 CRR, RDS cross-region replicas

  • For NoSQL, use Amazon DynamoDB Global Tables so data writes replicate across regions.
  • For object storage: S3 Cross-Region Replication (CRR).
  • For relational: RDS read replicas in another region, or cross-region writes where the engine supports it.
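
As a sketch of the DynamoDB option, the call below adds a us-west-2 replica to an existing table using boto3, turning it into a global table. The table name is a placeholder, and the table must already satisfy the global-table prerequisites (for example, streams enabled and a supported capacity mode).

    # Minimal sketch: add a cross-region replica to an existing DynamoDB table.
    # "orders" is a hypothetical table that meets global-table prerequisites.
    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    dynamodb.update_table(
        TableName="orders",
        ReplicaUpdates=[
            {"Create": {"RegionName": "us-west-2"}}  # new replica region
        ],
    )

    # Clients in us-west-2 can then read and write against the local replica,
    # so a DNS or control-plane problem in us-east-1 no longer blocks them.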

Stateless services, externalised session state, idempotency

  • Keep service processes stateless: your app servers don’t hold unique data—they pull from external stores. That way, if the region fails, you can spin up servers in another region quickly.
  • Session state (logged-in user cookies, shopping carts) should use a globally reachable store (e.g., DynamoDB global tables or Redis global datastore).
  • Make your operations idempotent: e.g., payment processing should tolerate duplicate or retried requests, so if your service fails mid-process, a retry is safe.
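
Here is a minimal sketch of that idempotency pattern using a DynamoDB conditional write: the first request with a given key wins, and any retry is detected as a duplicate. The table and attribute names are illustrative assumptions.

    # Minimal idempotency sketch: record each request's idempotency key with a
    # conditional write, so a retried payment is processed at most once.
    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    def process_payment_once(idempotency_key: str, amount_cents: int) -> str:
        try:
            # Succeeds only if this key has never been seen before.
            dynamodb.put_item(
                TableName="payment-idempotency",  # hypothetical table
                Item={
                    "pk": {"S": idempotency_key},
                    "amount_cents": {"N": str(amount_cents)},
                },
                ConditionExpression="attribute_not_exists(pk)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return "duplicate: already processed"
            raise
        # The actual charge would happen here in a real system.
        return "processed"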

Caching and graceful degradation patterns

  • Use caching (CloudFront + global caches) so read traffic can continue even if the backend goes down.
  • Feature flag non-critical parts (e.g., recommendations, analytics) so you can “turn off” heavier dependencies in case of failure.
  • Graceful degradation: if region X fails, you serve “read-only” mode, or redirect to a minimal version of the app, rather than full failure.
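
As one way to implement that degraded mode, the sketch below serves stale cached data when the backend call fails instead of returning an error. The in-process dictionary stands in for whatever cache you actually use (CloudFront, Redis, etc.); names and limits are illustrative.

    # Minimal graceful-degradation sketch: serve stale cached data when the
    # primary backend is unreachable, instead of failing the whole request.
    import time

    CACHE = {}                # key -> (stored_at, value); stand-in for a real cache
    STALE_OK_SECONDS = 3600   # how stale we tolerate during an outage

    def get_with_fallback(key, fetch_fresh):
        try:
            value = fetch_fresh(key)               # normal path: hit the backend
            CACHE[key] = (time.time(), value)
            return value, "fresh"
        except Exception:
            stored_at, value = CACHE.get(key, (0, None))
            if value is not None and time.time() - stored_at < STALE_OK_SECONDS:
                return value, "stale (degraded mode)"
            raise                                  # nothing usable cached: surface the failure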

Cost-aware resilience matrix (tier your workloads)

Not every workload needs a full multi-region. Here’s a simple matrix:

| Tier | Business impact if down | Recommended resilience |
|------|-------------------------|------------------------|
| Tier 1 | Revenue-critical, real-time (checkout, auth) | Active-active multi-region |
| Tier 2 | Important but not revenue-critical | Active-passive multi-region |
| Tier 3 | Internal tools, analytics, non-customer-facing | Single region + strong backups |
See the pattern? Decide what you can afford to invest versus what you can afford to lose, and design accordingly.

8) Observability & Early Warning (What to Watch)

You can't fix what you can't see. Observability is your early-warning radar. Here are the signals and tools to put in place now.

Metrics and alarms worth watching

  • API error rates (5xx, 4xx) per region/Availability Zone: a sudden 5xx spike in one region is a red flag.
  • Control-plane latency: for example, EC2 instance launches taking far longer than usual.
  • DNS resolution: if your service depends on resolving regional endpoints, monitor resolution success and latency as well.
  • NLB/ALB health-check failures by zone: a sudden drop in the number of healthy targets points to a data-plane problem.
  • Synthetic probes: hit your critical endpoints (login, checkout, APIs) from several geographic locations so you notice failures before your users do.
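
For the first item, a single CloudWatch alarm on your load balancer's 5xx count goes a long way. In this sketch the load balancer dimension value and the SNS topic ARN are placeholders, and the threshold should be tuned to your normal traffic.

    # Minimal sketch: alarm when ALB 5xx responses spike.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="alb-5xx-spike",
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_ELB_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
        Statistic="Sum",
        Period=60,                      # one-minute buckets
        EvaluationPeriods=3,            # three consecutive breaching minutes
        Threshold=50,                   # tune to your normal traffic
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
    )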

Integration: AWS Health → EventBridge → SNS/pager

Use AWS Health events as the source: AWS publishes account- and region-specific events when critical issues occur, and you can capture them with Amazon EventBridge and fan them out to SNS, email, or Slack.
For example, you would be paged the moment an event like "DynamoDB DNS resolution issue in US-East-1" enters the "Open" status. A minimal rule is sketched below.
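
This sketch creates an EventBridge rule that matches all AWS Health events and forwards them to an SNS topic. The topic ARN is a placeholder, and the topic's access policy must allow events.amazonaws.com to publish to it.

    # Minimal sketch: route AWS Health events to an SNS topic via EventBridge.
    import json
    import boto3

    events = boto3.client("events", region_name="us-east-1")

    rule_name = "aws-health-to-ops"
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps({"source": ["aws.health"]}),  # all AWS Health events
        State="ENABLED",
    )

    events.put_targets(
        Rule=rule_name,
        Targets=[
            {
                "Id": "ops-alerts-sns",
                # Placeholder topic; its policy must allow events.amazonaws.com to Publish.
                "Arn": "arn:aws:sns:us-east-1:123456789012:ops-alerts",
            }
        ],
    )

Subscribe your email, pager, or Slack webhook to that topic and you will hear about regional events as soon as AWS posts them.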

Synthetic monitoring from multiple vantage points

Probe your endpoints from multiple networks and locations (home broadband, mobile, corporate VPN, other regions) so you don't miss regional issues. For example, if all of your monitoring runs inside a single region, another region can be down without you knowing until users complain. A tiny probe sketch follows.
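
Here is a minimal synthetic-probe sketch using only the Python standard library. The endpoint URLs are placeholders; run something like this on a schedule from several vantage points and alert on failures or slow responses.

    # Minimal synthetic probe: check a few critical endpoints and time them.
    import time
    import urllib.request

    ENDPOINTS = [
        "https://app.us-east-1.example.com/health",   # hypothetical endpoint
        "https://app.us-west-2.example.com/health",   # hypothetical endpoint
    ]

    for url in ENDPOINTS:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                elapsed_ms = (time.monotonic() - start) * 1000
                print(f"{url}: HTTP {resp.status} in {elapsed_ms:.0f} ms")
        except Exception as exc:
            print(f"{url}: FAILED ({exc})")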


9) Case Studies: 2021 & 2023 Incidents — Lessons Learned

Here is what history teaches us. (As the person writing this, I have a soft spot for "this has happened before" stories.)

December 7, 2021

AWS reported an impairment of multiple network devices in the US-East-1 region that caused widespread errors across many AWS services. Lesson: even when your compute is healthy, internal networking and routing can ruin your day.

June 13, 2023 (US-East-1)

An AWS Lambda disruption in the US-East-1 region showed up as increased error rates and latencies for function invocations. Because of the dependency chain (Lambda → downstream services and APIs), customers relying on serverless compute were affected.

October 20, 2025

The root cause was a DNS resolution failure for the DynamoDB API endpoint in the US-East-1 region, which cascaded into failures across a huge range of dependent applications.
Common themes

  • US-East-1 is the recurring trouble spot.
  • DNS, resolution, and control-plane dependencies keep showing up as the weak link.
  • Many customers assume "the cloud is automatically resilient," but the control plane still matters.
    What to take away:
    Don't assume that high availability within one region equals overall resilience.
    Practice failing over: run exercises in which your primary region is assumed down and you serve from another.
    Prepare your communications and monitoring for these rare but high-impact events.

10) Compliance, SLAs, and Stakeholder Reporting

Even if your leadership team doesn't care about "DNS resolution issues in DynamoDB," they will certainly ask, "How long were we down?"

SLA/SLO context.

  • SLA: what your provider (or you) contractually promises. Refer to the published SLAs for the AWS services you use.
  • SLO: what your team aims to deliver internally (e.g., available 99.9% of the time).

When an outage hits, compare the actual downtime against your SLO, and measure your performance against your RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
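
As a quick worked example, here is how the downtime budgets quoted later in the FAQ fall out of the SLO percentage; the targets listed are just common examples.

    # Quick arithmetic: allowed downtime per year for common SLO targets.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for slo in (0.999, 0.9995, 0.9999):
        allowed = MINUTES_PER_YEAR * (1 - slo)
        print(f"SLO {slo:.2%}: about {allowed:.0f} minutes of downtime per year")

    # Approximate output: 99.90% -> 526 min, 99.95% -> 263 min, 99.99% -> 53 min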

What to report to stakeholders

  • Executive summary: "On October 20, 2025, the US-East-1 region experienced a DNS failure affecting the AWS DynamoDB endpoint, causing approximately 15 hours of service degradation."
  • Customer communication: clear and non-technical, e.g., "Some users experienced login issues and delays from 7:10 p.m. to 10:00 p.m.; the issue has been resolved."
  • Audit/compliance notes: record the timeline, the root cause, and all communications, and keep them filed so they are ready for any legal review or internal audit.

11) Tooling & Templates (Downloadables)

You know it's helpful when someone hands you a ready-made template. Here are the resources worth creating.

  • Incident Communication Template: Pre-written email + social media post + status page message.
  • Runbook Template: a one-page flowchart plus checklist (incident command → triage → recovery → post-incident review) with space to record times and actions.
  • Workload Tiering Worksheet: your workloads mapped to Tier 1/2/3, plus your current architecture map.
    Link these as downloadable PDFs or Google Docs so readers get something they can use immediately.

12) FAQ

Here are some common questions readers ask.
Q1: Is AWS down right now?
Check the AWS Health Dashboard (the global view plus your account view). If there is a major outage, AWS will show an "Open" issue for the affected regions.
Q2: Which AWS region has had the most outages?
Historically, US-East-1 (Northern Virginia) has seen the major events, mainly because so many default services and large workloads depend on it.
Q3: What caused the most recent big AWS outage?
The October 2025 incident was caused by a DNS resolution issue for the DynamoDB API endpoint in US-East-1, which then cascaded into a series of downstream failures.
Q4: How do I get alerts when AWS has issues?
Subscribe to AWS Health events (via EventBridge and SNS) and set up your own monitoring within your account. Combine this with synthetic probes so you learn about problems faster than waiting for users to complain.
Q5: If a region fails, does that mean my application will fail?
Not necessarily. If you have designed for failure with multi-region deployment, replicated data, DNS failover (Route 53), and stateless compute, you can ride it out. If your design depends on a single region, then yes, you will feel it.
Q6: What is the difference between an Availability Zone (AZ) failure and a regional failure?
An AZ failure is confined to one zone and can be handled with a multi-AZ design. A regional failure (or a control-plane failure within a single region) requires a multi-region approach.
Q7: How much downtime is acceptable for a cloud-based application?
It depends on your business. An e-commerce application will likely target 99.99% (about 52 minutes of downtime per year), while internal tools might accept 99.9% (about 8.8 hours per year). Map your business risk to your design.
Q8: Does AWS publish a full root-cause analysis and post-incident report?
Yes. AWS maintains a post-event summary (PES) repository for major incidents.

13) Conclusion & Next Steps

In short, the October 2025 AWS outage reminded us of something we already know but sometimes forget: if you treat your architecture as infallible, it will eventually let you down.
But you can prepare.

  • Monitor intelligently and configure alerts.
  • Create multi-region redundancy for your critical workloads.
  • Practice incident response in advance (the checklist above).
  • Communicate clearly when something goes wrong.
    If you pick one thing to do today, make it this: check whether you depend on a single region (specifically, US-East-1). If you do, no matter how elegant your design is, start planning your fallback now.
    Feel free to share your own outage stories in the comments (we've all been there), or tell us how you design for resilience. If you found this useful, please share it and subscribe for more cloud resilience guides. And above all, keep your application online.
