The recent Amazon Web Services (AWS) outage has brought up a challenging question for businesses around the world: how much can you really trust “the cloud”? Analysts and industry experts say that while cloud platforms are essential for modern digital operations, many organizations have unrealistic expectations about uptime and reliability. They warn that assuming the cloud automatically ensures resilience is a dangerous misconception that can expose businesses during disruptions.
The Wake-Up Call: When AWS Went Dark
The June AWS outage disrupted key services like Disney+, Slack, Zoom, Robinhood, and Venmo. This event highlighted the fragility of cloud dependencies. Sam Barker, vice president of telecoms market research at Juniper Research, believes businesses still overestimate their cloud providers’ reliability. In a blog post, Barker pointed out that many organizations rely too heavily on a single cloud vendor, often assuming that redundancy exists at every level. “Despite the disruption, Amazon’s stock remained relatively stable, suggesting continued investor confidence,” he wrote. “However, this incident could increase demand for multicloud orchestration, edge computing, and services that improve overall resilience.” Barker expects this outage to mark a significant change. “We expect this event to push businesses to seek new models and solutions that focus on uptime and service continuity,” he added.
Cloud Disruptions Are Normal, Not a Sign of Instability
Though outages make the news, analysts say they don’t indicate that the cloud is “broken.” Lydia Leong, Vice President and Analyst at Gartner, emphasized that cloud disruptions are unavoidable but don’t prove unreliability. “Cloud disruptions happen, but they are not evidence that the cloud is inherently unreliable,” she wrote in a Gartner article. Leong advised against extreme actions like moving back to on-premises servers or switching to smaller “sovereign” clouds. “These moves often introduce new risks and can slow recovery when things go wrong,” she explained. While taking a multicloud approach sounds logical, Leong warned it might backfire: “Pursuing multicloud resilience can cost more than it saves, adding technical complexity without eliminating systemic risk.” Her main takeaway is straightforward: all major providers will experience outages—from AWS to Microsoft Azure to Google Cloud—and the critical factor is how organizations prepare for and recover from these disruptions.
Why Even the Strongest Systems Fail
Shawn Michels, VP of Product Management at Akamai Technologies, noted that the global digital landscape has repeatedly shown its vulnerability, from undersea cable cuts to major cloud platform failures. “A lot of organizations still assume that because something runs in the cloud, it’s automatically resilient,” Michels said. “That’s not the case. Even the largest clouds don’t guarantee perfect uptime.” He emphasized that the key difference is not whether a failure occurs, but the speed at which a system can recover: “You can’t stop every component from breaking, but you can design systems that recover quickly so customers barely notice.” True resilience, Michels said, involves culture and preparation. “The most resilient organizations use phased rollouts, automated rollback systems, and continuous monitoring. They prepare for failure, respond under stress, and learn from every incident,” he explained. “You can’t eliminate all risk, but you can build teams that expect it and adjust quickly.”
Reliability Isn’t Equal Across Providers
Even among the so-called “hyperscalers,” reliability can be inconsistent. Rich Mogull, Chief Analyst at the Cloud Security Alliance, highlighted that not all clouds are the same, and businesses often overlook these differences. “AWS rarely experiences cross-region failures, and when they do, they tend to be limited,” he said. “In contrast, Azure is more likely to have global failures because of its infrastructure design.” These distinctions are crucial for business continuity planning. An organization spreading workloads across multiple AWS regions will likely do better during a localized incident than one relying solely on a single Azure data center. “Enterprises often ignore these differences,” Mogull noted. “Reliability isn’t uniform, even among top providers.”
The Myth of Cloud Immunity
Ensar Seker, Chief Information Security Officer at SOCRadar, echoed this view, arguing that many businesses have been lulled into a false sense of security. “Enterprises often think global cloud infrastructure is immune to downtime because of redundancy,” he said. “In reality, redundancy reduces risk but doesn’t make it disappear.” Seker explained that cloud ecosystems depend on a web of interdependent services, including identity systems, DNS services, load balancers, and external APIs. A small problem in one of them can ripple through the system, breaking functionality even when the core compute infrastructure is technically “up.” “Cloud outages are unavoidable, not theoretical,” Seker warned. “The real question is how ready your organization is to handle them.” He referenced the AWS outage in June 2023, which impacted everything from hospital systems to banking portals, not because AWS failed entirely but because businesses hadn’t prepared for partial degradation. “The day we have clouds with 100% uptime is the day all problems on earth are solved,” joked John Strand of Strand Consulting in Denmark. “As data centers grow in size and complexity, the chance of something going wrong increases. Old problems will be resolved, but new ones will appear.”
Misreading the Meaning of Cloud Reliability
While some experts argue that businesses overestimate cloud reliability, others think the issue is more about misunderstanding than overconfidence. Sergiy Balynsky, Vice President of Engineering at Spin.AI, shared another perspective. “The cloud isn’t a magic solution. It’s a shared responsibility model,” he explained. Balynsky thinks that outages like AWS’s recent one expose a fundamental misunderstanding. “Cloud providers offer resilient building blocks—regions, availability zones, failover systems—but it’s up to businesses to design for continuity,” he said. “Relying on a single region or skipping redundancy isn’t a failure on the provider’s part; it’s an architectural mistake.” This aligns with modern Business Continuity Planning (BCP) and Site Reliability Engineering (SRE) practices—areas focused on preparing for failure and ensuring uptime under pressure. “BCP and SRE teams spread risk, monitor for failures, and keep critical systems running. That’s where true reliability comes from,” Balynsky added.
Building Resilience Is an Enterprise Choice
Cloud reliability isn’t a fixed quality; it’s something businesses must create and maintain themselves. David Stone, Director in the Office of the CISO at Google Cloud, said customers have plenty of tools to build resilience, but many fail to use them. “Customers can design for resiliency by using multiple data centers in different regions,” Stone explained. “They can deploy across zones and even extend applications across several cloud providers for redundancy.” Similarly, Srini Srinivasan, Founder and CTO of Aerospike, stated that cloud providers already offer the tools for exceptional uptime, provided businesses take advantage of them. “There’s no reason an enterprise can’t reach four nines (99.99%) availability using existing features,” he said. “The mistake is thinking the provider will handle everything for them.” The real issue, Srinivasan emphasized, is mindset. Many companies see the cloud as a finished product rather than a framework for reliability that requires careful design and oversight.
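To put the "four nines" figure in concrete terms, an availability target can be converted into a downtime budget. The short Python sketch below is only an illustration. The function name `downtime_budget_minutes` and the 365-day year are assumptions made for this example, not anything quoted by Srinivasan or defined by a cloud provider.

```python
# Illustrative arithmetic only: convert an availability target into
# the maximum downtime it allows over a period.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in the period for a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    targets = [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]
    for label, target in targets:
        budget = downtime_budget_minutes(target)
        print(f"{label} ({target:.2%}): {budget:.1f} minutes of downtime per year")
```

Under these assumptions, four nines leaves roughly 52.6 minutes of downtime per year. A single multi-hour incident would exhaust that budget many times over, so the target is only realistic if recovery is fast and workloads are designed to fail over.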
When Scale Creates New Risks
Even large-scale cloud infrastructure has its drawbacks. Aykut Duman, Partner in the Digital and Analytics Practice at Kearney, explained that during the AWS outage, some organizations experienced total downtime despite running workloads across multiple availability zones. “A DNS resolution failure impacted DynamoDB and EC2, showing that reliability relies as much on workload design as on provider infrastructure,” he said. Duman believes many businesses incorrectly assume that provider-level redundancy guarantees uptime. “Resilience must be deliberately built at the application level,” he explained. “Organizations overestimate cloud reliability because they equate scale with safety.” He added that as systems become more interconnected and distributed, cascading failures can arise from seemingly isolated problems. “Reliability is high, but it’s not absolute,” Duman concluded. “The AWS outage demonstrated that being cloud-native doesn’t automatically mean being resilient.”
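Duman's point about multiple availability zones failing behind one DNS problem can be illustrated with simple availability arithmetic. The sketch below is a simplified model, not a description of how AWS works. The availability figures are hypothetical, the helper names `parallel` and `serial` are invented for this example, and the redundancy formula assumes replicas fail independently, which is exactly the assumption a shared dependency breaks.

```python
from math import prod

# Simplified availability model with hypothetical numbers (not AWS figures).


def parallel(availabilities):
    """Redundant replicas: the group is up if at least one replica is up.

    Assumes replicas fail independently of one another.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def serial(availabilities):
    """Hard dependency chain: the service is up only if every link is up."""
    return prod(availabilities)


# Hypothetical per-zone and shared-dependency availabilities.
zone = 0.999
shared_dns = 0.9995

compute = parallel([zone, zone, zone])     # three zones look extremely reliable
end_to_end = serial([shared_dns, compute])  # but every request still needs DNS

if __name__ == "__main__":
    print(f"three redundant zones: {compute:.9f}")
    print(f"shared DNS dependency: {shared_dns:.9f}")
    print(f"end-to-end service:    {end_to_end:.9f}")
```

In this model the end-to-end figure can never exceed the availability of the shared dependency, however many zones are added. That is why resilience has to be designed at the application level, for example by identifying and reducing single points of dependency, and not assumed from provider-level redundancy.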
The Path Forward: Rethinking Reliability
So where does that leave businesses in 2025? Experts agree on a few key shifts in thinking:
– Outages are inevitable. Planning for them is part of a responsible cloud strategy.
– Resilience is shared. Cloud providers supply the tools; businesses must use them.
– Diversity matters. Spreading workloads across regions—and sometimes clouds—reduces risk.
– Culture is key. Resilience relies as much on people and processes as on architecture.
The future of enterprise reliability isn’t about removing risk; it’s about preparing for graceful failure. As more organizations move further into the cloud, the winners will be those that expect disruption, design for recovery, and never confuse availability with invincibility.