Commentary
IT infrastructure is the invisible engine that drives our world. From apps on phones to the systems powering hospitals, banks and airlines, we rely on technology to keep businesses running smoothly.
But systems sometimes fail. The fragility of even the most advanced technology was exposed in 2024. High-profile outages disrupted industries, left companies scrambling to recover, frustrated consumers worldwide and cost billions of dollars.
Businesses are increasingly exposed to IT risks, which makes assessing their technology dependencies essential. Companies must identify system vulnerabilities and implement diversified IT strategies, including insurance, to mitigate the impact of future disruptions.
Let’s dive into the top 10 IT outages of 2024 to explore what went wrong, the industries affected and actionable steps to better prepare for the next inevitable disruption.
#10. Atlassian, February. Atlassian services, including its popular Jira project management platform, went down due to “database system failover configuration misalignment”. The outage disrupted global project management and developers’ workflows for thousands of businesses, particularly in the tech sector.
#9. AWS Kinesis, July. For almost seven hours, APIs connecting AWS Kinesis, Amazon’s data streaming pipeline service, to customers in its heavily used us-east-1 region slowed down and returned more errors. The fault happened when the Kinesis “cell management system” was “flustered by big shards”. Numerous customers were hit, with widespread impacts particularly on logistics and financial services. The event highlights the need for diversified and resilient IT architecture to reduce reliance on single-point services.
#8. Microsoft Azure AI, July. Microsoft Azure’s OpenAI services experienced a global outage across multiple regions for more than 20 hours, disrupting work at businesses reliant on AI-driven tools. The disruption was caused by a routine operation that inadvertently disabled critical bits of code.
#7. ChatGPT, December. OpenAI’s ChatGPT chatbot was inaccessible for more than nine hours, with users worldwide seeing the message “internal server error”. OpenAI stated that the outage was “caused by an upstream provider”. The outage also affected Sora, OpenAI’s text-to-video generator. Full functionality was restored when the supplier issue was fixed.
#6. Salesforce, October. Salesforce, the world’s leading CRM provider, suffered a nine-hour outage caused by an API gateway malfunction. The event had an outsized effect on businesses worldwide, curtailing their ability to manage customer relationships and sales pipelines.
#5. CDK Global, June. A ransomware group targeted CDK Global, an auto-sector tech provider. The attack disrupted more than 10,000 car dealerships across North America, costing more than $1bn. Attributed to “endpoint security vulnerabilities”, the attack exposed the possible linkage between malicious activity and disruption of the digital supply chain.
#4. Epic EHR, July. Epic EHR went down for around 24 hours, disrupting healthcare services. The outage, which was linked to the CrowdStrike failure, delayed critical hospital treatments and forced a return to manual processes. The event underscores the cascading effects of IT failures in interconnected systems, particularly in life-critical sectors like healthcare, and highlights the need for robust failover mechanisms.
#3. Facebook, March. Facebook users around the world, including an astonishing 11 million who reported the outage, were disconnected for two hours because of a server configuration error. Small businesses that rely solely on the platform for customer interactions were hit, highlighting the vulnerability of single-platform ecosystems.
#2. Microsoft 365, November. A backend issue took down Microsoft 365 for more than 24 hours. The root cause was a change in the system that caused an influx of retry requests routed through servers. The outage impacted users worldwide, with thousands reporting difficulties accessing critical services including Outlook, Teams and Exchange Online. The outage underlines the risk of overreliance on a single centralised system.
#1. CrowdStrike, July. The CrowdStrike outage was one of the largest tech outage events ever. The global disruption occurred when a routine security update was distributed for Falcon, CrowdStrike’s flagship cloud-based cybersecurity product. The software was auto-updated with an incorrect configuration that triggered system failures. Windows machines around the world crashed to the “Blue Screen of Death” (BSOD). Parametrix estimates the outage cost Fortune 500 companies alone approximately $5.4bn, primarily affecting healthcare and banking sectors. A detailed analysis of the CrowdStrike outage is available separately.
Digital infrastructure is vulnerable. Heavy reliance on cloud services is risky. From social media to healthcare systems, the big disruptions in 2024 show that even the largest organisations are not immune to downtime.
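The Microsoft 365 entry above describes an influx of retry requests. One widely used client-side mitigation for that kind of retry surge is exponential backoff with random jitter, which spreads retries out instead of letting every client retry at once. The sketch below is a minimal illustration only, not a description of how Microsoft resolved its incident; the function names, the default parameters and the choice of `ConnectionError` as the retryable failure are all assumptions made for the example.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=None):
    """Return one 'full jitter' delay (in seconds) per retry attempt.

    The ceiling doubles on each attempt up to `cap`; the actual delay is
    drawn uniformly between 0 and that ceiling.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


def call_with_backoff(operation, max_retries=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying on ConnectionError with jittered backoff.

    `operation` is a hypothetical zero-argument callable standing in for
    any request to a remote service. After `max_retries` failed attempts,
    one final attempt is made and its exception is allowed to propagate.
    """
    for delay in backoff_delays(max_retries, base, cap, rng):
        try:
            return operation()
        except ConnectionError:
            sleep(delay)  # wait before the next attempt
    return operation()
```

The design choice worth noting is the jitter: without it, clients that failed at the same moment would all retry at the same moment, recreating the surge that caused the problem.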
Prepare with confidence
• Build resilience: Minimise downtime losses by ensuring continuity through failover systems, reliable backups and rapid response plans.
• Plan proactively: Address potential weaknesses before disruptions occur through regular assessment and strengthening of IT infrastructure.
• Diversify: Avoid overreliance on a single provider by adopting multi-cloud or hybrid strategies to spread risk.
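For technical teams, the failover and diversification points above can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: `providers` is a hypothetical ordered list of (name, handler) pairs, where each handler is whatever function calls that provider, and any exception is treated as a provider failure. A production setup would also need health checks, timeouts and data synchronisation between providers.

```python
def fetch_with_failover(providers, request):
    """Try each provider in priority order and return the first success.

    `providers` is an ordered list of (name, handler) pairs; `handler` is
    a hypothetical callable that sends `request` to that provider.
    Returns a (provider_name, result) tuple.
    """
    errors = {}
    for name, handler in providers:
        try:
            return name, handler(request)
        except Exception as exc:  # assumption: any error means "provider down"
            errors[name] = exc
    # Every provider failed: surface which ones were tried.
    raise RuntimeError(f"all providers failed: {sorted(errors)}")
```

The order of the list encodes the priority: the primary provider comes first, and the secondary is used only when the primary raises an error.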
Growing reliance on interconnected IT services increases the risk of lost revenue and opportunity due to systems disruptions. The insurance industry has new solutions that empower tech teams to address these risks, and to mitigate the financial impact of downtime. Parametric insurance solutions help businesses recover quickly, reduce operational impact and maintain resilience.
What next?
• Understand and map IT exposures: Identify vulnerabilities within your IT infrastructure and address them proactively to reduce risks.
• Acknowledge interdependencies: Recognise how interconnected systems can amplify the impact of a failure and plan accordingly to minimise disruptions.
• Invest in resilience: Strengthen your IT ecosystem with redundancy measures such as failover systems, robust backups and thorough testing.
• Leverage insurance: Mitigate financial risks and ensure faster recovery with Parametrix’s tailored insurance solutions designed for IT outages.
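Mapping exposures and interdependencies, the first two steps above, can start with something as simple as a dependency map. The sketch below assumes a hand-maintained dictionary in which each service lists what it depends on; every service and provider name in it is invented for illustration. Given a failed component, it walks the map to find everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map: service -> the things it depends on.
DEPENDS_ON = {
    "online-booking": ["payments-api", "crm"],
    "payments-api": ["cloud-region-a"],
    "crm": ["saas-crm-vendor"],
    "clinic-scheduling": ["ehr"],
    "ehr": ["endpoint-security-agent", "cloud-region-a"],
}


def impacted_by(failed, depends_on):
    """Return a sorted list of services that depend on `failed`,
    directly or through a chain of other services."""
    # Invert the map: component -> services that depend on it directly.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)
```

With this invented map, `impacted_by("cloud-region-a", DEPENDS_ON)` returns `['clinic-scheduling', 'ehr', 'online-booking', 'payments-api']`: a single regional failure reaches four services, two of which never reference the region directly.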
Parametrix is the leading provider of parametric insurance solutions for system interruptions. We make technology financially secure by eliminating the financial risk of systems failures. That empowers enterprises to inspire confidence and foster innovation.
Sharon Haran is head of Parametrix Analytics