Behind the scenes of an outage that took down a considerable chunk of the internet, and its aftermath.
Needless to say, our society has become increasingly reliant on the internet, not just for work and play, but for life in general. This dependence leaves us highly exposed whenever an outage strikes and a sizable part of the internet goes dark.
Sounds far-fetched? Yet something very similar happened recently when Fastly, one of the largest content delivery network and cloud computing providers, suffered an outage that knocked out some of the world’s most frequently visited websites.
The chain of events started shortly before 11 am UK time on Tuesday 8th June, and within moments the impact of this major internet outage was felt across the globe: users could no longer reach popular news and other high-profile websites, and instead saw “Error 503 Service Unavailable” on their screens. No one knew what was going on or why their favourite websites were suddenly unreachable.
It took Fastly more than 45 minutes to identify and rectify the issue. Once the company announced to the global community of internet users that a fix was being implemented, the affected sites started coming back online, giving their desperate users a sigh of relief. So what exactly went wrong and caused such a major incident affecting so many websites at once? To answer that, let’s first look at what a CDN, or content delivery network, like Fastly is, how it works, and why CDNs are so crucial to keeping the internet running as expected.
A CDN, or Content Delivery Network, like Fastly delivers content from locations as close as possible to the internet users trying to access it. Without a CDN, it is difficult for websites to deliver a uniform user experience at the expected speed to all their users, especially those in distant locations.
For instance, imagine a website with all its servers located in London. People who access this website from near London will get content delivered to their screens far more quickly than users located far away – in New York, say. The latter group will find that the same page takes noticeably longer to load. This is a big problem, especially for websites that offer video streaming.
To resolve this, CDNs operate a physical infrastructure of edge servers that sit at the edge of networks, between the cloud and users. This is where data computation ideally needs to happen – at a server location close to the users rather than in a remote data centre – so that content can be delivered in the shortest time possible. In this way, CDNs pick up the most requested pieces of content and cache them on edge servers in the regions where they are accessed most frequently.
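The caching logic at the heart of an edge server can be sketched very roughly as follows. This is a toy cache-aside model, not Fastly’s actual software; names like `fetch_from_origin` are illustrative assumptions:

```python
import time

class EdgeCache:
    """A toy model of a CDN edge server's cache-aside logic."""

    def __init__(self, fetch_from_origin, ttl_seconds=300):
        self.fetch_from_origin = fetch_from_origin  # slow round-trip to the distant origin
        self.ttl = ttl_seconds                      # how long a cached copy stays fresh
        self.store = {}                             # path -> (content, expiry time)

    def get(self, path):
        entry = self.store.get(path)
        if entry and entry[1] > time.time():
            return entry[0]  # cache hit: served locally from the edge, fast
        # cache miss (or expired): fetch from the origin and keep a copy
        content = self.fetch_from_origin(path)
        self.store[path] = (content, time.time() + self.ttl)
        return content

# Usage: the first request pays the origin round-trip; repeats are served locally.
origin_calls = []
def origin(path):
    origin_calls.append(path)
    return f"<html>page at {path}</html>"

edge = EdgeCache(origin)
edge.get("/news")  # miss: fetched from the origin
edge.get("/news")  # hit: the origin is not contacted again
```

Real edge servers layer much more on top of this (cache invalidation, request collapsing, per-customer configuration), but the hit/miss decision above is the core idea.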
While the hashtag “#cyberattack” was trending on Twitter for a while, the reality was quite different. According to Fastly’s Head of Engineering and Infrastructure, the company had deployed a code change in mid-May which unfortunately contained a bug that stayed dormant until Tuesday morning. When one of their customers updated their account settings, it triggered the flaw, eventually causing 85% of Fastly’s network to go down and return errors.
Configuring servers or updating the underlying network is critical, high-risk work, especially when you operate one of the biggest content delivery networks in the world. Even the slightest mistake in the process can trickle down and potentially affect most, if not all, servers at once – and that’s exactly what happened in this case. Nor is this the first (or last) time an error has caused such a massive internet outage: Cloudflare, another key Content Delivery Network (CDN), had an outage in 2020 that was also traced to a configuration error.
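One generic way operators limit this kind of blast radius is a canary or staged rollout: push a change to a small fraction of servers first, and stop if health checks fail. The sketch below illustrates the pattern only; it is an assumption for illustration, not Fastly’s or Cloudflare’s actual deployment process:

```python
def staged_rollout(servers, apply_config, health_check, canary_fraction=0.05):
    """Push a config change to a small canary group first; abort if it misbehaves.
    A generic safety pattern, not any specific CDN's real tooling."""
    n_canary = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:n_canary], servers[n_canary:]
    for server in canary:
        apply_config(server)
    if not all(health_check(server) for server in canary):
        return False  # the bad change never reaches the wider fleet
    for server in rest:
        apply_config(server)
    return True

# Usage: a change that fails its health checks touches only the canary servers.
applied = []
servers = [f"server-{i}" for i in range(100)]
ok = staged_rollout(servers, applied.append, lambda s: False)
```

With a 5% canary over 100 servers, a broken change here reaches 5 machines and stops, instead of cascading across the entire fleet.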
From the likes of the BBC and The Guardian, to Amazon and even the UK government’s websites, many high-profile sites went down during this outage. It also impacted Reddit, Pinterest, PayPal, Stack Overflow, GitHub, Twitch, Shopify, eBay, the New York Times, the Financial Times and many other top news organisations, among countless others. That gives a rough idea of the incident’s huge scope, and why it made the news.
The increasing demand for snappy, fast-loading websites and video streaming with near-zero buffering comes at a cost, and that cost is more unexpected outages like this one in times to come. Why? Because the internet as we know it has grown to rely heavily on third-party cloud infrastructure providers, and when one of them runs into trouble, as happened in this latest incident, it can take a major part of the internet down with it. The more sites hosted with a single CDN, the more websites one outage can knock offline at once.
At the same time, the barrier to entry for newer organisations in the cloud infrastructure market keeps rising, which means the responsibility of keeping the internet accessible and running smoothly for all will be concentrated in the hands of a few leading providers. This is why experts believe the next such outage could be even worse, and that the websites affected will incur heavy losses, just as Amazon did this time.
There are ways to avoid this in future. Webmasters could proactively host mirrors of their websites in more than one location, depending on where they have the most users. Another option is to spread content hosting across multiple service providers, so that if one fails or goes down, another keeps things running without a noticeable glitch. Having said that, not many companies are willing to invest in these options.
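The multi-provider idea can be sketched in a few lines. In practice this failover usually happens at the DNS or load-balancer level; the minimal client-side version below, with hypothetical stub providers, just shows the principle:

```python
def fetch_with_failover(path, providers):
    """providers: an ordered list of callables, each fetching `path` from one CDN.
    Returns the first successful response; raises only if every provider fails."""
    last_error = None
    for fetch in providers:
        try:
            return fetch(path)
        except Exception as err:  # e.g. a 503 from that provider's network
            last_error = err      # this provider is down; try the next one
    raise RuntimeError(f"all providers failed: {last_error}")

# Usage with stub providers: the first "CDN" is down, the second serves the page.
def primary_cdn_stub(path):
    raise ConnectionError("Error 503 Service Unavailable")

def backup_cdn_stub(path):
    return f"content for {path}"

page = fetch_with_failover("/home", [primary_cdn_stub, backup_cdn_stub])
```

With a setup like this, an outage at one provider degrades into a slightly slower request rather than a blank error page – which is exactly the resilience most affected sites lacked on 8th June.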
Having grown accustomed to having the latest high-definition content at our fingertips at all times, we have also lost much of the patience we had in the early days of the internet. Within just an hour, an outage at a content delivery network that almost no one outside the industry had ever heard of showed how centralised, interdependent, and fragile the global internet infrastructure has become. It also showed how vulnerable we, as internet users, and our favourite websites have become to such incidents.