By Scott Hamilton
Senior Expert Emerging Technologies
Early in the morning local time on Tuesday, June 8, 2021, a major network outage hit multiple large Internet companies. The impact was global, taking down high-traffic sites including Reddit, Amazon, CNN, PayPal, Spotify, Al Jazeera Media Network, the New York Times and several government websites. The outage occurred during a routing configuration change at Fastly, Inc., one of the largest content delivery network (CDN) providers.
Fastly did not comment on what went wrong during a routine configuration change. The outage lasted anywhere from a few minutes on some websites to over an hour on others. Fastly runs what is called an edge network, which routes website traffic to the nearest host. For example, if you type www.amazon.com into your web browser, the first thing your computer does is a Domain Name System (DNS) lookup on the name, which returns an IP address. It is like looking up a person in the phone book: you don’t call a person by their name, but by their number. Web addresses work the same way. In the case of www.amazon.com, the computer gets back the address “220.127.116.11.” This address tells your computer where to request the website.
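To make the phone-book comparison concrete, here is a minimal Python sketch. The names and addresses in the toy table are hypothetical (the addresses come from the ranges reserved for documentation), and the toy lookup only imitates what a real DNS resolver does. The second function hands the name to the operating system’s resolver through the standard `socket.gethostbyname` call, which needs network access to work.

```python
import socket

# A toy "phone book": hypothetical host names mapped to addresses taken from
# the documentation-only ranges (192.0.2.0/24, 198.51.100.0/24). Not real servers.
PHONE_BOOK = {
    "www.example.com": "192.0.2.10",
    "shop.example.net": "198.51.100.7",
}


def toy_lookup(name: str) -> str:
    """Turn a name into a number, the way a DNS lookup does."""
    try:
        # DNS names are case-insensitive, so normalize before looking up.
        return PHONE_BOOK[name.lower()]
    except KeyError:
        # A real resolver would report that the name does not exist.
        raise LookupError(f"no record for {name}") from None


def real_lookup(name: str) -> str:
    """Ask the operating system's resolver for an IPv4 address (needs network)."""
    return socket.gethostbyname(name)


if __name__ == "__main__":
    print(toy_lookup("www.example.com"))
```

Running the script prints the toy address for www.example.com. Calling `real_lookup("www.amazon.com")` on a connected machine returns whatever address your resolver hands back at that moment, which can differ by location and over time.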
Fastly comes into play at this stage of the request. When your computer asks that address for the webpage, Fastly looks at your computer’s IP address and location. From this information it determines that you should connect to the copy of Amazon.com located in Kansas City, Mo., and routes your request to the hidden IP address of the Kansas City Amazon web server. This is the piece that broke Tuesday morning, so requests came back almost instantly with an error stating that the web server was unavailable.
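The “nearest copy” step described above can be sketched in a few lines of Python. This is a simplified illustration, not Fastly’s actual logic: real CDNs typically combine techniques such as DNS-based steering and anycast routing. The edge list, the approximate city coordinates, and the `routing_ok` flag are all hypothetical. The sketch picks the closest edge location by great-circle (haversine) distance and, when the routing table is broken, answers immediately with HTTP 503, the standard “Service Unavailable” status.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    name: str
    lat: float
    lon: float


# Hypothetical edge locations with approximate city coordinates.
EDGES = [
    Edge("Kansas City", 39.10, -94.58),
    Edge("Chicago", 41.88, -87.63),
    Edge("New York", 40.71, -74.01),
    Edge("Los Angeles", 34.05, -118.24),
]

EARTH_RADIUS_KM = 6371.0


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_edge(client_lat: float, client_lon: float, edges=EDGES) -> Edge:
    """Pick the edge location closest to the client's estimated position."""
    return min(edges, key=lambda e: distance_km(client_lat, client_lon, e.lat, e.lon))


def handle_request(client_lat: float, client_lon: float, routing_ok: bool = True):
    """Return (HTTP status, edge name). A broken routing table yields 503."""
    if not routing_ok:
        # No usable route: fail fast instead of forwarding the request.
        return 503, None
    edge = nearest_edge(client_lat, client_lon)
    return 200, edge.name


if __name__ == "__main__":
    # A client near Springfield, Mo. (hypothetical position).
    print(handle_request(37.21, -93.29))
    print(handle_request(37.21, -93.29, routing_ok=False))
```

With these made-up locations, a client in southern Missouri lands on Kansas City, while the same request with a broken routing table fails at once, which matches the quick error responses readers saw during the outage.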
I have a theory about the underlying configuration change that broke Fastly, based on some research into where each of the impacted sites is housed. As it turns out, all the systems that went down are housed in Amazon datacenters. If you recall, last week I wrote about Amazon bringing its new Sidewalk IoT network online, and I was quite upset that Amazon was enabling it on all of its devices without the owner’s consent. The go-live date for Sidewalk was Tuesday, June 8, 2021. I recently read an article about edge routing and IoT networks. To configure the mesh network Sidewalk needs to work effectively, thousands of new routes had to be added to Amazon’s edge network. This is a major network configuration change, and I am certain both Fastly and Amazon knew the network would go down during the update. I believe they expected a rolling outage, where each region would go offline for a few minutes, but something did not go as planned and they lost all edge routes. Instead of a rolling outage, they wound up with a rolling repair: as routes automatically updated, systems came back online one after another over about an hour and a half.
You probably won’t hear about this anywhere else, but the scenario fits like a glove, and perhaps sometime in the near future the truth will come out.
Until next week, stay safe and learn something new.
Scott Hamilton is a Senior Expert in Emerging Technologies at ATOS and can be reached with questions and comments via email to firstname.lastname@example.org or through his website at https://www.techshepherd.org.