By Scott Hamilton
On Sunday, Aug. 30, 2020, the biggest internet outage in history occurred. Global internet traffic dropped by roughly 3.5% as the result of an incorrect configuration at CenturyLink. This outage not only brought down every CenturyLink internet customer, but also had a ripple effect that took down services at Cloudflare, Reddit, Hulu, Amazon Web Services, Blizzard, Steam, Xbox Live, Discord and dozens more.
The technical cause involved both firewall rules and the Border Gateway Protocol (BGP) routing mechanism, which let the outage propagate outward and affect every service connected to CenturyLink. A drop of 3.5% in overall internet traffic was observed during the incident, making it the biggest outage ever recorded. The underlying cause was a misconfigured flowspec rule, a type of firewall rule that is distributed to routers through BGP, introduced by a single engineer.
I find it scary that a single person at a single Internet Service Provider (ISP) can cause a widespread internet outage. So this week I want to talk a little about BGP, which is the underlying mechanism that propagated the error.
BGP is at the core of the internet and how it routes network traffic. When you visit a website, you send traffic, in units called packets, across the internet through a series of routers. Those routers tell your computer where on the internet the website resides, and tell the website where your computer resides, so the two can communicate with each other. The set of routers on the internet is constantly changing as ISPs reconfigure systems, routers fail and new routers are brought online. BGP is the mechanism used to communicate these changes to all the other routers on the internet.
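To make the idea of routers passing route information along a little more concrete, here is a minimal sketch in Python. It is a toy model, not real BGP: the four-router topology, the router names and the propagate function are all made up for illustration, and the only rule each router follows is "keep the shortest path you have heard and tell your neighbors about it."

```python
from collections import deque

# Hypothetical topology: each key is a router, each value is its neighbors.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}


def propagate(origin, links):
    """Spread an announcement for a destination hosted at `origin`.

    Returns a table mapping each router to the path it would use to reach
    `origin`. Each router keeps the shortest path it hears and passes it on.
    """
    best = {origin: [origin]}
    queue = deque([origin])
    while queue:
        router = queue.popleft()
        for neighbor in links[router]:
            candidate = [neighbor] + best[router]
            # Accept only the first path heard or a strictly shorter one.
            if neighbor not in best or len(candidate) < len(best[neighbor]):
                best[neighbor] = candidate
                queue.append(neighbor)
    return best


routes = propagate("D", LINKS)
for router in sorted(routes):
    print(router, "->", " ".join(routes[router]))
```

In this toy network, every router ends up with a path to D without anyone configuring it by hand. That convenience is also the weakness: each router simply believes what its neighbors tell it.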
A report released in February of 2011 by Max Schuchard, a computer science graduate student, and a group of his classmates was titled, “Losing control of the Internet: using the data plane to attack the control plane.” In this paper, Schuchard describes major security issues with BGP that would allow an individual with the right knowledge and a set of computers to crash the underlying infrastructure of the internet to the point that it becomes unusable.
In the paper, Schuchard claimed that he could “crash” the internet to the point that engineers would have to manually power off all the control routers on the internet and do a hard reset in order to recover connectivity. He needed a mere 250,000 PCs to pull off the attack. This seems like a lot, until you learn that the Mariposa botnet, which was used to commit cybercrimes and kept growing until its discovery in 2008, consisted of 12.7 million compromised PCs spread around the globe. Nothing has been done to prevent this potential attack in the nine years since.
Schuchard’s attack was designed to create misconfigured routes that would in turn create routing loops in internet traffic. BGP is designed to detect routing loops and remove the longest route creating the loop. However, if enough loops are created simultaneously, the router’s control plane gets overloaded and becomes unresponsive. This causes BGP to see the route as unavailable and activate a new route. The snowball effect is that, in a very short period of time, all the core routers of the internet are saturated with new route configurations until they crash.
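Here is a rough sketch of those two ideas, loop detection and overload, again as a toy model in Python. The loop check reflects how BGP really works: a router rejects any announcement whose path already contains its own network number. Everything else is an assumption for illustration, including the function names, the network numbers and the idea of a fixed capacity of updates a router can handle per round.

```python
def accept_route(local_as, as_path):
    """Return True if the announced path does not already pass through us."""
    return local_as not in as_path


def process_updates(local_as, updates, capacity):
    """Handle at most `capacity` updates in one round (a made-up limit).

    Returns the loop-free paths that were accepted and the number of
    updates left waiting because the router ran out of capacity.
    """
    handled = updates[:capacity]
    accepted = [path for path in handled if accept_route(local_as, path)]
    backlog = len(updates) - len(handled)
    return accepted, backlog


# Hypothetical router using private network number 65001.
updates = [
    [65002, 65003],         # loop-free, accepted
    [65002, 65001, 65003],  # contains 65001, so it would loop back to us
    [65004],                # loop-free, but arrives after capacity is used up
]
accepted, backlog = process_updates(65001, updates, capacity=2)
print("accepted:", accepted)
print("still waiting:", backlog)
```

With only three updates the backlog is harmless. The attack described in the paper works by making that backlog grow faster than the router can ever clear it.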
In short, this is what happened with the CenturyLink misconfiguration. CenturyLink engineers accidentally created a giant routing loop that started reconfiguring routers at other providers that were directly linked to CenturyLink and trusted it for routing tables. In a proactive move, CenturyLink immediately contacted all of its Tier 1 peers and connected ISPs and told them to cut their trust relationships and links until the issue was resolved. Without this action, the problem would have propagated in a matter of hours to impact the entire internet, resulting in a global outage.
What appears to have happened is that an incorrect flowspec announcement caused all the routers in CenturyLink’s network to enter a giant loop, announcing a brand new set of routes and then immediately dropping them. This propagated to the Tier 1 peers and corrupted neighboring routing tables. The outage took seven hours to fix and left every CenturyLink customer without service for the entire period. The decision to de-peer from other Tier 1 providers shows just how dangerous BGP is to the overall stability of the internet. CenturyLink knew the risks involved in allowing its error to propagate and took the correct action. The mistake continued, but in a contained manner, requiring a hard reset of CenturyLink’s entire internal infrastructure rather than a hard reset of the entire internet.
I was surprised to find that this instability has been widely known for nearly ten years and has not been exploited to destroy the internet. I was also surprised to learn that this is not the first incident in which an incorrect BGP configuration at a Tier 1 provider caused a widespread outage. Smaller versions of this same issue occur multiple times a year, taking down local regions of the internet. When your internet fails, it is highly likely that the failure is related to BGP. Until next week, stay safe and learn something new.