Online Data Availability

By Scott Hamilton

Senior Expert Emerging Technologies

I was shocked this week on Tuesday by the major outage at Facebook. This is the third major outage of a mainstream internet service provider in 2021. Ironically, the first two were related to the Internet's domain name service (DNS) protocols, whose failure caused name lookups to fail. There are two core suppliers of the service globally, and both experienced massive outages earlier this year, taking down Amazon and Google for short periods of time. There was a lot of speculation that a similar issue caused the failure at Facebook, but the providers were quick to rebut the accusations.

The DNS service provider for Facebook noticed, a few seconds before the outage, that the downstream DNS servers for Facebook appeared to vanish from the network. After the ninety-second timeout, the global DNS systems noticed the disappearance of the domain from the Internet. This triggered automatic systems designed to auction expired domains to place facebook.com up for sale. It did not take long for the error to be corrected on the auction site, because everyone knew Facebook had not let its domain expire, but it was still interesting to see facebook.com for sale, with an estimated value of $1.2 billion. To be clear, that was just for the name registration, not for the company.

The DNS providers knew they were in the clear, but it led them to investigate what exactly went down at Facebook. For all practical purposes, facebook.com ceased to exist. It was not just down, but completely gone from name service records worldwide. I used to compare DNS to a phone book; you look up the website name in the directory and it gives you the number to reach the site. In reality it is more like an address book; you look up the name and it tells you where to find the site on the Internet. What happened to Facebook was like what happened to the lost city of Atlantis. The DNS initially gave directions to facebook.com, but when the network traffic reached the address, the city was gone.
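For readers who like to see the address-book idea in code, here is a minimal sketch in Python that separates the two questions: does the name still have an address, and is anything actually there when you arrive? The function name check_site and its three result labels are my own inventions for illustration; the lookups themselves use Python's standard socket module. It is a rough diagnostic, not a description of how Facebook or the DNS providers monitor their systems.

```python
import socket


def check_site(hostname, port=443, timeout=3.0,
               resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Classify a site as 'no-dns', 'unreachable', or 'ok'.

    The resolve and connect parameters are hypothetical hooks so the
    function can be exercised without touching the network.
    """
    # Step 1: the address book. Ask DNS where the name lives.
    try:
        records = resolve(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "no-dns"  # the name is missing from the address book

    # Step 2: travel to each address and knock on the door.
    for _family, _type, _proto, _canonname, sockaddr in records:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            continue  # this address did not answer, try the next one
        conn.close()
        return "ok"

    # Directions existed, but the city was gone.
    return "unreachable"
```

Calling check_site("facebook.com") on a normal day should report "ok". A result of "unreachable" corresponds to the Atlantis stage of the outage, and "no-dns" corresponds to the later stage, when the name itself had dropped out of the directory.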

Eventually, the network routing protocols gave up on finding ways to get to facebook.com and marked the network as deleted; this only took a matter of minutes. Upstream DNS servers began deleting facebook.com from their list of known addresses, and as more and more systems deleted the address, the Internet itself began to slow down because of the massive number of failed requests for facebook.com. Amazingly, the problem did not stop there.

Facebook provides authentication services for thousands of other web services, including its own internal network accounts. These Facebook accounts controlled not only access to internal Facebook systems, but also the physical access controls for its datacenters and offices. The company was in the strange situation of not being able to get staff into the facilities to work on the repair. It seems Facebook made a critical mistake in its security and availability procedures: it did not have a backup plan for access to critical systems.

That brings me to the reason for telling their story. If you are running a business, regardless of its size, it is crucial that you understand the importance of writing and maintaining a disaster recovery plan with off-site resources. For example, if you depend on a particular piece of software for your business, you should have a minimum of two computers with the software installed and a backup either in the cloud or on an external hard drive stored somewhere away from the business. This applies even if you keep everything in the cloud. Based on experiences from the last year, it is not good practice to depend on a single provider for any resource. If you use Facebook to communicate with your customers, you might want to add another social media platform to the mix. There were tens of thousands of small businesses that lost all customer communications when Facebook went dark. If you use Google Drive, you might also want a copy of your documents on Microsoft OneDrive or Dropbox. Remember, just as in physical life, you should always have a spare for important things like toilet paper.
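The "always have a spare" rule can be sketched as a small Python script that copies one important file to more than one place and then confirms that each copy matches the original. The function names back_up and sha256_of, and the folder paths in the usage note, are hypothetical. I am assuming each destination is an ordinary folder, such as a mounted external drive or a folder that a cloud service like OneDrive or Dropbox keeps synchronized. Treat it as a starting point under those assumptions, not a complete disaster recovery plan.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 fingerprint of a file, read in small chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy source into every destination folder and verify each copy.

    Returns a dict mapping each destination to True (verified copy)
    or False (the copy failed or does not match the original).
    """
    source = Path(source)
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder = Path(folder)
        try:
            folder.mkdir(parents=True, exist_ok=True)
            copy = folder / source.name
            shutil.copy2(source, copy)
            results[str(folder)] = sha256_of(copy) == expected
        except OSError:
            # One failed destination should not stop the others.
            results[str(folder)] = False
    return results
```

A call such as back_up("customers.xlsx", ["/mnt/external/backups", "/home/me/Dropbox/backups"]) returns one True or False per folder, so a failure at one provider is visible while the other copy still succeeds. The file name and paths here are made up for the example.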

Until next week, stay safe and learn something new.

Scott Hamilton is a Senior Expert in Emerging Technologies at ATOS and can be reached with questions and comments via email at shamilton@techshepherd.org or through his website at https://www.techshepherd.org. You can also follow his channel on Rumble at https://rumble.com/c/c-1141721.