Facebook outage — reason, effect and how it could have been avoided
At 5:39 PM CET on October 4, 2021, for some people the Internet ceased to exist. Fortunately, the Internet itself didn't vanish, but three major applications (Facebook, Instagram and WhatsApp) disappeared and were not accessible. People all over the world tried to reach them without luck. Messenger also stopped responding. Most information on what was going on was only available on Twitter; even Facebook employees were posting messages there, since the company's status page was affected by the outage as well. This was so unthinkable that ISPs (Internet Service Providers) saw a spike in tickets from users claiming their Internet connection was broken. Some of the more technically savvy were rebooting their routers, updating their mobile operating systems or reinstalling the affected applications. After a few minutes it was all clear: the unthinkable had happened.
Internet time machine
The outage lasted about 6 hours, and in the case of WhatsApp even longer. But was it really something that had never happened before? Not really. Facebook has had other major outages in its history: one in 2008 and another in 2019 that lasted over 14 hours. Those outages affected many people, but how many users were affected in 2021? The statistics are astonishing: there are over 2.8 billion Facebook, 2 billion WhatsApp, 1.3 billion Instagram and 1.3 billion Messenger users. All of these people were unable to connect to these services.
What really happened — non-technical
To comprehend what happened, it is important to understand how communication between a user and a service provider (in this case Facebook) works. Don't worry, I won't go into detail on the OSI network model, a framework describing the conceptual functions of networking systems, or on the communication protocols used, but I'll try to explain it in a less technical way.
Let's put it this way: you want to access some resources on the Internet and have to get from point A to point B. To get there you need a roadmap. What really happened is that the roadmap burned to ashes. Even worse, there was no one left who could give you directions. Because of that, no one could reach the services mentioned above.
What really happened — technical
From the information that was published on Twitter and Discord, we know that right before the outage, configuration changes were deployed to the routers responsible for distributing and announcing the IP address space of Facebook's services. To be more precise, the changes were made to the configuration of BGP, the routing protocol responsible for distributing those routes. From what we know right now, and from Facebook's PR statements, no third parties were involved in this incident. The changes withdrew the routes to all Facebook services, including their DNS servers.
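The cascade can be illustrated with a deliberately simplified sketch. The model below is a toy routing table, not real BGP: the Facebook prefixes shown (including the 185.89.218.0/23 range that hosts some of Facebook's authoritative name servers) are real announced ranges, but the "next hop" names and the withdrawal logic are invented for illustration. The point is that once the prefixes are withdrawn, a longest-prefix-match lookup finds no route at all, so even Facebook's own DNS servers become unreachable.

```python
import ipaddress

# Toy routing table: prefix -> next hop. The Facebook (AS32934) prefixes
# are real announced ranges; the next-hop names are made up.
routes = {
    ipaddress.ip_network("129.134.0.0/17"): "peer-1",
    ipaddress.ip_network("185.89.218.0/23"): "peer-1",  # hosts FB name servers
    ipaddress.ip_network("8.8.8.0/24"): "peer-2",       # unrelated prefix
}

def lookup(ip: str):
    """Longest-prefix match, as a router would do."""
    addr = ipaddress.ip_address(ip)
    matches = [p for p in routes if addr in p]
    if not matches:
        return None  # no route: packets to this address are dropped
    return routes[max(matches, key=lambda p: p.prefixlen)]

# Before the faulty change: the DNS server's prefix is reachable.
assert lookup("185.89.218.12") == "peer-1"

# The bad update withdraws all Facebook prefixes...
fb_prefixes = [ipaddress.ip_network("129.134.0.0/17"),
               ipaddress.ip_network("185.89.218.0/23")]
for prefix in fb_prefixes:
    routes.pop(prefix, None)

# ...and now even Facebook's own DNS servers have no route, so no
# client anywhere can resolve facebook.com, while the rest of the
# Internet keeps working.
assert lookup("185.89.218.12") is None
assert lookup("8.8.8.8") == "peer-2"
```

In the real event the withdrawal propagated worldwide within minutes, which is why resolvers everywhere started returning errors for facebook.com at essentially the same moment.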
To make things even more complicated, it appears that all of their configuration tools also used this address space, so Facebook employees could not connect to those tools and reverse the changes. Things got even more bizarre, since their build and data center control systems used this address space too. At one point, as reported by anonymous Facebook employees, people could not even get into the buildings or data centers because of it.
There were also some logistical problems. There were technicians on site who could theoretically revert the changes, but they did not have sufficient privileges to connect to the devices, and the people who could authorise them were not on site. There was one more slight problem: the third group, the people who actually understood the problem and had the knowledge to implement the changes, were unavailable too. This combination caused a significant delay in restoring the services.
This downtime caused several other problems. We shouldn't forget that some people make their living selling products and services through these platforms. One might question whether depending on a single platform is a sound business model, but those are the facts. We also saw, at one point, a roughly 5% drop in Facebook's stock price, which translates to about 50 billion dollars of market value. At one point, some domain-trading sites even listed the Facebook.com domain for sale. This is because they only check DNS records, and since it was impossible to reach Facebook's authoritative DNS servers, they wrongly assumed that the domain was up for sale.
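The flaw in those domain-trading sites can be sketched in a few lines. Everything below is hypothetical (the function names and the resolver are mine, not any real site's code), but it captures the faulty reasoning: treating "the name does not resolve" as "the name is unregistered", when registration status actually lives in the registry (WHOIS/RDAP), not in DNS.

```python
def naive_is_for_sale(domain, resolve):
    """Flawed availability check: any DNS failure is read as 'unregistered'."""
    try:
        resolve(domain)
        return False  # resolved, so assume it is taken
    except OSError:
        return True   # lookup failed, so assume it is available (wrong!)

def outage_resolver(domain):
    # During the outage, every query for facebook.com failed because the
    # authoritative name servers were unreachable, even though the
    # domain registration itself was still perfectly valid.
    raise OSError("authoritative servers unreachable")

def normal_resolver(domain):
    # A stand-in for a healthy lookup returning some address.
    return "157.240.1.35"

assert naive_is_for_sale("facebook.com", outage_resolver) is True   # misclassified
assert naive_is_for_sale("facebook.com", normal_resolver) is False  # as expected
```

A correct check would consult the registry's WHOIS or RDAP data, which was unaffected by the outage and would have shown the domain as registered throughout.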
We must always remember that there is no 100% bulletproof system that can't be broken. Even the most unthinkable can happen. I would guess that Facebook's managers gave such a scenario less than a 0.001% chance of occurring. But it happened. In my long IT experience I have seen things happen that definitely should not have. Still, there are some things that the Facebook IT department should have done better. First of all, the configuration tools should be on a separate network, independent of the general services; if that had been the case, the downtime would have been much shorter. Secondly, they should go back to their drawing boards and rethink their disaster recovery (DR) procedures. They definitely have something to think about. Thirdly, putting the building authorisation system on the same network certainly didn't help.
There are also lessons for the general public. Even people who used different applications (Messenger and WhatsApp) to keep in touch with each other were affected by the same outage, because those applications share the same company and infrastructure. We should really think about truly independent communication platforms, or at the very least know each other's phone numbers.
Author: Jacek Bochenek, Cloud and Security Team Leader — CISSP, CISM, CCSP