Facebook Outage – Insights from an IT Infrastructure Engineer

It wasn’t just Facebook that went down: everything Facebook owns went dark, including Instagram, WhatsApp, and Oculus. The outage was first assumed to be a problem with the Domain Name System (DNS). DNS converts domain names, such as google.com, into their associated IP addresses, which are what computers actually use to communicate. It exists so that we can reach websites by a memorable name, while the IP address tells the network where to deliver the traffic. However, it was later discovered that the situation was much worse. It is now suspected that the outage was caused by a misconfiguration of the Border Gateway Protocol (BGP), which is why users were unable to reach Facebook’s network from the outside.
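To make the DNS step concrete, here is a minimal Python sketch of a name lookup using the standard library's `socket.getaddrinfo`. The helper name `resolve` is my own, purely for illustration, and the result depends on whichever resolver your machine is configured to use.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses for a hostname.

    Returns an empty list if the name cannot be resolved, which is
    roughly what a client sees when a lookup fails.
    """
    try:
        # Each result is (family, type, proto, canonname, sockaddr);
        # the first element of sockaddr is the IP address as a string.
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({result[4][0] for result in results})
```

Calling `resolve("facebook.com")` on a working connection should return one or more IP addresses. An empty list means the lookup itself failed, which is why DNS was the first suspect.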

BGP is the routing protocol of the Internet and enables communication between networks: it allows one network (in this case Facebook’s) to announce its presence to other networks. When BGP was misconfigured, there was no longer an entry for Facebook’s network in other networks’ routing tables, so they couldn’t find it and it became unavailable to users. At the moment, it is unknown whether this outage was caused accidentally or maliciously, but more facts should come to light over the next few days and weeks.
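The effect of a missing route can be sketched with a toy routing table in Python. This is a deliberately simplified model, not how BGP is actually implemented: real BGP involves sessions between routers, path attributes, and policy. The class name, the `example-network` label, and the prefix `192.0.2.0/24` (a range reserved for documentation) are all assumptions for illustration and are not Facebook's real values.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one network has learned from its neighbours."""

    def __init__(self):
        # Maps an announced prefix to the name of the network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` says it can be reached at `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [network for network in self.routes if ip in network]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda network: network.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "example-network")
print(table.lookup("192.0.2.10"))  # route known, traffic can be delivered

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))  # route gone, the address cannot be found
```

Once the prefix is withdrawn, the servers behind it are still running, but nobody outside knows how to reach them, which matches what users experienced.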

One reason the outage lasted as long as it did is that the BGP misconfiguration also affected Facebook’s physical door access systems, which shut down, so engineers couldn’t get into the buildings or secure rooms to start fixing the issue straight away. The nature of the problem meant that Facebook would have needed network engineers to physically access its BGP routers and, because of the pandemic, some of the data centres quite possibly didn’t have an engineer on site who could have started work on the problem immediately.

 
