An official Root Cause Analysis (RCA) from Microsoft indicates the company received an “anomalous surge” of DNS queries. The sheer volume was enough to bring down the majority of services Microsoft offers on the Azure Cloud.
Many users could not reach most Microsoft services because of “a code defect that allowed the Azure DNS service to become overwhelmed.” The service simply could not respond to DNS queries, a symptom commonly associated with a DNS DDoS attack.
Microsoft hints DNS DDoS may have caused services to become unreachable this week:
The majority of Microsoft services became inaccessible for a few hours earlier this week. The global outage prevented users from accessing or signing into numerous services, including Xbox Live, Microsoft Office, SharePoint Online, Microsoft Intune, Dynamics 365, Microsoft Teams, Skype, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, OneNote, Microsoft Managed Desktop, and Microsoft Stream.
The outage was so severe that even the Azure status page was inaccessible. Microsoft resolved the outage at approximately 6:30 PM EST, although quite a few services took longer to come fully back online.
At the time, Microsoft had attributed the outage to a DNS issue without elaborating. The company has now published an RCA for this week’s outage, and it points to something more serious.
The RCA indicates that the Azure DNS service was overloaded with DNS queries, leaving the name-resolution layer that sits between users and services unable to do its job. Azure DNS is a global network of redundant name servers that Microsoft says provides high availability and fast DNS responses.
Microsoft hasn’t revealed who was responsible for the attack. Its size and scope must have been considerable, however, because even a concerted DDoS attack would not normally be expected to bring down a cloud service as large as Microsoft Azure.
What steps has Microsoft taken to prevent the reoccurrence of the Azure DNS DDoS?
The incident revealed a flaw in how Microsoft implemented its DNS Edge caches. That flaw may have allowed the DDoS (Distributed Denial-of-Service) attack to succeed.
“Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches,” Microsoft explained in the root cause analysis for the outage.
“As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.”
Alternative summary: Microsoft's global service outage this week caused by an Azure DDoS DNS cascading failure. https://t.co/EjHkp2k5mM
— Kenn White (@kennwhite) April 3, 2021
In effect, Microsoft depended heavily on Azure DNS to resolve most of its own domain names. While Azure DNS remained unavailable, so did the services that relied on it.
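The feedback loop Microsoft describes, in which less effective edge caches push more queries to the DNS servers and failed queries return as retries, can be illustrated with a toy model. This is only a rough sketch: the function name, the query rates, the cache hit percentages, and the server capacity are all invented for illustration and do not come from the RCA.

```python
def simulate_origin_load(fresh_qps, cache_hit_percent, origin_capacity,
                         retries_per_failure=1, rounds=5):
    """Toy model: return the query load offered to the DNS servers in each round.

    All parameters are hypothetical. Queries answered by the edge cache never
    reach the servers; queries the servers cannot answer are retried by
    clients in the next round.
    """
    failed = 0
    offered_history = []
    for _ in range(rounds):
        # Cache misses from new queries reach the servers directly.
        misses = fresh_qps * (100 - cache_hit_percent) // 100
        # Retries of previously failed queries are added on top.
        offered = misses + failed * retries_per_failure
        served = min(offered, origin_capacity)
        failed = offered - served
        offered_history.append(offered)
    return offered_history


# Same client demand and server capacity; only the cache efficiency differs.
healthy = simulate_origin_load(100_000, cache_hit_percent=90, origin_capacity=20_000)
degraded = simulate_origin_load(100_000, cache_hit_percent=50, origin_capacity=20_000)

print(healthy)   # [10000, 10000, 10000, 10000, 10000]
print(degraded)  # [50000, 80000, 110000, 140000, 170000]
```

With these made-up numbers, the healthy cache keeps the servers well under capacity, while the degraded cache pushes them over it and the retries make the offered load climb every round. Because the retries look like legitimate traffic, a filter aimed at volumetric spikes would have no obvious reason to drop them.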
Microsoft has reportedly stated that it is repairing the code defect in Azure DNS. The fix apparently involves improving the service’s ability to handle a large number of simultaneous requests.
Microsoft has additionally indicated that it plans to improve its monitoring and mitigation of anomalous traffic. However, some experts claim the company must deploy additional measures.