Code defect in Microsoft Azure DNS servers leads to global outage

Microsoft has revealed that a recent worldwide outage was caused by a code defect that allowed the Azure DNS service to become overwhelmed and stop responding to DNS queries.

The global outage began at approximately 5:21 on Tuesday afternoon. As a result, many users had difficulty signing in to numerous services, including:

  • Xbox Live,
  • Microsoft Office,
  • SharePoint Online,
  • Microsoft Intune,
  • Dynamics 365,
  • Microsoft Teams,
  • Skype,
  • Exchange Online,
  • OneDrive,
  • Yammer,
  • Power BI,
  • Power Apps,
  • OneNote,
  • Microsoft Managed Desktop,
  • and Microsoft Stream.

The outage was so widespread within Microsoft’s infrastructure that even the Azure status page, which provides outage information to users, became inaccessible.

The outage has since been resolved. Microsoft fixed the issue the same day at approximately 6:30 PM EST, although some services took a little longer to return to normal.

When asked for more information about the outage, Microsoft said only that it was caused by a DNS issue.

Yesterday, however, the company published a root cause analysis (RCA), which explains that this week’s outage occurred because the Azure DNS service became overloaded.

According to Microsoft, Azure DNS, which is responsible for providing highly available and fast DNS services, started receiving an anomalous surge of DNS queries from all over the world.

While it is not clear what caused the anomalous surge, it may have been a DDoS attack targeting certain domains.

Microsoft said that, because of a code defect, the DNS service, which typically handles a large number of requests, was unable to work properly under this surge.

“Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches.”

“As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service,” Microsoft explained in the published RCA.
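The feedback loop described in the RCA can be illustrated with a toy model. The sketch below is not Microsoft's implementation; the function name, the parameters, and all of the numbers are assumptions chosen only to show how a drop in cache efficiency combined with client retries can keep pushing load upward.

```python
def simulate_overload(base_qps, capacity_qps, cache_hit_ratio,
                      retries_per_failure, rounds):
    """Toy model of a DNS retry storm (all names and values are hypothetical).

    Each round, queries that miss the edge cache reach the backend.
    Backend queries above capacity fail, and every failed query is
    retried `retries_per_failure` times in the next round.
    Returns the backend load observed in each round.
    """
    history = []
    retry_qps = 0.0
    for _ in range(rounds):
        offered = base_qps + retry_qps               # fresh queries plus retries
        backend = offered * (1.0 - cache_hit_ratio)  # cache misses hit the backend
        failed = max(0.0, backend - capacity_qps)    # load beyond capacity fails
        retry_qps = failed * retries_per_failure     # failures return as retries
        history.append(round(backend, 1))
    return history


if __name__ == "__main__":
    # Healthy caches absorb most of the surge, so backend load stays flat.
    print(simulate_overload(1000, 300, 0.9, 2, 5))
    # Less efficient caches push the backend past capacity, and retries
    # then add more traffic in every subsequent round.
    print(simulate_overload(1000, 300, 0.5, 2, 5))
```

In this model the retries are treated like any other query, which mirrors the RCA's point that retry traffic looked legitimate and so was not dropped by the spike mitigation systems.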

Almost all Microsoft domains are resolved through Azure DNS. Once the DNS service became overloaded, it was no longer possible to resolve hostnames on these domains or to access the associated services.

For example, xboxlive.com uses the following Azure DNS name servers to resolve hostnames on the domain:

NS1-205.AZURE-DNS.COM

NS2-205.AZURE-DNS.NET

NS3-205.AZURE-DNS.ORG

NS4-205.AZURE-DNS.INFO

So, when the service was unavailable, users were no longer able to log in to Xbox Live.
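Why four name servers did not provide a fallback can be sketched in a few lines. The example below is a simplified illustration rather than a real resolver: the function `first_responding_server` and the `unavailable` set are hypothetical, and no network queries are made. It only shows that having several name servers helps when one of them fails, but not when all of them depend on the same overloaded service.

```python
# Name servers listed above for xboxlive.com.
XBOXLIVE_NAME_SERVERS = [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
]


def first_responding_server(name_servers, unavailable):
    """Return the first name server that is not in `unavailable`, or None.

    `unavailable` is a hypothetical set of server names that are not
    answering. Names are compared case-insensitively, as DNS names are.
    """
    down = {name.lower() for name in unavailable}
    for server in name_servers:
        if server.lower() not in down:
            return server
    return None  # no server answered, so the hostname cannot be resolved


if __name__ == "__main__":
    # One server down: another server can still answer.
    print(first_responding_server(XBOXLIVE_NAME_SERVERS,
                                  {"ns1-205.azure-dns.com"}))
    # The whole Azure DNS service down: resolution fails.
    print(first_responding_server(XBOXLIVE_NAME_SERVERS,
                                  set(XBOXLIVE_NAME_SERVERS)))
```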

Microsoft is currently repairing the code defect so that Azure DNS can handle large volumes of requests.