Platform Outage
Incident Report for Autobooks
Postmortem

Summary: 

On April 1st, 2021, from 17:26 ET to 18:45 ET, an outage at our primary cloud service provider caused service failures on the Autobooks platform for our customers.

Details: 

Autobooks relies on several premier cloud partners, including Microsoft Azure, to provide our SaaS platform. Microsoft Azure securely hosts the Autobooks platform in redundant datacenters across separate geographic regions so that the platform remains available even if an outage affects a significant portion of the internet.

At 17:26 ET on April 1st, our automated monitoring began alerting the team to potential issues affecting the Autobooks platform. Our team quickly determined that the issue was related to failing DNS (Domain Name System) queries, which are used both by external parties to connect to our system and by internal components within the platform to connect to each other.

At 17:35 ET, our team determined that the issue was likely an Azure-wide DNS outage, as multiple public components related to Microsoft Azure were also unavailable, including the Azure public status page that would generally be used to confirm this theory. Shortly thereafter, Microsoft’s Twitter account confirmed that this was a global DNS outage and that they were working to remediate the issue. From this point forward, our team monitored the situation and prepared to test our services as soon as Microsoft resolved the issue on their end.

At 18:45 ET, Microsoft stated that the issue was resolved, and our team verified there were no further issues affecting the Autobooks platform. 

Microsoft communicated the following as the official root cause: 

“Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches. As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.” 

Since that time, Microsoft has worked to repair the code defect that prevented these requests from being cached and has improved the automatic detection and mitigation technology that is used to minimize the impact of these sorts of issues.
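
Microsoft's root cause statement notes that frequent client retries added workload to an already overloaded DNS service. As an illustration only, the sketch below shows one common client-side way to reduce that kind of retry amplification: capped exponential backoff with jitter. It is not Microsoft's fix or part of the Autobooks platform; the function names, parameters, and default values are hypothetical and chosen purely for this example.

```python
import random
import socket
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=None):
    """Yield one wait time (in seconds) per retry: capped exponential backoff with full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_retries):
        # The upper bound doubles with each retry but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: choose uniformly in [0, ceiling] so clients do not retry in lockstep.
        yield rng.uniform(0.0, ceiling)


def resolve_with_backoff(hostname, resolve=None, max_retries=4, sleep=time.sleep, rng=None):
    """Resolve a hostname, waiting a growing, randomized delay after each failed attempt.

    Hypothetical helper for illustration; `resolve` and `sleep` are injectable for testing.
    """
    if resolve is None:
        # Default lookup via the standard library; failures raise socket.gaierror (an OSError).
        resolve = lambda name: socket.getaddrinfo(name, None)
    delays = backoff_delays(max_retries, rng=rng)
    while True:
        try:
            return resolve(hostname)
        except OSError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; surface the last lookup error
            sleep(delay)
```

With these assumed defaults, a client makes at most five lookups in total, and the waits between them are spread out randomly rather than arriving as an immediate burst of retries.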

We apologize for any inconvenience this may have caused. Your business is very important to us, and we strive to provide you with exceptional service.

Posted Apr 20, 2021 - 14:02 EDT

Resolved
This incident has now been resolved. Our teams have completed monitoring and have confirmed that all systems are fully restored.
Posted Apr 01, 2021 - 19:45 EDT
Monitoring
We have confirmed that Microsoft has resolved the issue and all Autobooks systems are 100% operational at this time.

We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.
Posted Apr 01, 2021 - 18:47 EDT
Identified
We have identified an issue with our global cloud provider Microsoft Azure and are working with Microsoft support to resolve the issue.
Posted Apr 01, 2021 - 17:54 EDT
Investigating
We are currently investigating an issue which affects the Autobooks platform. We are working to resolve the issue as quickly as possible, and we apologize for any disruption this may cause.
Posted Apr 01, 2021 - 17:49 EDT
This incident affected: Autobooks Platform (API, BackOffice, Web App).