Platform - Outage
Incident Report for Autobooks
Postmortem

Summary:

On March 28th, 2021, from 2:38ET to 13:30ET, Autobooks experienced an outage with our primary cloud service provider that resulted in error pages and/or request timeouts for all customers interacting with the Autobooks platform.

Details:

Autobooks relies on many premier cloud partners to provide our SaaS platform, including Microsoft Azure. Microsoft Azure securely hosts the Autobooks platform in redundant datacenters across separate geographic regions so that the platform remains available even if an outage affects a significant portion of the internet.

At 2:05ET on March 28th, our team began scheduled maintenance as part of routinely testing the BC/DR (business continuity / disaster recovery) capabilities of the Autobooks cloud platform. Many of these capabilities are exercised as part of our standard bi-weekly software deployment cycle, but we also routinely exercise the full set of capabilities in a controlled environment as part of standard operating procedure.

At 2:08ET, our team verified that the cross-region database failover being tested as part of this exercise had finished as planned. The team then spent the next thirty minutes evaluating logs and monitoring services to ensure everything worked as expected. Once this was verified, our team initiated a failover to return the platform to its pre-test configuration.

At 2:38ET, our team determined that the cross-region database failover back to its original configuration was stuck in an undesirable state, leaving both the primary and secondary instances read-only.  This read-only scenario is initiated to ensure the integrity of the underlying data, but ultimately leaves the Autobooks platform in a similarly read-only or otherwise unusable state.  At this point our team engaged Microsoft support directly for assistance in remediating the issue.

From 2:38ET to 13:30ET, our team worked closely with the Microsoft support team to identify the root cause of this unexpected failure state. The cause was ultimately related to underlying security certificates that Microsoft uses as part of geographic failover to ensure the privacy and integrity of the underlying data. These certificates were in an inconsistent state due to unrelated maintenance that Microsoft was performing as part of their own standard security certificate rotation procedures. Unfortunately, this root cause was specific to the underlying Microsoft Azure infrastructure in a way that was invisible to the Autobooks team and was not obvious to the Microsoft support team at that time, which led to the longer-than-acceptable issue duration.

Since that time, we’ve worked with Microsoft to better understand what caused the issue, and what we can jointly do to ensure it doesn’t happen again.  Microsoft is working to improve this certificate rotation process to prevent similar issues going forward.  The failover to the secondary region and subsequent failover back in short order is what triggered the issue on the Microsoft side, so in the interim we’ll avoid that scenario as part of our testing.
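
As an illustration only, the interim safeguard described above could be expressed as a simple guard that refuses a failback requested too soon after a failover. This sketch is not Autobooks' actual tooling: the function name `failback_allowed` and the 24-hour minimum interval are assumptions, since this report does not specify how long "short order" is.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the report does not define "short order",
# so 24 hours is an assumed placeholder value.
MIN_FAILBACK_INTERVAL = timedelta(hours=24)


def failback_allowed(last_failover_at: datetime,
                     now: datetime,
                     min_interval: timedelta = MIN_FAILBACK_INTERVAL) -> bool:
    """Return True only if at least min_interval has passed since the failover."""
    return now - last_failover_at >= min_interval


# Times taken from the timeline above: the failover was verified at 2:08ET
# and the failback was initiated roughly thirty minutes later.
failover_verified = datetime(2021, 3, 28, 2, 8)
failback_requested = datetime(2021, 3, 28, 2, 38)

print(failback_allowed(failover_verified, failback_requested))  # False
```

With the assumed threshold, the thirty-minute gap from this incident would be rejected, while a failback requested a full day later would be allowed.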

We apologize for any inconvenience this may have caused. Your business is very important to us, and we strive to provide you with exceptional service.

Posted Apr 06, 2021 - 16:05 EDT

Resolved
This incident has now been resolved. Our teams have completed monitoring and we have confirmation from our Cloud Provider, Microsoft Azure, that all systems have been fully restored.
Posted Mar 28, 2021 - 14:55 EDT
Update
We are continuing to monitor for any further issues.
Posted Mar 28, 2021 - 13:58 EDT
Monitoring
The incident has been resolved with our cloud provider, Microsoft. We will continue to actively monitor the incident.
Posted Mar 28, 2021 - 13:40 EDT
Identified
During routine failover testing, we identified an issue and have engaged our cloud provider, Microsoft. We have an open severity A request with our provider and are actively working to resolve it.
Posted Mar 28, 2021 - 12:41 EDT
Update
We are continuing to investigate this issue and will provide further updates as they become available.
Posted Mar 28, 2021 - 10:30 EDT
Update
We are continuing to investigate this issue and will provide updates as they become available.
Posted Mar 28, 2021 - 07:44 EDT
Update
We are continuing to investigate this issue.
Posted Mar 28, 2021 - 05:31 EDT
Update
We are continuing to investigate this issue.
Posted Mar 28, 2021 - 03:49 EDT
Investigating
We are currently investigating an issue which affects the Autobooks platform. We are working to resolve the issue as quickly as possible, and we apologize for any disruption this may cause.
Posted Mar 28, 2021 - 03:13 EDT
This incident affected: Autobooks Platform (Web App).