Customer Business Impact:
From approximately 2019-08-31 12:50:00 UTC until 2019-08-31 15:17:00 UTC, all MindTouch sites experienced a service disruption. The spell check service remained unavailable until 2019-08-31 19:25 UTC.
Problem Summary:
A failure of an Availability Zone (a single AWS data center location) caused sites to return a maintenance error page.
Most AWS services used by MindTouch can be configured to work across multiple Availability Zones to increase reliability. One caching service, which is used by our middleware components, is limited to a single Availability Zone. Due to this limitation, the Availability Zone failure caused our middleware components to fail. As a result, customer sites could not load and returned a maintenance page instead.
The spell check service is also configured to run in a single Availability Zone, so it was similarly affected.
Recovery:
To restore the middleware components and the spell check service, new servers were launched in a different Availability Zone to replace the affected ones.
Root Cause Summary:
The middleware relies on a caching service that is limited by AWS to a single Availability Zone and does not support automatic failover. Once the caching service became unavailable due to the Availability Zone failure, the MindTouch middleware components were unable to store and access session data, and became unresponsive.
The spell check service is configured to run in a single Availability Zone due to the licensing model provided by the service. As a result, the service failed when the Availability Zone failed.
Corrective Actions:
MindTouch engineering is investigating options to increase reliability by changing how we rely on services that do not support multiple Availability Zones: