MindTouch Service Degradation: Sites unavailable
Incident Report for MindTouch
Postmortem

Customer Business Impact:

From approximately 2019-08-31 12:50 UTC until 2019-08-31 15:17 UTC, all MindTouch sites experienced a service disruption. The spell check service remained unavailable until 2019-08-31 19:25 UTC.

Problem Summary:

A failure of an Availability Zone (a single AWS data center location) caused sites to return a maintenance error page.

Most AWS services used by MindTouch can be configured to work across multiple Availability Zones to increase reliability. One caching service, which is used by our middleware components, is limited to a single Availability Zone. Because of this limitation, the Availability Zone failure caused our middleware components to fail. As a result, customer sites did not load and returned a maintenance page instead.

The spell check service is also configured to run in a single Availability Zone, so it was similarly affected.

Recovery:

To address the failing middleware components and spell check service, new servers were launched in a different Availability Zone to replace the faulty ones.

Root Cause Summary:

The middleware relies on a caching service that is limited by AWS to a single Availability Zone and does not support automatic failover. Once the caching service became unavailable due to the Availability Zone failure, the MindTouch middleware components were unable to store and access session data, and became unresponsive.

The spell check service is configured to run in a single Availability Zone because of the licensing model provided by the service. As a result, the service failed when that Availability Zone failed.

Corrective Actions:

MindTouch engineering is investigating options to increase reliability by changing how we rely on services that do not support multiple Availability Zones:

  • Change the way middleware components store and access session data, so that a caching service outage does not cause a system-wide failure.
  • Explore an alternative to the current licensing model of the spell check service, so it can be available across multiple Availability Zones.
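
The first corrective action above can be pictured with a small sketch. This is not MindTouch's implementation; it is one possible way to keep a session store usable when its cache is unreachable, written under stated assumptions. Every name here (`FlakyCache`, `CacheUnavailable`, `ResilientSessionStore`, `retry_after`) is hypothetical, and the in-memory fallback stands in for whatever secondary store a real system would choose.

```python
import time


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache cannot be reached."""


class FlakyCache:
    """Stand-in for a remote cache client; `up` simulates the Availability Zone state."""

    def __init__(self):
        self.up = True
        self.data = {}

    def get(self, key):
        if not self.up:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.up:
            raise CacheUnavailable(key)
        self.data[key] = value


class ResilientSessionStore:
    """Session store that degrades to process-local memory when the cache is down."""

    def __init__(self, cache, retry_after=30.0, clock=time.monotonic):
        self.cache = cache
        self.local = {}                 # fallback copy, visible to this process only
        self.retry_after = retry_after  # seconds to skip the cache after a failure
        self.clock = clock
        self.skip_until = 0.0           # cache calls are skipped until this time

    def _cache_allowed(self):
        return self.clock() >= self.skip_until

    def _mark_down(self):
        # Stop calling the cache for a while so requests do not pile up on it.
        self.skip_until = self.clock() + self.retry_after

    def save(self, session_id, session):
        self.local[session_id] = session  # always keep a local copy
        if self._cache_allowed():
            try:
                self.cache.set(session_id, session)
            except CacheUnavailable:
                self._mark_down()

    def load(self, session_id):
        if self._cache_allowed():
            try:
                value = self.cache.get(session_id)
                if value is not None:
                    return value
            except CacheUnavailable:
                self._mark_down()
        # Cache is down, skipped, or has no entry: use the local copy if any.
        return self.local.get(session_id)


if __name__ == "__main__":
    cache = FlakyCache()
    store = ResilientSessionStore(cache)
    store.save("abc", {"user": "alice"})
    cache.up = False          # simulate the Availability Zone failure
    print(store.load("abc"))  # served from the local fallback
```

The trade-off in this sketch is that sessions saved during an outage exist only in the process that saved them, and the local dictionary grows without bound. A production design would need eviction and a shared secondary store, neither of which is shown here.
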
Posted 11 days ago. Sep 04, 2019 - 21:41 UTC

Resolved
The MindTouch Engineering team has resolved the issue and is reviewing the incident.
Posted 15 days ago. Aug 31, 2019 - 16:27 UTC
Update
We are continuing to monitor for any further issues.
Posted 15 days ago. Aug 31, 2019 - 15:19 UTC
Monitoring
MindTouch sites are coming back online. The MindTouch Engineering team will continue to monitor the status of the sites.
Posted 15 days ago. Aug 31, 2019 - 15:18 UTC
Update
We are continuing to investigate this issue.
Posted 15 days ago. Aug 31, 2019 - 14:22 UTC
Investigating
MindTouch Service Degradation: Sites unavailable. The MindTouch Engineering team is investigating reports of site unavailability.
Posted 15 days ago. Aug 31, 2019 - 14:02 UTC
This incident affected: Application (General Service).