MindTouch Service Degradation: Sites unavailable

Incident Report for MindTouch

Postmortem

At 17:42 UTC, DevOps received multiple PagerDuty alerts indicating high latency and high error rates for all sites. Investigation showed that our content database instance had run low on memory and then failed over to our backup instance. To maintain platform stability during diagnosis, the DevOps team implemented a mitigation that proactively triggered failover events whenever available memory fell below the level needed for acceptable performance. After we discovered that AWS had applied an automatic minor version upgrade during its maintenance window on the previous day, the issue was escalated to AWS's technical support team for further investigation. To address the memory consumption, the instance size was increased, which brought memory consumption to a stable level. DevOps has continued monitoring and has confirmed the long-term stability of the instance.
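
For illustration only, the proactive failover mitigation described above could be sketched roughly as follows. This is not the DevOps team's actual tooling: the `FailoverGuard` class, the threshold, the sample count, and the cooldown are all hypothetical, and the `trigger` callback stands in for whatever command actually performs the failover.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

GIB = 1024 ** 3


@dataclass
class FailoverGuard:
    """Decides when to trigger a controlled failover (hypothetical sketch)."""

    threshold_bytes: int        # assumed minimum free memory for acceptable performance
    consecutive_samples: int    # low readings required in a row before acting
    cooldown_seconds: float     # minimum gap between two triggered failovers
    last_failover_at: Optional[float] = None

    def check(
        self,
        recent_free_bytes: Sequence[int],
        now: float,
        trigger: Callable[[], None],
    ) -> bool:
        """Call trigger() and return True if a failover is warranted right now."""
        window = list(recent_free_bytes)[-self.consecutive_samples:]
        # Not enough data yet: do nothing rather than act on a partial picture.
        if len(window) < self.consecutive_samples:
            return False
        # A single healthy reading in the window means memory is not persistently low.
        if any(sample >= self.threshold_bytes for sample in window):
            return False
        # Avoid flapping between instances by honouring the cooldown.
        if (
            self.last_failover_at is not None
            and now - self.last_failover_at < self.cooldown_seconds
        ):
            return False
        trigger()
        self.last_failover_at = now
        return True
```

Requiring several consecutive low readings and enforcing a cooldown are design assumptions intended to keep a brief dip in free memory from causing repeated failovers. In practice the readings would come from the database's memory metric and `trigger` would invoke the provider's failover operation.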

Posted Sep 25, 2023 - 20:20 UTC

Resolved

This incident has been resolved.

Posted Sep 18, 2023 - 23:07 UTC

Update

We are continuing to monitor for any further issues.

Posted Sep 18, 2023 - 19:15 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 18, 2023 - 18:41 UTC

Update

We are continuing to investigate this issue.

Posted Sep 18, 2023 - 17:57 UTC

Investigating

The MindTouch Engineering team is investigating reports of site unavailability.

Posted Sep 18, 2023 - 17:57 UTC

This incident affected: Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, and Analytics.