CXone Expert degraded performance: Slow load time
Incident Report for MindTouch
Postmortem

Customer Business Impact: A spike in PDF requests caused server latency.

Details: On 2023-12-21, between 15:20 and 15:40 UTC, all sites experienced increased latency and 50x errors.

Corrective Actions:

CXone Expert Engineering made the alerting threshold more sensitive so that traffic patterns of this type are caught sooner.

CXone Expert Engineering also added tasks to our backlog to create helper scripts that will let us respond more quickly. In addition, we have created documentation for off-hours support to help us remediate these issues.

We are still working on an item on our engineering maintenance board to implement more advanced rate limiting on a per-API-endpoint basis. This will help maintain consistent performance when requests hit endpoints that are slow or CPU heavy (pdf, llm/kernels, search, etc.).
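
As an illustration only, per-endpoint rate limiting is often built from one token bucket per endpoint, so that a burst against a slow endpoint such as pdf uses up only that endpoint's budget. The sketch below is not the CXone Expert implementation. The class names, endpoint keys, and limits are hypothetical, and a production limiter would usually also key on the site or client.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EndpointRateLimiter:
    """Keeps an independent token bucket for each endpoint name."""

    def __init__(self, limits: Dict[str, Tuple[float, float]],
                 default: Tuple[float, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limits = limits           # endpoint -> (rate per second, burst)
        self._default = default         # used for endpoints not listed above
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, endpoint: str) -> bool:
        bucket = self._buckets.get(endpoint)
        if bucket is None:
            rate, capacity = self._limits.get(endpoint, self._default)
            bucket = TokenBucket(rate, capacity, self._clock)
            self._buckets[endpoint] = bucket
        return bucket.allow()


if __name__ == "__main__":
    # Hypothetical limits: expensive endpoints get a much smaller budget.
    limiter = EndpointRateLimiter(
        limits={"pdf": (2.0, 4.0), "search": (20.0, 40.0)},
        default=(100.0, 200.0),
    )
    for i in range(6):
        print("pdf request", i, "allowed:", limiter.allow("pdf"))
```

A request handler would call `limiter.allow(endpoint)` before doing any work and return HTTP 429 when it is False. The clock is injectable only so that the behavior can be tested without sleeping.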

DevOps expects to roll out a change in the next week or two that will offload PDF traffic to its own cluster while we work on more permanent solutions for mitigating volumetric attacks such as this one.

Posted Jan 06, 2024 - 00:16 UTC

Resolved
This incident has been resolved. An RCA reviewing the incident will be released within the next few days.
Posted Dec 21, 2023 - 18:50 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 21, 2023 - 17:21 UTC
Identified
The issue has been identified and a resolution is being worked on.
Posted Dec 21, 2023 - 17:11 UTC
Investigating
CXone Expert Engineering is looking into issues related to slow load times and sites not loading.
Posted Dec 21, 2023 - 15:24 UTC
This incident affected: Application (General Service), Search, In-Product Contextual Help, and MindTouch Success Center.