CXone Mpower Expert - Service Degradation: Sites unavailable

Incident Report for CXone Expert US

Postmortem

Impact Start Time (UTC) 03/20/2025 01:46 PM UTC
Impact End Time (UTC) 03/20/2025 02:01 PM UTC

Incident Summary
Updated on 03/26/2025 - On 03/20/2025, some CXone Mpower customers reported being unable to access the CXone Mpower Expert knowledge portal, encountering "503" and "504" error messages. The issue occurred during the regular Quality Assurance (QA) site creation process, where rapidly deploying multiple sites through a system script generated an unexpectedly high load, placing excessive strain on infrastructure components. The impact was resolved by restarting the affected services and performing a rolling restart on the affected nodes, restoring services to normal operation.

Root Cause
The issue occurred during the regular QA site creation process, where rapidly deploying multiple sites through a system script generated an unexpectedly high load, placing excessive strain on infrastructure components. As part of a regular deployment process, several QA sites were created to run integration tests before directing customer traffic to the new deployment. However, this unexpectedly caused frequent reloads of the load balancer, leading to timeouts and unresponsive pages. Ultimately, this triggered alerts and resulted in customer impact.

Although this procedure had never previously caused such an issue, engineers recognized the need to enhance the system to accommodate the growing load in the production environment, driven by an increasing number of customers and their growing usage. They promptly developed and implemented preventive measures to mitigate the risk of similar incidents in the future.

Corrective Actions
Detection:
Internal support teams detected a potentially customer-impacting issue through proactive alarms and monitoring mechanisms. The issue was later confirmed by customer reports of being unable to access the CXone Mpower Expert knowledge portal and encountering "503" and "504" error messages.
Remediation:
The impact was resolved by restarting the affected services and performing a rolling restart on the affected nodes, restoring services to normal operation. Completed on 03/20/2025.
Prevention:
The engineering team implemented rate-limiting measures to control the number of QA sites created simultaneously and increased the pause duration between each site's creation, preventing excessive load spikes during this procedure. Completed on 03/20/2025.
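
The prevention measure above (a cap on simultaneous QA site creation plus a pause between creations) could be sketched roughly as follows. This is an illustration only, not the actual deployment script, which this report does not describe. The names create_sites_throttled, create_site, max_concurrent, and pause_seconds are hypothetical, and the default values are placeholders rather than the limits engineering chose.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def create_sites_throttled(
    site_names: Iterable[str],
    create_site: Callable[[str], str],  # hypothetical: whatever provisions one QA site
    max_concurrent: int = 2,            # placeholder cap on simultaneous creations
    pause_seconds: float = 5.0,         # placeholder pause between creations
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Create sites with a bounded number in flight and a pause between starts."""
    results: List[str] = []
    # The pool size bounds how many creations can run at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = []
        for index, name in enumerate(site_names):
            if index > 0:
                # Space out submissions so downstream components, such as a
                # load balancer that reloads per new site, are not hit in a burst.
                sleep(pause_seconds)
            futures.append(pool.submit(create_site, name))
        # Collect results in submission order; an exception from any
        # creation is re-raised here.
        for future in futures:
            results.append(future.result())
    return results
```

Passing sleep as a parameter keeps the sketch testable without real delays. In practice the two limits would be tuned against how quickly the affected infrastructure components absorb each new site.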

Risk of Reoccurrence of Impact: Low

Incident Timeline (UTC)
03/20/2025 01:46 PM (UTC) - Internal support teams received potentially customer-impacting alerts and posted a service disruption notification on the Status Health Portal. Simultaneously, the first customer case was opened, prompting Tech Support (TS) engineers to begin their initial validation and troubleshooting investigation, which later confirmed the issue was related to the major incident.
03/20/2025 01:48 PM (UTC) - Engineers proactively raised a major incident while continuing to work on restoring the service.
03/20/2025 01:51 PM (UTC) - Engineers restarted the affected service components, stabilizing the system. They continued to monitor the system’s health.
03/20/2025 02:01 PM (UTC) - After further monitoring and health checks, the impact was confirmed to be fully resolved. Following successful test validations, the major incident was officially marked as resolved.

Posted Mar 26, 2025 - 19:18 UTC