CXone Mpower Expert - Service Degradation: US Sites unavailable

Incident Report for CXone Expert US

Postmortem

Impact Start Time (UTC) 12/30/2025 08:01 PM UTC
Impact End Time (UTC) 12/31/2025 05:10 PM UTC

Incident Summary
Updated on 01/07/2026 - On 12/30/2025, some NiCE CXone Mpower customers reported intermittent latency and "503 Service Unavailable" error messages when using the CXone Mpower Expert knowledge portal. The root cause was sustained Central Processing Unit (CPU) saturation within a frontend component under production traffic, which degraded responsiveness and caused backend unavailability as observed on the load balancer. The impact was resolved by scaling up the frontend component, increasing its CPU limits and request-handling capacity.

Root Cause
The root cause was sustained CPU saturation within a frontend component under production traffic, which degraded responsiveness and caused backend unavailability as observed on the load balancer. While scaling actions temporarily improved stability, the persistence of the issue indicated that the web component could not efficiently handle workload demand under certain traffic patterns, resulting in intermittent service disruption. Engineers confirmed there were no unusual spikes in customer usage, and log reviews revealed no clear indicators of the cause of the CPU saturation. This incident highlighted key areas for improvement and led to preventive action plans focused on improving auto-recovery capabilities and enhancing logging mechanisms to support faster and more effective investigations of similar issues in the future.

Corrective Actions

Detection

Internal support received critical alerts indicating a potentially customer-impacting issue, which was later confirmed through customer reports of intermittent latency and "503 Service Unavailable" error messages when using the CXone Mpower Expert knowledge portal.

Remediation

The impact was resolved by scaling up the frontend component to increase the CPU limits and request-handling capacity. Completed on 12/31/2025.
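The report does not name the orchestration platform, but since the timeline refers to pods and vertical scaling, the fix can be sketched as raising the CPU requests and limits on a Kubernetes Deployment. The deployment name, container name, and resource values below are hypothetical, not taken from the incident report.

```yaml
# Hypothetical sketch of the vertical scaling remediation: the names
# "expert-frontend"/"web" and all resource values are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: expert-frontend
spec:
  template:
    spec:
      containers:
        - name: web
          resources:
            requests:
              cpu: "1"          # guaranteed CPU for scheduling (illustrative)
              memory: 1Gi
            limits:
              cpu: "4"          # raised limit to absorb sustained CPU saturation
              memory: 2Gi
```

Raising the limit gives the container more headroom before throttling; raising the request ensures the scheduler places pods on nodes that can actually supply that CPU.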

Prevention

The Engineering team will enable an additional logging mechanism to capture detailed events on the frontend component. This enhancement will support more effective investigations and help determine the underlying issues affecting this component. An update will be provided by End of Day (EOD) MT on 01/23/2026.

The Engineering team will update the autoscaling configuration for the frontend component to increase the maximum limit. This adjustment will enhance the auto-recovery mechanism and reduce the likelihood of similar impacts during CPU saturation incidents. An update will be provided by EOD MT on 01/23/2026.
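Assuming the frontend is managed by a Kubernetes HorizontalPodAutoscaler (consistent with the pod scaling described in the timeline), the change can be sketched as raising `maxReplicas` so that auto-recovery is not capped during CPU spikes. The object names and all numbers below are hypothetical.

```yaml
# Hypothetical sketch of the autoscaling change: names, replica counts,
# and the utilization target are illustrative, not from the report.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: expert-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: expert-frontend
  minReplicas: 3
  maxReplicas: 12        # raised ceiling so recovery is not capped mid-incident
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```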

The Engineering team will optimize the runtime process model and limits to better match the CPU and memory resources allocated to the web container. An update will be provided by EOD MT on 01/23/2026.
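The report does not identify the web runtime, but for a pre-fork style server this kind of tuning typically means sizing the worker process count to the container's CPU allocation. The environment variable name and values below are an assumption for illustration, not details from the incident.

```yaml
# Hypothetical sketch: aligning worker processes with the container's CPU
# allocation. WEB_CONCURRENCY is a common pre-fork server convention (assumed);
# the actual runtime and its tuning knobs are not named in the report.
containers:
  - name: web
    resources:
      limits:
        cpu: "4"
        memory: 2Gi
    env:
      - name: WEB_CONCURRENCY
        value: "8"        # ~2 workers per CPU, sized to the 4-CPU limit
```

The design point is that a worker count far above the CPU limit causes contention and throttling, while one far below it leaves allocated CPU idle; matching the two reduces both failure modes.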

Incident Timeline (UTC)

12/30/2025 08:01 PM (UTC) - Engineers observed telemetry indicating possible customer impact.
12/30/2025 08:14 PM (UTC) - Engineers notified the Network Operations Center (NOC) about the reported customer impact; a major incident was proposed and confirmed.
12/30/2025 08:15 PM (UTC) - The first customer case was reported and confirmed to be related to the ongoing major incident.
12/30/2025 08:20 PM (UTC) - The system recovered and internal test validations were successful.
12/30/2025 08:42 PM (UTC) - Another spike in CPU utilization was observed, which normalized at 09:06 PM (UTC).
12/30/2025 09:30 PM (UTC) - Engineers performed horizontal scaling to add more pods.
12/30/2025 10:32 PM (UTC) - Engineers observed the issue recur with multiple spikes in utilization. They performed vertical scaling to increase the pods' resources, which restored the service.
12/30/2025 11:48 PM (UTC) - The CPU utilization returned to normal and the service was restored.
12/31/2025 12:05 AM (UTC) - Engineers started seeing pods failing to start up and immediately performed remediation actions.
12/31/2025 12:10 AM (UTC) - Engineers reported that they were still seeing intermittent alerts, but the sites were already up.
12/31/2025 03:58 AM (UTC) - While engineers were closely monitoring the platform, another short CPU spike was experienced. Engineers continued their platform evaluations.
12/31/2025 09:10 AM (UTC) - Another spike in utilization was observed, which lasted only a few minutes. Engineers continued close monitoring.
12/31/2025 12:40 PM (UTC) - Telemetry indicated a short spike in utilization that automatically recovered; the sites remained operational.
12/31/2025 04:14 PM (UTC) - While engineers continued to monitor and troubleshoot, another spike in utilization occurred.
12/31/2025 05:10 PM (UTC) - The impact was resolved when engineers increased the CPU limits, which had been reset during the redeployment of a component as part of an earlier remediation action. Following successful test validations, the major incident was marked as resolved.

Posted Jan 07, 2026 - 22:41 UTC

Resolved

A fix has been implemented and we are monitoring the results.
Posted Dec 30, 2025 - 20:40 UTC

Update

We are continuing to monitor for any further issues.
Posted Dec 30, 2025 - 20:36 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Dec 30, 2025 - 20:19 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Dec 30, 2025 - 20:15 UTC

Update

We are continuing to investigate this issue.
Posted Dec 30, 2025 - 20:11 UTC

Update

We are continuing to investigate this issue.
Posted Dec 30, 2025 - 20:10 UTC

Investigating

CXone Mpower Expert Service Degradation: US Sites unavailable. The CXone Mpower Expert Engineering team is investigating reports of site unavailability.
Posted Dec 30, 2025 - 20:10 UTC
This incident affected: Application (General Service), Search, In-Product Contextual Help, Email Services, MindTouch Success Center, Analytics, and Geoblocking for Russia.