Customer Business Impact: A spike in PDF requests caused server latency.
Details: On 2023-12-21, between 15:20 and 15:40 UTC, all sites experienced increased latency and 50x errors.
Corrective Actions:
CXone Expert Engineering made the alerting threshold more sensitive so that these types of traffic patterns are caught sooner.
CXone Expert Engineering also added backlog tasks to create helper scripts that will let us respond more quickly. In addition, we have created documentation for off-hours support to help us remediate these issues.
We are still working on an engineering maintenance board item to implement more advanced rate limiting on a per-API-endpoint basis. This will help maintain consistent performance for endpoints that are slow or CPU-heavy (pdf, llm/kernels, search, etc.).
DevOps expects to roll out a change in the next week or two that will offload PDF traffic to its own cluster while we work on more permanent solutions for mitigating volumetric attacks such as this one.
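
For illustration only, per-endpoint rate limiting of the kind mentioned in the corrective actions could be sketched as one token bucket per endpoint. This is not the CXone Expert implementation: the class name, the endpoint keys, and the limits below are hypothetical, and a production version would likely also key on site or client and share state across servers.

```python
import time
from typing import Callable, Dict, Tuple

# (refill rate in requests per second, burst capacity); all values are hypothetical.
Limit = Tuple[float, float]


class EndpointRateLimiter:
    """Token-bucket rate limiter that keeps a separate bucket per endpoint."""

    def __init__(
        self,
        limits: Dict[str, Limit],
        default: Limit = (50.0, 100.0),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limits = limits
        self._default = default
        self._clock = clock
        # endpoint -> [tokens currently available, time of last refill]
        self._buckets: Dict[str, list] = {}

    def allow(self, endpoint: str) -> bool:
        """Return True if a request to `endpoint` may proceed right now."""
        rate, capacity = self._limits.get(endpoint, self._default)
        now = self._clock()
        # A bucket starts full the first time its endpoint is seen.
        bucket = self._buckets.setdefault(endpoint, [capacity, now])
        tokens, last = bucket
        # Refill in proportion to elapsed time, capped at the burst capacity.
        tokens = min(capacity, tokens + (now - last) * rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        bucket[0], bucket[1] = tokens, now
        return allowed


# Slow or CPU-heavy endpoints get tighter limits than the default.
limiter = EndpointRateLimiter({"pdf": (2.0, 5.0), "search": (20.0, 40.0)})
```

A request handler would call `limiter.allow("pdf")` before doing the expensive work and reject the request (for example with HTTP 429) when it returns False. Keeping a separate bucket per endpoint means a burst of PDF requests exhausts only the PDF budget rather than degrading every endpoint.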