Customer Business Impact: On 2023-01-05, from approximately 17:40 to 17:50 UTC, all sites experienced increased latency and 50x errors. Full functionality was restored to all sites at 17:50 UTC.
Details: At 17:42 UTC, DevOps received multiple PagerDuty notifications indicating high latency and error rates for all sites.
Recovery: At 17:48 UTC, DevOps identified the root cause: a container image was missing from our container registry. A workaround was deployed to use the previous version of the image. At 17:49 UTC sites recovered, and by 17:50 UTC error rates and latency had returned to nominal levels.
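The fallback used in the workaround (deploy the previous image when the requested one is missing) can be illustrated with a minimal sketch. This is not the tooling used during the incident. The function names `manifest_exists` and `select_deployable_tag` are hypothetical. The registry check assumes an unauthenticated registry that implements the Docker Registry HTTP API v2 manifest endpoint; most production registries also require an auth token, which is omitted here.

```python
import urllib.error
import urllib.request
from typing import Callable


def manifest_exists(registry_url: str, repository: str, tag: str) -> bool:
    """Return True if the registry has a manifest for repository:tag.

    Assumption: the registry speaks the Registry HTTP API v2 and allows
    anonymous HEAD requests. Authentication is deliberately left out.
    """
    url = f"{registry_url}/v2/{repository}/manifests/{tag}"
    request = urllib.request.Request(url, method="HEAD")
    request.add_header(
        "Accept",
        "application/vnd.docker.distribution.manifest.v2+json, "
        "application/vnd.oci.image.manifest.v1+json",
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # the image tag is not in the registry
        raise  # any other failure is surfaced rather than guessed at


def select_deployable_tag(
    desired: str, previous: str, exists: Callable[[str], bool]
) -> str:
    """Pick the tag to deploy: the desired one if present, else the previous one."""
    if exists(desired):
        return desired
    if exists(previous):
        return previous
    raise LookupError(f"neither {desired!r} nor {previous!r} is in the registry")
```

A deploy step could call `select_deployable_tag` with a checker such as `lambda tag: manifest_exists(registry_url, repository, tag)`, where `registry_url` and `repository` are placeholders for real values. The checker is passed in as a parameter so the selection logic can be exercised without network access.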
Corrective Actions