Incident Report
Incident Reference: 84100
Incident Overview
Outage of Kornit X Platform including Web Tier, Order Creation and Background Processing
Date/Time identified
09/02/2022 01:45 GMT
How was the issue identified?
- Various alerts were raised by the Kornit X system monitoring regarding failed connectivity with the Platform.
- In addition, several customers contacted our Out of Hours service to report issues accessing the web tier and connecting to the Platform via our APIs.
What Date / Time was the service to customers first affected?
09/02/2022 01:37 GMT
What was the impact?
All Kornit X core order services were unavailable –
Including:
- Order creation and background processing of orders.
- Web tier, meaning users were unable to view and manually process orders, download artwork or complete production tasks.
Excluding:
- Front end Smartlinks that did not feature pricing or stock data
- API services unrelated to order processing
What was done to restore service?
- Initially, the primary focus was on the Platform database, which was experiencing failed write operations and transactions being held open. Various actions were taken to troubleshoot this and identify the cause of these failures, including a server reboot and a failover to the replica database, neither of which resolved the issue. Unfortunately, it took a significant amount of time to eliminate the database as the cause of the issue.
- Subsequent analysis identified the issue as blocked connections on the RabbitMQ message queue service, which is responsible for various backend processing tasks. These blocked connections caused the application to hang and not commit changes to the database.
- The monitoring services on the server responsible for the messaging did not flag the memory issue, which increased the time it took to identify the problem.
- Steps were taken to unblock these connections, but these were unsuccessful, so the services were restarted to force removal of the blocked connections. Additional memory was also added to the infrastructure to provide more capacity. After this action was taken, traffic started to flow successfully. At this point the Web Tier was operational and order creation was available.
- The background processing application services were then restarted gracefully, and various tests were completed before the incident was confirmed as resolved and all services as operational again.
- Some scheduled services were run manually to reduce further delays in executing background processing tasks and progressing orders to fulfilment.
- Our server monitoring team were able to restart the platform several times, but had to get the software team involved when it became clear the platform kept locking. As a result, the service went up and down several times during the incident, and orders could be processed for short intervals.
What time was service restored?
09/02/2022 10:50 GMT – Web Tier (platform.custom-gateway.net) and order creation
09/02/2022 11:30 GMT – Backend processing services
What caused the service to fail?
- A large volume of messages built up on the message queueing service, resulting in increased memory usage and blocked connections. This in turn led to application/database connections being held open, which caused connectivity to the Platform to fail.
- The Platform did not handle the blocked connections to the message queue service gracefully, resulting in a much greater impact than would otherwise be expected.
What is being done to prevent recurrence?
- We are investigating why connections between the application layer and our message queue system did not simply time out rather than being held open. A timeout would have been handled gracefully by the application and would not have resulted in any significant issues. A timeout would also have been picked up by our logging and alerts system, leading to a much quicker resolution.
- Once this is understood, changes will be made accordingly to ensure timeout errors do occur.
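As an illustration only of the intended behaviour, the sketch below shows one generic way a blocking publish call could be bounded by a timeout so that a blocked connection surfaces as an error instead of hanging. It does not represent the Platform's code or the eventual fix, which is still under investigation. The names `publish_with_timeout`, `PublishTimeout` and `publish_fn` are hypothetical, and `publish_fn` stands in for whatever client call actually sends a message to the queue.

```python
import threading


class PublishTimeout(Exception):
    """Raised when a publish call does not return within the allowed time."""


def publish_with_timeout(publish_fn, message, timeout_s=5.0):
    """Run publish_fn(message) in a worker thread and give up after timeout_s.

    publish_fn is a hypothetical stand-in for the real queue client call.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = publish_fn(message)
        except Exception as exc:  # pass failures back to the caller
            outcome["error"] = exc

    # Daemon thread: a permanently blocked call cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        # The call is still blocked: raise so it can be logged and alerted on.
        raise PublishTimeout(f"publish did not complete within {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")
```

In this sketch the caller receives a `PublishTimeout` exception, which it can log and handle without holding a database transaction open. The blocked worker thread itself is abandoned, not cancelled, so a real implementation would most likely rely on the queue client's own timeout settings where available.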
Are there any further actions?
- Introduction of additional monitoring of the message queue service to help identify this as the cause of any similar failures in future. This is now completed.
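The following is a minimal sketch of the kind of check such monitoring might perform, covering the three symptoms described in this report: message backlog, memory usage and blocked connections. It is not a description of the monitoring that was actually deployed. The `QueueStats` fields, the `evaluate_queue_health` function and the threshold values are all hypothetical, and the figures would need to be collected from the message queue service by other means.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    """Hypothetical snapshot of message queue metrics."""
    name: str
    messages_ready: int
    memory_used_bytes: int
    memory_limit_bytes: int
    blocked_connections: int


def evaluate_queue_health(stats, max_messages=10_000, max_memory_fraction=0.8):
    """Return a list of alert strings; an empty list means no threshold was crossed.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    alerts = []
    if stats.messages_ready > max_messages:
        alerts.append(
            f"{stats.name}: backlog of {stats.messages_ready} messages "
            f"exceeds {max_messages}"
        )
    if stats.memory_limit_bytes > 0:
        fraction = stats.memory_used_bytes / stats.memory_limit_bytes
        if fraction > max_memory_fraction:
            alerts.append(f"{stats.name}: memory at {fraction:.0%} of limit")
    if stats.blocked_connections > 0:
        alerts.append(
            f"{stats.name}: {stats.blocked_connections} blocked connection(s)"
        )
    return alerts
```

Alerting on blocked connections directly, as well as on backlog and memory, would point to the message queue as the cause even if the memory threshold itself were set too high to trigger.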