System Outage

Incident Report for KornitX

Postmortem

Incident Report

Incident Reference: 84100

Incident Overview

Outage of Kornit X Platform including Web Tier, Order Creation and Background Processing

Date/Time identified

09/02/2022 01:45 GMT

How was the issue identified?

Various alerts were raised by the Kornit X system monitoring regarding failed connectivity with the Platform.
In addition, several customers contacted our Out of Hours service to report issues accessing the web tier and connecting to the Platform via our APIs.

What Date / Time was the service to customers first affected?

09/02/2022 01:37 GMT

What was the impact?

All Kornit X core order services were unavailable.

Including:

Order creation and background processing of orders.
Web tier, meaning users were unable to view and manually process orders, download artwork, or complete production tasks.

Excluding:

Front-end Smartlinks that did not feature pricing or stock data
API services unrelated to order processing

What was done to restore service?

Initially, the primary focus was on the Platform database, which was experiencing failed write operations and transactions being held open. Various actions were taken to troubleshoot this and identify the cause of these failures, including a server reboot and a failover to the replica database, neither of which resolved the issue. Unfortunately, it took a significant amount of time to eliminate the database as the cause of the issue.
Subsequent analysis identified the issue to be related to blocked connections on the RabbitMQ message queue service, which is responsible for various backend processing tasks. These blocked connections caused the application to hang and not commit changes to the database.
The monitoring services on the server responsible for the messaging did not flag the memory issue, which increased the time it took to identify the problem.
Attempts to unblock these connections were unsuccessful, so the services were restarted to force the removal of the blocked connections. Additional memory was also added to the infrastructure to provide more capacity. After taking this action, traffic started to flow successfully. At this point the Web Tier was operational and order creation was available.
The background processing application services were then restarted gracefully, and various tests were completed before the incident was confirmed as resolved and all services as operational again.
Some scheduled services were run manually to reduce further delays in executing background processing tasks and progressing orders to fulfilment.
Our server monitoring team was able to restart the platform several times but had to involve the software team when it became clear that the platform kept locking up. As a result, the service went up and down several times during the incident, and orders could be processed for short intervals.

What time was service restored?

09/02/2022 10:50 GMT – Web Tier (platform.custom-gateway.net) and order creation

09/02/2022 11:30 GMT – Backend processing services

What caused the service to fail?

A large volume of messages built up on the message queueing service, resulting in increased memory usage and blocked connections. This in turn led to application and database connections being held open, which subsequently caused connectivity to the Platform to fail.
The Platform did not handle the blocked connections to the message queue service gracefully, resulting in a much greater impact than would be expected.

What is being done to prevent recurrence?

We are investigating why connections between the application layer and our message queue system did not simply time out rather than being held open. A timeout would have been handled gracefully by the application and would not have resulted in any significant issues. A timeout would also have been picked up by our logging and alerts system, leading to a much quicker resolution.
Once this is understood, changes will be made accordingly to ensure timeout errors do occur.
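
As an illustration only, the intended behaviour (a blocked call failing with a timeout error instead of hanging) can be sketched as follows. This is not the Platform's code: `call_with_deadline` and `PublishTimeout` are hypothetical names, and the blocked publish is simulated with a call that never returns.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class PublishTimeout(Exception):
    """Hypothetical error raised when a call exceeds its deadline."""


def call_with_deadline(fn: Callable[[], T], timeout_s: float) -> T:
    """Run fn in a worker thread and give up after timeout_s seconds.

    The caller receives an error it can log and alert on, instead of
    hanging while holding a database transaction open.
    """
    outcome = {}

    def runner() -> None:
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    # Daemon thread: a call that never returns will not block process exit.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        raise PublishTimeout(f"call did not complete within {timeout_s}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


if __name__ == "__main__":
    never_released = threading.Event()
    try:
        # Simulates a publish on a blocked connection: it never returns.
        call_with_deadline(never_released.wait, timeout_s=0.1)
    except PublishTimeout as exc:
        print("timed out:", exc)
```

Note that this wrapper only frees the caller; the worker thread itself stays blocked. A timeout configured on the message queue client or its socket, where the client supports one, would be the more direct fix.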

Are there any further actions?

Introduction of additional monitoring of the message queue service to help identify this as the cause of any similar failures in future. This is now completed.
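
To illustrate the kind of checks involved, the following is a minimal sketch and not the monitoring that was actually deployed. The `BrokerSnapshot` fields, the thresholds, and the connection state labels are all assumptions; in practice the readings would come from whatever metrics the message queue service exposes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BrokerSnapshot:
    """One point-in-time reading of the broker (all fields hypothetical)."""
    mem_used_bytes: int
    mem_limit_bytes: int
    queue_depths: Dict[str, int] = field(default_factory=dict)
    connection_states: List[str] = field(default_factory=list)


def find_alerts(snap: BrokerSnapshot,
                max_queue_depth: int = 10_000,
                max_mem_ratio: float = 0.8) -> List[str]:
    """Return a human-readable alert for each threshold that is exceeded."""
    alerts: List[str] = []

    # Memory pressure: the condition that went unflagged during the incident.
    if snap.mem_limit_bytes > 0:
        ratio = snap.mem_used_bytes / snap.mem_limit_bytes
        if ratio >= max_mem_ratio:
            alerts.append(f"memory at {ratio:.0%} of limit")

    # Message build-up on individual queues.
    for name, depth in sorted(snap.queue_depths.items()):
        if depth > max_queue_depth:
            alerts.append(f"queue {name} has {depth} messages waiting")

    # Connections the broker has stopped reading from (assumed state labels).
    blocked = sum(1 for state in snap.connection_states
                  if state in ("blocked", "blocking"))
    if blocked:
        alerts.append(f"{blocked} connection(s) blocked")

    return alerts


if __name__ == "__main__":
    snapshot = BrokerSnapshot(
        mem_used_bytes=900,
        mem_limit_bytes=1000,
        queue_depths={"orders": 50_000, "emails": 10},
        connection_states=["running", "blocked"],
    )
    for alert in find_alerts(snapshot):
        print("ALERT:", alert)
```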

Posted Feb 09, 2022 - 21:36 UTC

Resolved

Full functionality has been restored and we will continue to monitor.

Bear in mind that there may be delays while the system deals with the backlog.

We will update shortly with details of our findings.

Posted Feb 09, 2022 - 12:01 UTC

Identified

The web tier at platform.custom-gateway.net is recovered and order creation is available again.

At this stage, while we continue our investigations, we have not enabled background processing.

We will provide a further update at 11:30 UTC.

Posted Feb 09, 2022 - 11:00 UTC

Update

The service outage is being investigated by our Principal Technical team and the issue has been escalated to our senior management. All resources are deployed to resolve the problem as quickly as possible.

The incident is affecting our primary database, but the exact cause is currently unknown and we do not yet have an ETA for resolution. We are working to restore service as quickly as possible.

We will provide a further update at 11:00 UTC.

Posted Feb 09, 2022 - 10:26 UTC

Update

The investigation into the incident is ongoing.

We will provide a further update at 10:00 UTC.

Posted Feb 09, 2022 - 09:06 UTC

Update

The investigation into the incident is ongoing.

We will provide a further update at 09:00 UTC.

Posted Feb 09, 2022 - 08:37 UTC

Investigating

We are currently experiencing an outage of our platform.

An investigation with our hosting partners is underway and an update will be posted again at 08:30 UTC.
Platform OMS / CPP / APIs / Smartlinks

Posted Feb 09, 2022 - 07:54 UTC

This incident affected: Platform Core (London - UK), Distributed Smartlinks (Global), and Artwork and Asset Generation (Global).