Database Replication
Incident Report for KornitX
Postmortem

Incident Report

Incident Reference: 92861

Incident Overview

Incident affecting the replication between Master and Replica databases on the Kornit X Platform

Date/Time identified

08/11/2022 08:43 GMT

How was the issue identified?

  • Alerts were raised by the Kornit X system monitoring

What Date / Time was the service to customers first affected?

08/11/2022 09:30 GMT

What was the impact?

The customer impact resulted from action taken by Kornit X: we disabled our background processes and integrations to protect data integrity across the Platform while the replication service was down.

This resulted in several functions and services being unavailable, including order processing, data import/export and production workflow for a number of customers.

Orders processed via API were not impacted and a number of functions within the Kornit X Web Portal were unaffected.

What was done to restore service?

  • Initially, we restarted the Master database so that the replication files were written again.
  • After this was completed at 13:00 GMT, we re-enabled several key services and integrations, and continued to do so throughout the day to minimise customer impact and keep as many services as possible operational.
  • Subsequently, we went through a lengthy process of rebuilding our replica databases, which involved taking a full backup of the Master database, copying the backup files to the relevant server and restoring the replica databases.
  • Once the replica databases had been rebuilt, all remaining services were re-enabled.

What time was service restored?

  • 08/11/2022 13:00 GMT – Primary services and integrations
  • 08/11/2022 15:00 – 21:00 GMT – Secondary services and integrations
  • 09/11/2022 10:00 GMT – All services and integrations operational

What caused the service to fail?

  • V2 Order Manager search queries resulted in data-intensive transactions being executed against the Master database.
  • The temporary tables created to support these queries caused the temporary directory on the server to run out of disk space. This triggered a default MySQL setting, which ultimately resulted in the Master database no longer writing to the replication logs and the replica databases ceasing to function.
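
To illustrate the failure mode described above, the following is a minimal sketch of a free-space check on a temporary directory. It is not the Kornit X implementation: the function names, the 15% threshold and the use of the operating system's default temporary directory as a stand-in path are all assumptions made for the example.

```python
import shutil
import tempfile


def free_space_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path: str, min_free_fraction: float = 0.15) -> bool:
    """Return True when free space falls below the threshold (assumed value)."""
    return free_space_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # Stand-in path (assumption): on a database host this would be the
    # directory that MySQL reports for its `tmpdir` system variable.
    tmp_path = tempfile.gettempdir()
    print(f"{tmp_path}: {free_space_fraction(tmp_path):.1%} free")
    if is_low_on_space(tmp_path):
        print("WARNING: temporary directory is running low on disk space")
```

A check of this kind only reports the symptom. Large temporary tables can fill a volume quickly, so the threshold would need to leave enough headroom for the heaviest expected query.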

What is being done to prevent recurrence?

  • Platform V2 has been updated to use the AWS Elastic Search service.
  • A MySQL setting has been updated to avoid the scenario where the replication service fails and the replica databases require rebuilding.

Are there any further actions?

  • Additional monitoring has been configured to help identify a similar occurrence earlier.
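
As an illustration only, a replication health check of the kind such monitoring might include could be sketched as follows. The report does not describe the monitoring that was actually configured. The sketch assumes the status row has already been fetched from MySQL's SHOW REPLICA STATUS (or the older SHOW SLAVE STATUS) and is passed in as a dictionary; the function names, the state labels and the 300-second lag threshold are hypothetical.

```python
from typing import Mapping, Optional


def _field(row: Mapping[str, object], *names: str) -> Optional[object]:
    """Return the first of the given column names present in the row."""
    for name in names:
        if name in row:
            return row[name]
    return None


def classify_replica(row: Mapping[str, object], max_lag_seconds: int = 300) -> str:
    """Classify a replica as 'ok', 'lagging', 'broken' or 'unknown'.

    Accepts both the newer (Replica_*/Source) and older (Slave_*/Master)
    column names used by MySQL for the replica status output.
    """
    io_running = _field(row, "Replica_IO_Running", "Slave_IO_Running")
    sql_running = _field(row, "Replica_SQL_Running", "Slave_SQL_Running")
    lag = _field(row, "Seconds_Behind_Source", "Seconds_Behind_Master")

    # Either replication thread not running means the replica is not applying changes.
    if io_running != "Yes" or sql_running != "Yes":
        return "broken"
    # MySQL reports NULL lag when it cannot be determined.
    if lag is None:
        return "unknown"
    if int(lag) > max_lag_seconds:
        return "lagging"
    return "ok"


if __name__ == "__main__":
    sample = {
        "Replica_IO_Running": "Yes",
        "Replica_SQL_Running": "Yes",
        "Seconds_Behind_Source": 12,
    }
    print(classify_replica(sample))
```

Thread status and lag alone may not reveal a source database that has stopped writing its replication logs, since the replica can appear idle and fully caught up. A check like this is therefore usually paired with disk-space alerts or a heartbeat written on the source and read back on the replica.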
Posted Nov 09, 2022 - 18:37 UTC

Resolved
This incident is now resolved.

Please note that there is a backlog in processing data exports/reports following this issue, so some delays should be expected when exporting data from the Platform. Capacity on these services has been increased to help clear this backlog as quickly as possible.

If you experience any outstanding issues please contact our support team via any tickets you have open or log a new ticket via our Online Portal (https://support.kornitx.net/portal/en/home) and we will investigate accordingly.
Posted Nov 09, 2022 - 12:19 UTC
Monitoring
All remaining Kornit X services have now been re-enabled and are operational. We are continuing to monitor for a short period before updating this incident as resolved.

If you experience any outstanding issues please contact our support team via any tickets you have open or log a new ticket via our Online Portal (https://support.kornitx.net/portal/en/home) and we will investigate accordingly.
Posted Nov 09, 2022 - 10:57 UTC
Update
We have now completed the process of recreating the replica database. We are currently running some checks before enabling the remaining background processes on this database. This will be done in a phased manner whilst monitoring all services.

We will provide a further update soon.
Posted Nov 09, 2022 - 09:04 UTC
Update
We are continuing to work on this issue. Unfortunately, the process of backing up the master database, copying the backup files to the relevant server and restoring the replica databases is going to take longer than initially estimated.

As a result, this issue will be ongoing until tomorrow. We are continuing to enable functions/integrations as required to minimise any impact and ensure ongoing operations across as many services as possible.

If you are still experiencing a significant impact on any services, especially those relating to the processing of orders, please contact our support team via our Online Portal (https://support.kornitx.net/portal/en/home) and we will investigate accordingly.
Posted Nov 08, 2022 - 22:33 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 08, 2022 - 17:48 UTC
Update
We are continuing to work on the process to back up our Master database and then restore the data to the replica databases so that all remaining background processing can be re-enabled. We do not have an ETA on this yet.

Many services have been re-enabled, including the CSV Order Import function and a number of Supplier Integrations.
Posted Nov 08, 2022 - 15:52 UTC
Update
The Master database has been restarted successfully and the replication files are being written again. Some primary services/processes have been re-enabled. We are now starting the process of rebuilding our replica databases, after which we can re-enable all remaining background processing tasks. However, this will take a number of hours.

We will try to get as many services/background processing tasks operational as possible during this time to minimise the ongoing impact of this incident.
Posted Nov 08, 2022 - 13:13 UTC
Update
We are continuing to investigate this issue.
Posted Nov 08, 2022 - 13:04 UTC
Update
Please note, we are restarting the Master database for our Platform in 5 to 10 minutes. This will result in downtime to all services for a short period. Please expect that no Kornit X services will be operational during this time.
Posted Nov 08, 2022 - 12:43 UTC
Update
We are continuing to investigate this issue.
Posted Nov 08, 2022 - 12:38 UTC
Update
We are continuing to work on this incident.

Due to the issues affecting the database replication on our Platform, we need to go through a process of rebuilding our replica databases in order to get all services operational again.

As this is a lengthy process that could take a number of hours, we will be re-enabling some of the primary processing services and pointing these to our Master database in the short term in order to get these key services operational.

We will update once these primary services are operational and provide further details regarding what is operational and what is still impacted by this issue.
Posted Nov 08, 2022 - 12:08 UTC
Investigating
We have an issue affecting the replication between our databases. We are currently investigating and will provide updates as soon as possible.

To minimise the impact of this issue, we have disabled our background processes, which will impact a number of functions on our Kornit X Platform Portal and the integrations running on the Platform.
Posted Nov 08, 2022 - 11:06 UTC
This incident affected: Platform Core (London - UK) and Artwork and Asset Generation (Global).