Incident Report
Incident Reference: 92861
Incident Overview
Failure of replication between the Master and Replica databases on the Kornit X Platform
Date/Time identified
08/11/2022 08:43 GMT
How was the issue identified?
- Alerts were raised by the Kornit X system monitoring
What Date / Time was the service to customers first affected?
08/11/2022 09:30 GMT
What was the impact?
The customer impact was introduced by Kornit X itself: to ensure data integrity across the Platform whilst the replication service was down, we had to disable our background processes and integrations.
This resulted in several functions and services being unavailable, including order processing, data import/export and production workflow for a number of customers.
Orders processed via API were not impacted and a number of functions within the Kornit X Web Portal were unaffected.
What was done to restore service?
- Initially, we restarted the Master database so that it resumed writing replication log files
- Once this was completed at 13:00 GMT, we re-enabled several key services and integrations, and continued doing so throughout the day to minimise customer impact and keep as many services as possible operating.
- Subsequently, we undertook a lengthy rebuild of our replica databases, which involved taking a full backup of the Master database, copying the backup files to the relevant server and restoring the replica databases.
- Post-rebuild of the replica databases, all remaining services were re-enabled.
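The rebuild steps above can be sketched as follows. This is an illustrative outline only: the hostnames, file paths and choice of mysqldump are assumptions, not confirmed details of the incident.

```shell
# Hypothetical sketch of the replica rebuild; hostnames, paths and tooling
# (mysqldump) are assumptions, not confirmed details from this incident.

# 1. Take a full, consistent backup of the Master, embedding the binary-log
#    coordinates (--master-data=1 writes a CHANGE MASTER TO statement into the dump).
mysqldump --all-databases --single-transaction --master-data=1 > master_full.sql

# 2. Copy the backup files to the replica server.
scp master_full.sql replica-host:/var/backups/

# 3. Restore the replica databases and resume replication from the recorded position.
mysql < /var/backups/master_full.sql
mysql -e "START SLAVE;"
```

With `--single-transaction`, the backup is taken without locking the Master for the duration of the dump, which matters when the Master is the only serving database.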
What time was service restored?
- 08/11/2022 13:00 GMT – Primary services and integrations
- 08/11/2022 15:00 – 21:00 GMT – Secondary services and integrations
- 09/11/2022 10:00 GMT – All services and integrations operational
What caused the service to fail?
- V2 Order Manager search queries resulted in data-intensive transactions being executed against the Master database.
- The temporary tables created to support these queries exhausted the disk space in the server's temporary directory. This triggered a default MySQL setting that caused the Master database to stop writing to the replication logs, leaving the replica databases non-functional.
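The report does not name the MySQL setting involved. A plausible candidate, offered only as an assumption, is binlog_error_action, which in older MySQL releases defaulted to IGNORE_ERROR: if a binary-log write fails (for example because the disk holding the temporary directory is full), the server disables binary logging and carries on serving traffic, leaving the replicas with nothing to replay.

```ini
# Illustrative my.cnf fragment (assumed, not taken from the affected server)
[mysqld]
tmpdir              = /tmp           # on-disk temporary tables for large queries land here
binlog_error_action = IGNORE_ERROR   # legacy default: on a binlog write error,
                                     # stop logging but keep the server running
```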
What is being done to prevent recurrence?
- Platform V2 updated to use the AWS Elasticsearch service for Order Manager searches, removing these data-intensive queries from the Master database
- MySQL setting updated to avoid the scenario in which the replication service fails and requires a rebuild
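A hardened configuration along these lines would make a binary-log write failure loud and immediate instead of silent; the specific variable and values are assumptions, as the report does not state which setting was changed.

```ini
# Illustrative hardening (assumed values, not the confirmed change)
[mysqld]
binlog_error_action = ABORT_SERVER      # fail fast rather than silently stop replication logs
tmpdir              = /data/mysql-tmp   # temporary directory on a volume with headroom
```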
Are there any further actions?
- Additional monitoring has been configured to detect a similar occurrence earlier.
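As one concrete form such monitoring could take (a minimal sketch; the threshold, the path and the Kornit X monitoring stack are all assumptions), a periodic check of free space on the database server's temporary directory would flag the condition before the binary log is affected:

```python
import shutil

def disk_space_alert(path: str, min_free_fraction: float = 0.20) -> bool:
    """Return True when the filesystem holding `path` has less than
    `min_free_fraction` of its capacity free (i.e. an alert should fire)."""
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) < min_free_fraction

# Example: check the MySQL temporary directory (path is an assumption)
# if disk_space_alert("/tmp"):
#     page_on_call_engineer()   # hypothetical alerting hook
```

Run on a schedule (cron, or a check in an existing monitoring agent), this turns "tmpdir is filling up" into an early warning rather than a replication outage.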