Hello Jan,
Thank you for your answer. The JIRA issue you referred me to is very interesting, as are
the other possible routes of investigation you described. Unfortunately, at the moment I
don't have answers to your questions, as we hit this issue in a pre-prod environment
and there are too many constraints to run the necessary tests. For now, we're
moving to a write-one/read-many setup as a workaround. In parallel, however, we're
setting up a "lab" environment to investigate these issues: first to reproduce the
problem consistently, then to see whether moving to MariaDB 10.3.5+ solves it, and to
dig deeper if it doesn't. I'll post the results here whether we succeed or fail, but
it may take a few weeks.
Best regards,
Alistair Doswald
From: Jan Lieskovsky <jlieskov(a)redhat.com>
Sent: Monday, 7 October 2019 11:33
To: Doswald Alistair <alistair.doswald(a)elca.ch>
Cc: keycloak-user <keycloak-user(a)lists.jboss.org>; Poiffaut Romain
<romain.poiffaut(a)elca.ch>; Gutermann Bernard <bernard.gutermann(a)elca.ch>
Subject: Re: [keycloak-user] Database problems running a clustered multi-site keycloak on
MariaDB
Hello Alistair,
On Fri, Oct 4, 2019 at 4:20 PM Doswald Alistair
<alistair.doswald@elca.ch<mailto:alistair.doswald@elca.ch>> wrote:
Hello,
We're running into some serious errors when running Keycloak on a multi-site
cluster with MariaDB as our multi-master database. Our setup is similar to
https://www.keycloak.org/docs/latest/server_installation/index.html#cross..., with
Keycloak 7.0.0 and MariaDB 10.1.37. Each site writes to its own database cluster, and
we expected MariaDB to handle the replication and transactions correctly.
It works well until we get the following types of errors on the database, after which
everything crashes:
2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1
event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in
'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the
event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
See
MDEV-15405<https://jira.mariadb.org/browse/MDEV-15405> -- could you possibly retry
with MariaDB 10.3.5+ and check whether the issue is still there?
If the MariaDB upgrade doesn't help, I would retry with "showSql" enabled
(start Keycloak with "-Dkeycloak.connectionsJpa.showSql=true"),
reproduce the issue again, and try to isolate the SQL statement or set of SQL statements
that leads to this state. Maybe after
repeating the scenario/crash a couple of times, such a set can be identified.
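[Editorial aside: a minimal sketch of enabling that property on a standalone
Keycloak 7 server; the installation path is assumed.]

```shell
# Start Keycloak with Hibernate SQL logging enabled, so every statement
# issued against MariaDB appears in the server log. This helps isolate
# the DELETE that the slave later fails to apply.
cd /opt/keycloak   # assumed install location
./bin/standalone.sh -Dkeycloak.connectionsJpa.showSql=true
```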
Having identified that set of SQL statements, the question is:
* whether this is another MariaDB bug (hitting the same error msg & error code) via
those SQL statements (thus something to be fixed on the MariaDB side), or
* whether this is a serialization issue of some kind (.. it happens sometimes because the SQL
slave failed to ...). These circumstances would need to be identified.
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply
warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: Failed to apply app buffer: seqno:
591931, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1
event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in
'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the
event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply
warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx: source:
4f98589f-e5bd-11e9-9eb9-12b92fd5aeef version: 3 local: 0 state: APPLYING flags: 1 conn_id:
395 trx_id: 991166 seqnos (l: 18625, g: 591931, s: 591930, d: 584704, ts:
31567167461519)
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx 591931 4 times
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Node consistency compromized,
aborting...
.....................
From our analysis, it seems that a transaction could not be
replayed, which caused the database to shut down to protect consistency.
Were you able to identify at which part of the code this transaction deadlock happens?
After performing which action / steps? Or does it simply
happen some time after Keycloak is started with that setup, every time? Did you
try different Keycloak / MariaDB versions?
This seems to happen with race conditions from multiple concurrent writes. Looking into
it, we found the following passage in
https://galeracluster.com/library/kb/trouble/multi-master-conflicts.html:
"When two transactions are conflicting, the later of the two is rolled back by the
cluster. The client application registers this rollback as a deadlock error. Ideally, the
client application should retry the deadlocked transaction. However, not all client
applications have this logic built in."
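[Editorial aside: the retry logic that passage describes could be sketched as below.
This is an illustrative helper, not Keycloak code; MariaDB reports a Galera
multi-master conflict to the client as a deadlock, error code 1213 / SQLState "40001".]

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public class DeadlockRetry {
    // MariaDB/Galera signals a cluster write conflict to the client as a
    // deadlock: error code 1213 (ER_LOCK_DEADLOCK), SQLState "40001".
    static final int ER_LOCK_DEADLOCK = 1213;

    // Run a transaction, retrying it when the cluster rolls it back
    // as the loser of a multi-master conflict.
    static <T> T withRetry(Callable<T> tx, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return tx.call();
            } catch (SQLException e) {
                if (e.getErrorCode() != ER_LOCK_DEADLOCK || attempt >= maxAttempts) {
                    throw e; // not a conflict, or out of retries
                }
                // Conflict: the cluster rolled our transaction back; retry.
            }
        }
    }
}
```

In a real application the Callable would open a JDBC transaction (e.g. the DELETE on
EVENT_ENTITY) and roll it back before each retry; the names here are hypothetical.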
Does anyone else have a similar setup? If yes, have you encountered this problem? Is there
a known resolution?
Best regards,
Alistair Doswald
Thank you && Regards, Jan
--
Jan iankko Lieskovsky / Keycloak / RH-SSO Team
_______________________________________________
keycloak-user mailing list
keycloak-user@lists.jboss.org<mailto:keycloak-user@lists.jboss.org>
https://lists.jboss.org/mailman/listinfo/keycloak-user