[keycloak-user] Database problems running a clustered multi-site keycloak on MariaDB

Fri Oct 4 09:54:37 EDT 2019

Hello,

We're running into serious errors when running Keycloak on a multi-site cluster with MariaDB as our multi-master database. Our setup is similar to https://www.keycloak.org/docs/latest/server_installation/index.html#crossdc-mode, with Keycloak 7.0.0 and MariaDB 10.1.37. Each site writes to its own database cluster, and we expected MariaDB to handle the replication and transactions correctly.

It works well until we get errors like the following on the database, after which everything crashes:

2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in 'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: Failed to apply app buffer: seqno: 591931, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in 'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx: source: 4f98589f-e5bd-11e9-9eb9-12b92fd5aeef version: 3 local: 0 state: APPLYING flags: 1 conn_id: 395 trx_id: 991166 seqnos (l: 18625, g: 591931, s: 591930, d: 584704, ts: 31567167461519)
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx 591931 4 times
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Node consistency compromized, aborting...
.....................

From our analysis, it seems that a transaction could not be replayed, which caused the database to shut down to protect consistency. This seems to happen when multiple writes race with each other. Looking into it, we found the following passage in https://galeracluster.com/library/kb/trouble/multi-master-conflicts.html: "When two transactions are conflicting, the later of the two is rolled back by the cluster. The client application registers this rollback as a deadlock error. Ideally, the client application should retry the deadlocked transaction. However, not all client applications have this logic built in."
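
To make the quoted recommendation concrete, here is a minimal sketch of a client-side retry for deadlocked transactions. It is an illustration only and not how Keycloak works: the names run_with_retry and is_deadlock are hypothetical, and it assumes that the driver exception exposes errno and/or sqlstate attributes. It relies on MariaDB reporting a deadlock as error 1213 (ER_LOCK_DEADLOCK) with SQLSTATE 40001. It only covers the client-side rollback described in the passage; it is not established that it would prevent the applier failure shown in the log above.

```python
import time

ER_LOCK_DEADLOCK = 1213           # MariaDB/MySQL error code for a deadlock
SQLSTATE_SERIALIZATION = "40001"  # SQLSTATE that accompanies that error


def is_deadlock(exc):
    """Return True if exc looks like a deadlock / conflict rollback.

    Assumption: the driver exception carries `errno` and/or `sqlstate`
    attributes. Adapt this check to the driver actually in use.
    """
    return (getattr(exc, "errno", None) == ER_LOCK_DEADLOCK
            or getattr(exc, "sqlstate", None) == SQLSTATE_SERIALIZATION)


def run_with_retry(transaction, max_attempts=4, base_delay=0.05, sleep=time.sleep):
    """Run transaction() and retry it when it fails with a deadlock.

    `transaction` must perform the whole unit of work (begin, statements,
    commit) so that a retry replays it from the start.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except Exception as exc:
            # Re-raise anything that is not a deadlock, and give up
            # once the last allowed attempt has failed.
            if not is_deadlock(exc) or attempt == max_attempts:
                raise
            # Linear backoff gives the conflicting writer time to finish.
            sleep(base_delay * attempt)
```

The callable passed to run_with_retry has to be safe to execute more than once, since a rolled-back transaction is replayed from the beginning.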

Does anyone else have a similar setup? If yes, have you encountered this problem? Is there a known resolution?

Best regards,

Alistair Doswald