Hello Jan,
Thank you for your answer. The JIRA issue you referred me to is very interesting, as are
the other possible routes of investigation you described. Unfortunately, at the moment I
don't have answers to your questions, as we hit this issue in a pre-prod environment
and there are too many constraints to run the necessary tests. For now, we're
moving to a write-one/read-many setup as a workaround. In parallel, however, we're
setting up a "lab" environment to investigate these issues: first to reproduce the
problem consistently, then to see whether moving to MariaDB 10.3.5+ solves it, and to
dig deeper if it doesn't. I'll post the results here whether we succeed or fail, but
it may take a few weeks.
Best regards,
Alistair Doswald
From: Jan Lieskovsky <jlieskov(a)redhat.com>
Sent: Monday, 7 October 2019 11:33
To: Doswald Alistair <alistair.doswald(a)elca.ch>
Cc: keycloak-user <keycloak-user(a)lists.jboss.org>; Poiffaut Romain
<romain.poiffaut(a)elca.ch>; Gutermann Bernard <bernard.gutermann(a)elca.ch>
Subject: Re: [keycloak-user] Database problems running a clustered multi-site keycloak on
MariaDB
Hello Alistair,
On Fri, Oct 4, 2019 at 4:20 PM Doswald Alistair
<alistair.doswald@elca.ch<mailto:alistair.doswald@elca.ch>> wrote:
Hello,
We're running into some serious errors when running Keycloak on a multi-site
cluster with MariaDB as our multi-master database. Our setup is similar to
https://www.keycloak.org/docs/latest/server_installation/index.html#cross..., with
Keycloak 7.0.0 and MariaDB 10.1.37. Each site writes to its own database cluster, and
we expected MariaDB to handle the replication and transactions correctly.
It works well until we get the following types of errors on the database, after which
everything crashes:
2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1
event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in
'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the
event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
See
MDEV-15405<https://jira.mariadb.org/browse/MDEV-15405> -- could you possibly retry
with MariaDB 10.3.5+ and check whether the issue is still there?
If the MariaDB upgrade doesn't help, I would retry with "showSql" enabled
(start Keycloak with "-Dkeycloak.connectionsJpa.showSql=true"),
reproduce the issue again, and try to isolate the SQL statement or set of SQL statements
that leads to this state. Maybe after
repeating the scenario/crash a couple of times, such a set can be identified.
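[Editorial aside: a minimal sketch of enabling that property on a standalone
Keycloak 7 server; the installation path is assumed.]

```shell
# Start Keycloak with Hibernate SQL logging enabled, so every statement
# issued against MariaDB appears in the server log. This helps isolate
# the DELETE that the slave later fails to apply.
cd /opt/keycloak   # assumed install location
./bin/standalone.sh -Dkeycloak.connectionsJpa.showSql=true
```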
Having identified that set of SQL statements, the question is:
* whether this is another MariaDB bug (hitting the same error msg & error code) via
those SQL statements (thus something to be fixed on the MariaDB side), or
* whether this is a serialization issue of some kind (.. it happens sometimes because the SQL
slave failed to ...). These circumstances would need to be identified.
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply
warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: Failed to apply app buffer: seqno:
591931, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1
event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in
'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the
event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply
warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx: source:
4f98589f-e5bd-11e9-9eb9-12b92fd5aeef version: 3 local: 0 state: APPLYING flags: 1 conn_id:
395 trx_id: 991166 seqnos (l: 18625, g: 591931, s: 591930, d: 584704, ts:
31567167461519)
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx 591931 4 times
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Node consistency compromized,
aborting...
.....................
From our analysis, it seems that a transaction could not be
replayed, which caused the database to shut down to protect consistency.
Were you able to identify at which part of the code this transaction deadlock happens?
After performing which action / steps? Or does it simply
happen some time after Keycloak is started with that setup, every time? Did you
try different Keycloak / MariaDB versions?
This seems to happen with race conditions from multiple concurrent writes. Looking into
it, we found the following passage in
https://galeracluster.com/library/kb/trouble/multi-master-conflicts.html:
"When two transactions are conflicting, the later of the two is rolled back by the
cluster. The client application registers this rollback as a deadlock error. Ideally, the
client application should retry the deadlocked transaction. However, not all client
applications have this logic built in."
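[Editorial aside: the retry logic that passage describes could be sketched as below.
This is an illustrative helper, not Keycloak code; MariaDB reports a Galera
multi-master conflict to the client as a deadlock, error code 1213 / SQLState "40001".]

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public class DeadlockRetry {
    // MariaDB/Galera signals a cluster write conflict to the client as a
    // deadlock: error code 1213 (ER_LOCK_DEADLOCK), SQLState "40001".
    static final int ER_LOCK_DEADLOCK = 1213;

    // Run a transaction, retrying it when the cluster rolls it back
    // as the loser of a multi-master conflict.
    static <T> T withRetry(Callable<T> tx, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return tx.call();
            } catch (SQLException e) {
                if (e.getErrorCode() != ER_LOCK_DEADLOCK || attempt >= maxAttempts) {
                    throw e; // not a conflict, or out of retries
                }
                // Conflict: the cluster rolled our transaction back; retry.
            }
        }
    }
}
```

In a real application the Callable would open a JDBC transaction (e.g. the DELETE on
EVENT_ENTITY) and roll it back before each retry; the names here are hypothetical.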
Does anyone else have a similar setup? If yes, have you encountered this problem? Is there
a known resolution?
Best regards,
Alistair Doswald
Thank you && Regards, Jan
--
Jan iankko Lieskovsky / Keycloak / RH-SSO Team
_______________________________________________
keycloak-user mailing list
keycloak-user@lists.jboss.org<mailto:keycloak-user@lists.jboss.org>
https://lists.jboss.org/mailman/listinfo/keycloak-user