[keycloak-user] Database problems running a clustered multi-site keycloak on MariaDB

Tue Oct 8 04:41:02 EDT 2019

Hello Jan,

Thank you for your answer. The JIRA issue you referred me to is very interesting, as are the other possible routes of investigation you described. Unfortunately, I can’t answer your questions at the moment: we hit this issue in a pre-prod environment, and there are too many constraints there to run the necessary tests. For now, we’re moving to a write-one/read-many setup as a workaround. In parallel, we’re setting up a “lab” environment to look into these issues: first to reproduce the problem consistently, then to see whether moving to MariaDB 10.3.5+ solves it, and to investigate further if it does not. I’ll post the results here, whether we succeed or fail, but it may take a few weeks.

Best regards,

Alistair Doswald

From: Jan Lieskovsky <jlieskov at redhat.com>
Sent: Monday, 7 October 2019 11:33
To: Doswald Alistair <alistair.doswald at elca.ch>
Cc: keycloak-user <keycloak-user at lists.jboss.org>; Poiffaut Romain <romain.poiffaut at elca.ch>; Gutermann Bernard <bernard.gutermann at elca.ch>
Subject: Re: [keycloak-user] Database problems running a clustered multi-site keycloak on MariaDB

Hello Alistair,

On Fri, Oct 4, 2019 at 4:20 PM Doswald Alistair <alistair.doswald at elca.ch> wrote:
Hello,

We're running into some serious errors when running Keycloak on a multi-site cluster with MariaDB as our multi-master database. We have a setup similar to https://www.keycloak.org/docs/latest/server_installation/index.html#crossdc-mode, with Keycloak 7.0.0 and MariaDB 10.1.37. Each site writes to its own database cluster, and we assumed that MariaDB would handle the replication and transactions correctly.

It works well, until we get the following types of errors on the database, and then everything crashes:

2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in 'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032

See MDEV-15405<https://jira.mariadb.org/browse/MDEV-15405> -- could you retry with MariaDB 10.3.5+ and check whether the issue is still there?

If the MariaDB upgrade doesn't help, I would retry with "showSql" enabled (start Keycloak with "-Dkeycloak.connectionsJpa.showSql=true"),
reproduce the issue again and try to isolate the SQL statement, or set of SQL statements, that leads to this state. After
repeating the scenario / crash a couple of times, it may be possible to identify that set.

Once that set of SQL statements is identified, the question is:

  *   whether this is another MariaDB bug (hitting the same error message and error code) triggered by those SQL statements, and thus something to be fixed on the MariaDB side, or
  *   whether this is a serialization issue of some kind (it happens sometimes because the SQL slave failed to ...), in which case the circumstances would need to be identified.
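
As an illustration of the isolation step described above, a minimal Python sketch could filter captured log output down to the write statements that touch the table named in the error. The table name comes from the log in this thread; the function name and the sample lines are hypothetical, and real showSql output may carry a different prefix and different column lists.

```python
import re

# Table taken from the error log in this thread.
TABLE = "EVENT_ENTITY"

# Matches INSERT / UPDATE / DELETE statements that name the table directly.
WRITE_STMT = re.compile(
    r"\b(insert\s+into|update|delete\s+from)\s+" + re.escape(TABLE) + r"\b",
    re.IGNORECASE,
)


def writes_to_table(lines):
    """Return the log lines that contain a write statement against TABLE."""
    return [line.rstrip("\n") for line in lines if WRITE_STMT.search(line)]


# Hypothetical sample lines, only to illustrate the filter.
sample = [
    "Hibernate: select e.ID from EVENT_ENTITY e where e.REALM_ID=?",
    "Hibernate: delete from EVENT_ENTITY where ID=?",
    "Hibernate: insert into EVENT_ENTITY (ID, TYPE) values (?, ?)",
    "Hibernate: update REALM set NAME=? where ID=?",
]

for stmt in writes_to_table(sample):
    print(stmt)
```

Comparing the filtered statements from several reproductions of the crash may help narrow down which writes precede the failure.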

2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: Failed to apply app buffer: seqno: 591931, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table cloudtrust-int-keycloak.EVENT_ENTITY; Can't find record in 'EVENT_ENTITY', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 883, Internal MariaDB error code: 1032
2019-10-03 14:09:46 140205469263616 [Warning] WSREP: RBR event 2 Delete_rows_v1 apply warning: 120, 591931
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx: source: 4f98589f-e5bd-11e9-9eb9-12b92fd5aeef version: 3 local: 0 state: APPLYING flags: 1 conn_id: 395 trx_id: 991166 seqnos (l: 18625, g: 591931, s: 591930, d: 584704, ts: 31567167461519)
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx 591931 4 times
2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Node consistency compromized, aborting...
.....................
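
When comparing several crashes, it can help to pull the failing event type, table, error code and sequence number out of the log automatically. The following is a minimal Python sketch under the assumption that the log lines have the shape shown above; the function name and regular expressions are illustrative and not part of any MariaDB or Galera tooling.

```python
import re

# "Could not execute <event> event on table <schema.table>; ... Error_code: <n>"
APPLY_ERROR = re.compile(
    r"Could not execute (?P<event>\w+) event on table "
    r"(?P<table>[^;]+); .*?Error_code: (?P<code>\d+)"
)
# "Failed to apply trx <seqno> <n> times"
FAILED_TRX = re.compile(r"Failed to apply trx (?P<seqno>\d+) (?P<tries>\d+) times")


def summarize(lines):
    """Return (distinct apply errors, failed transactions) found in the lines."""
    errors, failed = [], []
    for line in lines:
        m = APPLY_ERROR.search(line)
        if m:
            item = (m["event"], m["table"], int(m["code"]))
            if item not in errors:  # the same error repeats on every retry
                errors.append(item)
        m = FAILED_TRX.search(line)
        if m:
            failed.append((int(m["seqno"]), int(m["tries"])))
    return errors, failed


# Two lines copied from the log in this thread.
log = [
    "2019-10-03 14:09:46 140205469263616 [ERROR] Slave SQL: Could not execute "
    "Delete_rows_v1 event on table cloudtrust-int-keycloak.EVENT_ENTITY; "
    "Can't find record in 'EVENT_ENTITY', Error_code: 1032; handler error "
    "HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 883, "
    "Internal MariaDB error code: 1032",
    "2019-10-03 14:09:46 140205469263616 [ERROR] WSREP: Failed to apply trx 591931 4 times",
]

errors, failed = summarize(log)
print(errors)
print(failed)
```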

From our analysis, it seems that a transaction could not be replayed, which caused the database to shut down to protect consistency.

Were you able to identify in which part of the code this transaction deadlock happens? After performing what action / steps? Or is it just
that Keycloak is started with that setup and it happens after some time, every time? Did you try different Keycloak / MariaDB versions?

This seems to happen with race conditions from multiple writes. Looking into it, we found the document https://galeracluster.com/library/kb/trouble/multi-master-conflicts.html, which contains this passage: "When two transactions are conflicting, the later of the two is rolled back by the cluster. The client application registers this rollback as a deadlock error. Ideally, the client application should retry the deadlocked transaction. However, not all client applications have this logic built in."
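
The client-side retry logic that the quoted passage refers to can be sketched roughly as follows. This is a generic Python illustration, not Keycloak code: DeadlockError is a hypothetical stand-in for whatever exception the database driver raises on a deadlock, and run_with_retry, its parameters and the simulated transaction are assumptions made for the example.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock exception."""


def run_with_retry(transaction, max_attempts=4, base_delay=0.05, sleep=time.sleep):
    """Run transaction(); on a deadlock, back off and retry up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the conflict
            # Exponential backoff with jitter so competing writers spread out.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            sleep(delay)


# Simulated transaction that conflicts twice, then succeeds.
attempts = {"count": 0}


def flaky_transaction():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise DeadlockError("simulated conflict")
    return "committed"


result = run_with_retry(flaky_transaction, sleep=lambda _: None)
print(result, attempts["count"])
```

The transaction passed in has to be safe to run again from the start, since each retry re-executes it as a whole.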

Does anyone else have a similar setup? If yes, have you encountered this problem? Is there a known resolution?

Best regards,

Alistair Doswald

Thank you && Regards, Jan
--
Jan iankko Lieskovsky / Keycloak / RH-SSO Team
