Replication timeout and retransmission table issues when using Keycloak on 5 nodes

Wednesday, 22 August 2018

*SUMMARY*

I am currently trying to build an authentication app using Keycloak
deployed as a Docker service. My infrastructure is as follows:

   - Server : CentOS 7
   - Docker : 17.06.2-ce, with weaveworks net plugin
   - Keycloak : 3.3.0-Final
   - PostgreSQL : 9.4
   - 5 Keycloak instances deployed as a cluster in a Docker swarm

I encounter an issue with the cache when building up the cluster. I do not
get any errors while building a 2-node cluster, but when scaling to 5
nodes, many warnings like this one appear:

WARN [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-3)
JGRP000041: bd3eeb23695b: message d8896fbba960::14 not found in
retransmission table

When these messages begin to appear, the containers stop responding
correctly and eventually some of them stop their Keycloak instance. This
kind of error has occurred on various occasions:

   - When starting the services, in which case the app does not even manage
   to start.
   - A few hours after a correct start of Keycloak, even with little
   activity on the nodes.

*SYMPTOMS*

When the app crashes I see:

1) Numerous warnings like the one shown above that seem to repeat (i.e. the
same messages from one node are reported as not found "forever"):

2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in
retransmission table
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in
retransmission table
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in
retransmission table
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in
retransmission table
...
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in
retransmission table
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in
retransmission table
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in
retransmission table
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in
retransmission table
...

2) The node from which the messages should come displays various cache
errors:

2018-08-22 09:58:37,130 ERROR
[org.infinispan.interceptors.InvocationContextInterceptor]
(ServerService Thread Pool -- 61) ISPN000136: Error executing command
PutKeyValueCommand, writing keys [cluster-start-time]:
org.infinispan.util.concurrent.TimeoutException: Replication timeout

2018-08-22 09:58:37,149 ERROR [org.jboss.msc.service.fail]
(ServerService Thread Pool -- 61) MSC000001: Failed to start service
jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth:
org.jboss.msc.service.StartException in service
jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth:
java.lang.RuntimeException: RESTEASY003325: Failed to construct public
org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)

2018-08-22 09:58:37,178 ERROR
[org.jboss.as.controller.management-operation] (Controller Boot
Thread) WFLYCTL0013: Operation ("add") failed - address:
([("deployment" => "keycloak-server.war")]) - failure description:
{"WFLYCTL0080: Failed services" =>
{"jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth"
=> "java.lang.RuntimeException: RESTEASY003325: Failed to construct
public
org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
    Caused by: java.lang.RuntimeException: RESTEASY003325: Failed to
construct public
org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
    Caused by: org.infinispan.util.concurrent.TimeoutException:
Replication timeout"}}

2018-08-22 09:58:37,409 WARN
[org.infinispan.topology.CacheTopologyControlCommand] (ServerService
Thread Pool -- 60) ISPN000071: Caught exception when handling command
CacheTopologyControlCommand{cache=actionTokens, type=LEAVE,
sender=d8896fbba960, joinInfo=null, topologyId=0, rebalanceId=0,
currentCH=null, pendingCH=null, availabilityMode=null,
actualMembers=null, throwable=null, viewId=3}:
java.lang.IllegalArgumentException: A cache topology's pending
consistent hash must contain all the current consistent hash's members

Then, this node usually stops all caches and Keycloak.
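
To check that the warnings in 1) really are the same few messages being
reported over and over, the JGRP000041 lines can be counted per message id.
The following is only a minimal sketch: the function name and the regular
expression are my own, and it assumes the log text has the same layout as
the excerpts above (including entries wrapped over several lines).

```python
import re
from collections import Counter

# Matches one JGRP000041 warning, tolerating line wraps inside the entry.
# "logger" is the address printed right after the code (the node writing the
# warning); "origin" and "seqno" are the two halves of the message id.
JGRP000041 = re.compile(
    r"JGRP000041:\s+(?P<logger>\S+):\s+message\s+"
    r"(?P<origin>\S+)::(?P<seqno>\d+)\s+not\s+found\s+in\s+"
    r"retransmission\s+table"
)


def count_missing(log_text):
    """Return a Counter mapping (origin, seqno) to the number of warnings."""
    counts = Counter()
    for match in JGRP000041.finditer(log_text):
        key = (match.group("origin"), int(match.group("seqno")))
        counts[key] += 1
    return counts
```

Calling count_missing() on the contents of a node's log and printing the
most common entries shows whether a handful of sequence numbers account for
most of the warnings, which is what the excerpts above suggest.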

*CONFIG AND SOLUTION ATTEMPTED*

I have unsuccessfully tried to:

   - Change the timeout parameters on the various Keycloak caches (in order
   to give the cluster more time to stabilize).
   - Change some default values for the NAKACK2 protocol in the Keycloak
   configuration file. The aim was to limit traffic between nodes and
   increase the number of elements in the retransmission table so that
   messages are not lost before all nodes have received them. However,
   these changes did not lessen my issues.

The configuration I am currently using is the following:

<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
    <cache-container name="keycloak" jndi-name="infinispan/Keycloak">
        <transport lock-timeout="500000"/>
        <local-cache name="realms">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <local-cache name="users">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <distributed-cache name="sessions" mode="SYNC" owners="3"/>
        <distributed-cache name="authenticationSessions" mode="SYNC" owners="3"/>
        <distributed-cache name="offlineSessions" mode="SYNC" owners="1"/>
        <distributed-cache name="loginFailures" mode="SYNC" owners="1"/>
        <local-cache name="authorization">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <replicated-cache name="work" mode="SYNC"/>
        <local-cache name="keys">
            <eviction max-entries="1000" strategy="LRU"/>
            <expiration max-idle="3600000"/>
        </local-cache>
        <distributed-cache name="actionTokens" mode="SYNC" owners="2">
            <eviction max-entries="-1" strategy="NONE"/>
            <expiration max-idle="-1" interval="300000"/>
        </distributed-cache>
    </cache-container>
...
    <cache-container name="ejb" aliases="sfsb" default-cache="dist"
                     module="org.wildfly.clustering.ejb.infinispan">
        <transport lock-timeout="300000"/>
        <distributed-cache name="dist">
            <locking isolation="REPEATABLE_READ"/>
            <transaction mode="BATCH"/>
            <file-store/>
        </distributed-cache>
    </cache-container>
</subsystem>
...
<protocol type="pbcast.NAKACK2">
    <property name="use_mcast_xmit">false</property>
    <property name="xmit_table_num_rows">200</property>
</protocol>

Do you have any idea why this is happening and how I should update my
configuration to solve this issue?

-- 
*Damien Douteaux*
