[keycloak-user] Replication timeout and retransmission table issues when using Keycloak on 5 nodes

Sebastian Laskawiec slaskawi at redhat.com
Mon Aug 27 04:06:03 EDT 2018


Let me add +Bela Ban <bban at redhat.com> to this thread. Maybe he has an
idea of what happened.

From another email thread, I saw that the suggestion was to try increasing
the FD_ALL timeout. Have you tried that?
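
In case it helps, here is a rough sketch of what that change could look
like, assuming you configure JGroups through the jgroups subsystem of your
Keycloak configuration file (the same place as your NAKACK2 block). The
timeout and interval values are only illustrative assumptions on my side,
not tested recommendations, and the protocol belongs inside whichever stack
you actually use:

```xml
<!-- Sketch only: the values below are assumptions, tune them for your network -->
<protocol type="FD_ALL">
    <!-- time (ms) without a heartbeat before a member is suspected -->
    <property name="timeout">60000</property>
    <!-- interval (ms) at which heartbeats are sent -->
    <property name="interval">15000</property>
</protocol>
```

The idea is that a longer timeout gives slow nodes more time before they
are suspected and excluded from the cluster view.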

On Wed, Aug 22, 2018 at 6:41 PM Damien Douteaux <damien.douteaux at gmail.com>
wrote:

> *SUMMARY*
>
> I am currently trying to build an authentication app using Keycloak
> deployed as a Docker service. My infrastructure is as follows:
>
>    - Server : CentOS 7
>    - Docker : 17.06.2-ce, with weaveworks net plugin
>    - Keycloak : 3.3.0-Final
>    - PostgreSQL : 9.4
>    - 5 Keycloak instances deployed as a cluster in a Docker swarm
>
> I encounter an issue with the cache when building up the cluster. I do not
> have any errors while building a 2-node cluster, but when scaling to 5
> nodes, many warnings like this one appear:
>
> WARN [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-3)
> JGRP000041: bd3eeb23695b: message d8896fbba960::14 not found in
> retransmission table
>
> When these messages begin to appear, the containers stop responding
> correctly and eventually some of them stop their instance of Keycloak. This
> kind of error has occurred on various occasions:
>
>    - When starting the services, so the app does not even manage to
>    start.
>    - A few hours after a correct start of Keycloak, even with little
>    activity on the nodes.
>
> *SYMPTOMS*
>
> When the app crashes I see :
>
> 1) Numerous logs like the one shown above that seem to repeat (i.e. the
> same messages coming from a node are reported as not found "forever"):
>
> 2018-08-22 09:59:33,346 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in
> retransmission table
> 2018-08-22 09:59:33,346 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in
> retransmission table
> 2018-08-22 09:59:33,346 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in
> retransmission table
> 2018-08-22 09:59:33,346 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in
> retransmission table
> ...
> 2018-08-22 09:59:33,040 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in
> retransmission table
> 2018-08-22 09:59:33,040 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in
> retransmission table
> 2018-08-22 09:59:33,040 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in
> retransmission table
> 2018-08-22 09:59:33,040 WARN
> [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
> JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in
> retransmission table
> ...
>
> 2) The node from which the messages should come displays various cache
> errors:
>
> 2018-08-22 09:58:37,130 ERROR
> [org.infinispan.interceptors.InvocationContextInterceptor]
> (ServerService Thread Pool -- 61) ISPN000136: Error executing command
> PutKeyValueCommand, writing keys [cluster-start-time]:
> org.infinispan.util.concurrent.TimeoutException: Replication timeout
>
> 2018-08-22 09:58:37,149 ERROR [org.jboss.msc.service.fail]
> (ServerService Thread Pool -- 61) MSC000001: Failed to start service
>
> jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth:
> org.jboss.msc.service.StartException in service
>
> jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth:
> java.lang.RuntimeException: RESTEASY003325: Failed to construct public
>
> org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
>
> 2018-08-22 09:58:37,178 ERROR
> [org.jboss.as.controller.management-operation] (Controller Boot
> Thread) WFLYCTL0013: Operation ("add") failed - address:
> ([("deployment" => "keycloak-server.war")]) - failure description:
> {"WFLYCTL0080: Failed services" =>
>
> {"jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth"
> => "java.lang.RuntimeException: RESTEASY003325: Failed to construct
> public
> org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
>     Caused by: java.lang.RuntimeException: RESTEASY003325: Failed to
> construct public
>
> org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
>     Caused by: org.infinispan.util.concurrent.TimeoutException:
> Replication timeout"}}
>
> 2018-08-22 09:58:37,409 WARN
> [org.infinispan.topology.CacheTopologyControlCommand] (ServerService
> Thread Pool -- 60) ISPN000071: Caught exception when handling command
> CacheTopologyControlCommand{cache=actionTokens, type=LEAVE,
> sender=d8896fbba960, joinInfo=null, topologyId=0, rebalanceId=0,
> currentCH=null, pendingCH=null, availabilityMode=null,
> actualMembers=null, throwable=null, viewId=3}:
> java.lang.IllegalArgumentException: A cache topology's pending
> consistent hash must contain all the current consistent hash's members
>
> Then, this node usually stops all caches and Keycloak.
>
> *CONFIG AND SOLUTION ATTEMPTED*
>
> I have unsuccessfully tried to :
>
>    - Change timeout params on the various caches of Keycloak (in order to
>    give more time to stabilize the cluster)
>    - Change some default values for protocol NAKACK2 in the Keycloak
>    configuration file. The aim of this was to limit traffic between nodes
>    and increase the number of elements in the retransmission table so that
>    messages are not lost before all nodes have received them. However, my
>    issues are not lessened by those changes.
>
> The configuration I am currently using is the following :
>
> <subsystem xmlns="urn:jboss:domain:infinispan:4.0">
>     <cache-container name="keycloak" jndi-name="infinispan/Keycloak">
>         <transport lock-timeout="500000"/>
>         <local-cache name="realms">
>             <eviction max-entries="10000" strategy="LRU"/>
>         </local-cache>
>         <local-cache name="users">
>             <eviction max-entries="10000" strategy="LRU"/>
>         </local-cache>
>         <distributed-cache name="sessions" mode="SYNC" owners="3"/>
>         <distributed-cache name="authenticationSessions" mode="SYNC" owners="3"/>
>         <distributed-cache name="offlineSessions" mode="SYNC" owners="1"/>
>         <distributed-cache name="loginFailures" mode="SYNC" owners="1"/>
>         <local-cache name="authorization">
>             <eviction max-entries="10000" strategy="LRU"/>
>         </local-cache>
>         <replicated-cache name="work" mode="SYNC"/>
>         <local-cache name="keys">
>             <eviction max-entries="1000" strategy="LRU"/>
>             <expiration max-idle="3600000"/>
>         </local-cache>
>         <distributed-cache name="actionTokens" mode="SYNC" owners="2">
>             <eviction max-entries="-1" strategy="NONE"/>
>             <expiration max-idle="-1" interval="300000"/>
>         </distributed-cache>
>     </cache-container>
> ...
>     <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
>         <transport lock-timeout="300000"/>
>         <distributed-cache name="dist">
>             <locking isolation="REPEATABLE_READ"/>
>             <transaction mode="BATCH"/>
>             <file-store/>
>         </distributed-cache>
>     </cache-container>
> </subsystem>
> ...
> <protocol type="pbcast.NAKACK2">
>     <property name="use_mcast_xmit">false</property>
>     <property name="xmit_table_num_rows">200</property>
> </protocol>
>
> Hence, do you have any idea why this is happening and how to update my
> configuration to solve this issue?
>
>
> --
> *Damien Douteaux*
> _______________________________________________
> keycloak-user mailing list
> keycloak-user at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/keycloak-user
>

