Let me add +Bela Ban <bban(a)redhat.com> to this thread. Maybe he has an
idea of what happened.
Have you tried adjusting the FD_ALL timeout?
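As a rough sketch of what such a change could look like, assuming the stock WildFly JGroups subsystem that Keycloak ships with: `timeout` and `interval` are existing FD_ALL properties, but the values below are placeholders chosen only to illustrate the syntax, not tuned recommendations.

```xml
<!-- Illustrative values only; adjust for your network. -->
<protocol type="FD_ALL">
    <!-- ms without a heartbeat before a member is suspected -->
    <property name="timeout">60000</property>
    <!-- ms between heartbeats sent by each member -->
    <property name="interval">10000</property>
</protocol>
```

This would take the place of the existing FD_ALL entry in whichever JGroups stack the cluster uses.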
On Wed, Aug 22, 2018 at 6:41 PM Damien Douteaux <damien.douteaux(a)gmail.com>
wrote:
*SUMMARY*
I am currently trying to build an authentication app using Keycloak
deployed as a Docker service. My infrastructure is as follows:
- Server : CentOS 7
- Docker : 17.06.2-ce, with weaveworks net plugin
- Keycloak : 3.3.0-Final
- PostgreSQL : 9.4
- 5 Keycloak instances deployed as a cluster in a Docker swarm
I encounter an issue with the cache when building up the cluster. I do not
get any errors while building a 2-node cluster, but when scaling to 5
nodes, many warnings like this one appear:
WARN [org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-3)
JGRP000041: bd3eeb23695b: message d8896fbba960::14 not found in
retransmission table
When these messages begin to appear, the containers stop responding
correctly and eventually some of them stop their Keycloak instance. This
kind of error has occurred on various occasions:
- When starting the services, in which case the app does not even manage to
start.
- A few hours after a correct start of Keycloak, even with little activity
on the nodes.
*SYMPTOMS*
When the app crashes I see:
1) Numerous log entries like the one shown above that seem to repeat (i.e. the
same messages from a node are reported as not found "forever"):
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in
retransmission table
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in
retransmission table
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in
retransmission table
2018-08-22 09:59:33,346 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in
retransmission table
...
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::15 not found in
retransmission table
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::16 not found in
retransmission table
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::17 not found in
retransmission table
2018-08-22 09:59:33,040 WARN
[org.jboss.as.clustering.jgroups.protocol.NAKACK2] (thread-2)
JGRP000041: bd3eeb23695b: message d8896fbba960::18 not found in
retransmission table
...
2) The node from which the messages should come displays various cache
errors:
2018-08-22 09:58:37,130 ERROR
[org.infinispan.interceptors.InvocationContextInterceptor]
(ServerService Thread Pool -- 61) ISPN000136: Error executing command
PutKeyValueCommand, writing keys [cluster-start-time]:
org.infinispan.util.concurrent.TimeoutException: Replication timeout
2018-08-22 09:58:37,149 ERROR [org.jboss.msc.service.fail]
(ServerService Thread Pool -- 61) MSC000001: Failed to start service
jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth:
org.jboss.msc.service.StartException in service
jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth:
java.lang.RuntimeException: RESTEASY003325: Failed to construct public
org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
2018-08-22 09:58:37,178 ERROR
[org.jboss.as.controller.management-operation] (Controller Boot
Thread) WFLYCTL0013: Operation ("add") failed - address:
([("deployment" => "keycloak-server.war")]) - failure
description:
{"WFLYCTL0080: Failed services" =>
{"jboss.undertow.deployment.default-server.default-host./odino-stif-keycloak-int/auth"
=> "java.lang.RuntimeException: RESTEASY003325: Failed to construct
public
org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
Caused by: java.lang.RuntimeException: RESTEASY003325: Failed to
construct public
org.keycloak.services.resources.KeycloakApplication(javax.servlet.ServletContext,org.jboss.resteasy.core.Dispatcher)
Caused by: org.infinispan.util.concurrent.TimeoutException:
Replication timeout"}}
2018-08-22 09:58:37,409 WARN
[org.infinispan.topology.CacheTopologyControlCommand] (ServerService
Thread Pool -- 60) ISPN000071: Caught exception when handling command
CacheTopologyControlCommand{cache=actionTokens, type=LEAVE,
sender=d8896fbba960, joinInfo=null, topologyId=0, rebalanceId=0,
currentCH=null, pendingCH=null, availabilityMode=null,
actualMembers=null, throwable=null, viewId=3}:
java.lang.IllegalArgumentException: A cache topology's pending
consistent hash must contain all the current consistent hash's members
Then, this node usually stops all caches and Keycloak.
*CONFIG AND SOLUTION ATTEMPTED*
I have unsuccessfully tried to:
- Change timeout parameters on the various Keycloak caches (in order to
give the cluster more time to stabilize).
- Change some default values for the NAKACK2 protocol in the Keycloak
configuration file. The aim was to limit traffic between nodes and
increase the number of elements in the retransmission table so that
messages are not lost before all nodes have received them. However, these
changes did not reduce my issues.
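For illustration only, a cache timeout change of the kind mentioned in the first item might look like the fragment below. It assumes the `remote-timeout` attribute and the `state-transfer` element of the WildFly Infinispan subsystem; the values are hypothetical and are not the settings that were actually tried.

```xml
<!-- Hypothetical values, shown only to illustrate the syntax. -->
<!-- remote-timeout: ms to wait for acknowledgment of a synchronous remote call -->
<distributed-cache name="sessions" mode="SYNC" owners="3" remote-timeout="60000">
    <!-- ms to wait for state from other members when the cache starts -->
    <state-transfer timeout="120000"/>
</distributed-cache>
```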
The configuration I am currently using is the following :
<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
    <cache-container name="keycloak" jndi-name="infinispan/Keycloak">
        <transport lock-timeout="500000"/>
        <local-cache name="realms">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <local-cache name="users">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <distributed-cache name="sessions" mode="SYNC" owners="3"/>
        <distributed-cache name="authenticationSessions" mode="SYNC" owners="3"/>
        <distributed-cache name="offlineSessions" mode="SYNC" owners="1"/>
        <distributed-cache name="loginFailures" mode="SYNC" owners="1"/>
        <local-cache name="authorization">
            <eviction max-entries="10000" strategy="LRU"/>
        </local-cache>
        <replicated-cache name="work" mode="SYNC"/>
        <local-cache name="keys">
            <eviction max-entries="1000" strategy="LRU"/>
            <expiration max-idle="3600000"/>
        </local-cache>
        <distributed-cache name="actionTokens" mode="SYNC" owners="2">
            <eviction max-entries="-1" strategy="NONE"/>
            <expiration max-idle="-1" interval="300000"/>
        </distributed-cache>
    </cache-container>
    ...
    <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
        <transport lock-timeout="300000"/>
        <distributed-cache name="dist">
            <locking isolation="REPEATABLE_READ"/>
            <transaction mode="BATCH"/>
            <file-store/>
        </distributed-cache>
    </cache-container>
</subsystem>
...
<protocol type="pbcast.NAKACK2">
    <property name="use_mcast_xmit">false</property>
    <property name="xmit_table_num_rows">200</property>
</protocol>
Do you have any idea why this is happening and how to update my
configuration to solve this issue?
--
*Damien Douteaux*
_______________________________________________
keycloak-user mailing list
keycloak-user(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/keycloak-user