Dear Keycloak users (and devs),
This morning, we faced a production level issue on our Keycloak Cluster,
running in a 3-node formation on EC2. Symptoms were a high failure rate
of requests (> 20%) and high latency (> 10 seconds). We are currently
trying to figure out what went wrong. We would appreciate it if someone
with knowledge op JGroups / Inifinispan could chime in with a working
hypothesis.
About priority: we are currently running nominally. We have brought down
two of the instances and brought up two new instances. The cluster is
working again as expected.
Below I'll give information about our setup, the relevant log-messages
and links to some screenshots of our monitoring:
EC2 instances are C4.Large (3x)
Keycloak Version 3.1.0-FINAL
Normal CPU usage is around 1% or less. It peaked to 16% during the incident.
Memory usage is normal.
Screenshots:
- datadog statistics of our services calling keycloak:
https://ibb.co/dsDTKv
- AWS EC2 instance Cloudwatch statistics: network out rate (in bytes per
MINUTE):
https://ibb.co/j8jhCF
- AWS EC2 instance Cloudwatch statistics: network in rate (in bytes per
MINUTE):
https://ibb.co/ggLuRa
Log lines, just before failure (to help reduce clutter, I've removed the
date and replaced the IP addresses with "IP-A", "IP-B",
"IP-C"). The
last message is repeated around 500.000 times in the span of around 1
minute.
---------------------->%-----------------------
05:09:23,925 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-17,ee,ip-B) ISPN000094: Received new cluster view for channel
server: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,926 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-17,ee,ip-B) ISPN000094: Received new cluster view for channel
keycloak: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,926 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-17,ee,ip-B) ISPN000094: Received new cluster view for channel
web: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,926 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-17,ee,ip-B) ISPN000094: Received new cluster view for channel
ejb: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,928 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-17,ee,ip-B) ISPN000094: Received new cluster view for channel
hibernate: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,990 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-A) ISPN000094: Received new cluster view for channel
server: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,990 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-A) ISPN000094: Received new cluster view for channel
keycloak: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,991 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-A) ISPN000094: Received new cluster view for channel
web: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,992 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-A) ISPN000094: Received new cluster view for channel
hibernate: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,992 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-A) ISPN000094: Received new cluster view for channel
ejb: [ip-A|3] (2) [ip-A, ip-B]
05:09:23,996 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t18)
ISPN000310: Starting cluster-wide rebalance for cache authorization,
topology CacheTopology{id=6, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A: 54+26, ip-B:
26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A:
40+40, ip-B: 40+40]}, unionCH=null, actualMembers=[ip-A, ip-B]}
05:09:24,001 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t18)
ISPN000310: Starting cluster-wide rebalance for cache sessions, topology
CacheTopology{id=6, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A: 54+26, ip-B:
26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A:
40+40, ip-B: 40+40]}, unionCH=null, actualMembers=[ip-A, ip-B]}
05:09:24,004 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t18)
ISPN000310: Starting cluster-wide rebalance for cache offlineSessions,
topology CacheTopology{id=6, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A: 54+26, ip-B:
26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A:
40+40, ip-B: 40+40]}, unionCH=null, actualMembers=[ip-A, ip-B]}
05:09:24,014 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t18)
ISPN000310: Starting cluster-wide rebalance for cache loginFailures,
topology CacheTopology{id=6, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A: 54+26, ip-B:
26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners = (2)[ip-A:
40+40, ip-B: 40+40]}, unionCH=null, actualMembers=[ip-A, ip-B]}
05:09:24,027 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t130)
ISPN000336: Finished cluster-wide rebalance for cache sessions, topology
id = 6
05:09:24,028 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t130)
ISPN000336: Finished cluster-wide rebalance for cache offlineSessions,
topology id = 6
05:09:24,029 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t131)
ISPN000336: Finished cluster-wide rebalance for cache loginFailures,
topology id = 6
05:09:24,029 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t132)
ISPN000336: Finished cluster-wide rebalance for cache authorization,
topology id = 6
05:09:33,567 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-C) ISPN000093: Received new, MERGED cluster view for
channel server: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,569 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-C) ISPN000093: Received new, MERGED cluster view for
channel keycloak: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2
subgroups: [ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,569 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-C) ISPN000093: Received new, MERGED cluster view for
channel web: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,573 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-C) ISPN000093: Received new, MERGED cluster view for
channel ejb: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,575 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-1,ee,ip-C) ISPN000093: Received new, MERGED cluster view for
channel hibernate: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2
subgroups: [ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,521 WARN [org.jgroups.protocols.pbcast.NAKACK]
(Incoming-19,ee,ip-B) JGRP000011: ip-B: dropped message 54375 from
non-member ip-C (view=[ip-A|3] (2) [ip-A, ip-B]) Warning
05:09:33,527 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-20,ee,ip-B) ISPN000093: Received new, MERGED cluster view for
channel server: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,529 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-20,ee,ip-B) ISPN000093: Received new, MERGED cluster view for
channel keycloak: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2
subgroups: [ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,529 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-20,ee,ip-B) ISPN000093: Received new, MERGED cluster view for
channel web: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,530 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-20,ee,ip-B) ISPN000093: Received new, MERGED cluster view for
channel ejb: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,533 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-20,ee,ip-B) ISPN000093: Received new, MERGED cluster view for
channel hibernate: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2
subgroups: [ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,518 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-7,ee,ip-A) ISPN000093: Received new, MERGED cluster view for
channel server: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,525 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-7,ee,ip-A) ISPN000093: Received new, MERGED cluster view for
channel keycloak: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2
subgroups: [ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,525 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-7,ee,ip-A) ISPN000093: Received new, MERGED cluster view for
channel web: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,527 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-7,ee,ip-A) ISPN000093: Received new, MERGED cluster view for
channel hibernate: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2
subgroups: [ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,527 INFO
[org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-7,ee,ip-A) ISPN000093: Received new, MERGED cluster view for
channel ejb: MergeView::[ip-A|4] (3) [ip-A, ip-B, ip-C], 2 subgroups:
[ip-A|2] (3) [ip-A, ip-C, ip-B], [ip-A|3] (2) [ip-A, ip-B]
05:09:33,577 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t12)
ISPN000310: Starting cluster-wide rebalance for cache authorization,
topology CacheTopology{id=10, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (3)[ip-A: 27+53, ip-C:
27+53, ip-B: 26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners =
(3)[ip-A: 27+53, ip-B: 26+54, ip-C: 27+53]}, unionCH=null,
actualMembers=[ip-A, ip-B, ip-C]}
05:09:33,579 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t12)
ISPN000310: Starting cluster-wide rebalance for cache offlineSessions,
topology CacheTopology{id=10, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (3)[ip-A: 27+53, ip-C:
27+53, ip-B: 26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners =
(3)[ip-A: 27+53, ip-B: 26+54, ip-C: 27+53]}, unionCH=null,
actualMembers=[ip-A, ip-B, ip-C]}
05:09:33,580 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t12)
ISPN000310: Starting cluster-wide rebalance for cache loginFailures,
topology CacheTopology{id=10, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (3)[ip-A: 27+53, ip-C:
27+53, ip-B: 26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners =
(3)[ip-A: 27+53, ip-B: 26+54, ip-C: 27+53]}, unionCH=null,
actualMembers=[ip-A, ip-B, ip-C]}
05:09:33,582 INFO [org.infinispan.CLUSTER] (transport-thread--p14-t15)
ISPN000310: Starting cluster-wide rebalance for cache sessions, topology
CacheTopology{id=10, rebalanceId=3,
currentCH=DefaultConsistentHash{ns=80, owners = (3)[ip-A: 27+53, ip-C:
27+53, ip-B: 26+54]}, pendingCH=DefaultConsistentHash{ns=80, owners =
(3)[ip-A: 27+53, ip-B: 26+54, ip-C: 27+53]}, unionCH=null,
actualMembers=[ip-A, ip-B, ip-C]}
05:09:33,589 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t130)
ISPN000336: Finished cluster-wide rebalance for cache loginFailures,
topology id = 10
05:09:33,589 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t130)
ISPN000336: Finished cluster-wide rebalance for cache authorization,
topology id = 10
05:09:33,590 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t131)
ISPN000336: Finished cluster-wide rebalance for cache offlineSessions,
topology id = 10
05:09:33,601 INFO [org.infinispan.CLUSTER] (remote-thread--p7-t133)
ISPN000336: Finished cluster-wide rebalance for cache sessions, topology
id = 10
05:09:34,001 ERROR [org.jgroups.protocols.TCP] (Connection.Receiver
[IP-A:35361 - IP-C:7600],ee,ip-A) ip-A: failed submitting DONT_BUNDLE
message to thread pool: java.util.concurrent.RejectedExecutionException:
Task org.jgroups.protocols.TP$SingleMessageHandler@17c86a0c rejected
from java.util.concurrent.ThreadPoolExecutor@52cae691[Running, pool size
= 4, active threads = 4, queued tasks = 100, completed tasks = 1589693].
Msg: RequestCorrelator: id=200, type=REQ, id=607, rsp_expected=true,
FORK: ee:keycloak, NAKACK: [XMIT_RSP, seqno=206], TCP: [cluster_name=ee]
ERROR
05:09:34,010 ERROR [org.jgroups.protocols.TCP] (Connection.Receiver
[IP-A:35361 - IP-C:7600],ee,ip-A) ip-A: failed submitting DONT_BUNDLE
message to thread pool: java.util.concurrent.RejectedExecutionException:
Task org.jgroups.protocols.TP$SingleMessageHandler@4fec3655 rejected
from java.util.concurrent.ThreadPoolExecutor@52cae691[Running, pool size
= 4, active threads = 4, queued tasks = 100, completed tasks = 1589694].
Msg: RequestCorrelator: id=200, type=REQ, id=609, rsp_expected=true,
FORK: ee:keycloak, NAKACK: [XMIT_RSP, seqno=208], TCP: [cluster_name=ee]
ERROR
(last line repeated many, many times)
---------------------->%-----------------------
Infinispan subsystem configuration in standalone-ha.xml:
---------------------->%-----------------------
<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
<cache-container name="keycloak"
jndi-name="infinispan/Keycloak">
<transport lock-timeout="60000"/>
<local-cache name="realms">
<eviction max-entries="10000" strategy="LRU"/>
</local-cache>
<local-cache name="users">
<eviction max-entries="10000" strategy="LRU"/>
</local-cache>
<distributed-cache name="sessions" mode="SYNC"
owners="3"/>
<distributed-cache name="offlineSessions" mode="SYNC"
owners="3"/>
<distributed-cache name="loginFailures" mode="SYNC"
owners="3"/>
<distributed-cache name="authorization" mode="SYNC"
owners="3"/>
<replicated-cache name="work" mode="SYNC"/>
<local-cache name="keys">
<eviction max-entries="1000" strategy="LRU"/>
<expiration max-idle="3600000"/>
</local-cache>
</cache-container>
<cache-container name="server" aliases="singleton cluster"
default-cache="default" module="org.wildfly.clustering.server">
<transport lock-timeout="60000"/>
<replicated-cache name="default" mode="SYNC">
<transaction mode="BATCH"/>
</replicated-cache>
</cache-container>
<cache-container name="web" default-cache="dist"
module="org.wildfly.clustering.web.infinispan">
<transport lock-timeout="60000"/>
<distributed-cache name="dist" mode="ASYNC"
l1-lifespan="0"
owners="2">
<locking isolation="REPEATABLE_READ"/>
<transaction mode="BATCH"/>
<file-store/>
</distributed-cache>
</cache-container>
<cache-container name="ejb" aliases="sfsb"
default-cache="dist"
module="org.wildfly.clustering.ejb.infinispan">
<transport lock-timeout="60000"/>
<distributed-cache name="dist" mode="ASYNC"
l1-lifespan="0"
owners="2">
<locking isolation="REPEATABLE_READ"/>
<transaction mode="BATCH"/>
<file-store/>
</distributed-cache>
</cache-container>
<cache-container name="hibernate"
default-cache="local-query"
module="org.hibernate.infinispan">
<transport lock-timeout="60000"/>
<local-cache name="local-query">
<eviction strategy="LRU" max-entries="10000"/>
<expiration max-idle="100000"/>
</local-cache>
<invalidation-cache name="entity" mode="SYNC">
<transaction mode="NON_XA"/>
<eviction strategy="LRU" max-entries="10000"/>
<expiration max-idle="100000"/>
</invalidation-cache>
<replicated-cache name="timestamps" mode="ASYNC"/>
</cache-container>
</subsystem>
---------------------->%-----------------------
JGroups subsystem configuration:
---------------------->%-----------------------
<subsystem xmlns="urn:jboss:domain:jgroups:4.0"
default-stack="tcp">
<channels default="ee">
<channel name="ee" stack="tcp"/>
</channels>
<stacks>
<stack name="tcp">
<transport type="TCP" socket-binding="jgroups-tcp"/>
<protocol type="S3_PING">
<property name="access_key">S3AccessKey</property>
<property
name="secret_access_key">S3SecretAccessKey</property>
<property name="location">S3PingBucketName</property>
</protocol>
<protocol type="MERGE2"/>
<protocol type="FD_SOCK"
socket-binding="jgroups-tcp-fd"/>
<protocol type="FD"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK"/>
<protocol type="UNICAST2"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="UFC"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
<stack name="udp">
<transport type="UDP" socket-binding="jgroups-udp"/>
<protocol type="PING"/>
<protocol type="MERGE3"/>
<protocol type="FD_SOCK"
socket-binding="jgroups-udp-fd"/>
<protocol type="FD_ALL"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="UFC"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
</stack>
</stacks>
</subsystem>
---------------------->%-----------------------
with kind regards,
Edwin de Jong
-- Simacan B.V. Data Engineer