Thanks Marek. I think I'm really working on two issues here. As you say, the initial
state transfer grows, and if I were to add a new node to the cluster to scale
for load, it would need to be able to copy the state. I'm definitely looking at
monitoring the caches. I found that I can use jboss-cli to read info about each cache (and
turn statistics on and off, although the docs say that statistics adversely affect
performance, so they shouldn't be on all the time). One thing I did notice is that since we
had the problem and I found the jboss-cli command, each time I've called it to check,
the number of entries for the session cache has been 0. Admittedly I've only
checked it a few times, but I would have thought there would be some
entries. I'll look at JMX as well, as the jboss-cli output isn't in a standard format,
which makes it less than straightforward to parse.
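A rough sketch of one way to make the jboss-cli output parseable: the CLI prints results in a JSON-like notation that uses `=>` between keys and values, so a small script can rewrite it into JSON. The sample text below is illustrative rather than captured output, and the attribute names in it are placeholders. The conversion only covers simple cases: it would mangle string values that themselves contain ` => `, and it ignores the less common value types.

```python
import json
import re


def dmr_to_dict(text: str) -> dict:
    """Convert simple jboss-cli (DMR-style) output to a Python dict.

    Handles only the common cases: `=>` separators, `undefined` values,
    and long integers written with a trailing `L`.
    """
    # "key" => value  becomes  "key": value
    converted = text.replace(" => ", ": ")
    # Unset attributes are printed as the bare word `undefined`
    converted = re.sub(r":\s*undefined\b", ": null", converted)
    # Long values are printed like 12345L; drop the suffix
    converted = re.sub(r":\s*(-?\d+)L\b", r": \1", converted)
    return json.loads(converted)


# Illustrative sample; attribute names are placeholders, not real output.
SAMPLE = '''{
    "outcome" => "success",
    "result" => {
        "number-of-entries" => 0,
        "some-long-counter" => 12345L,
        "some-unset-attribute" => undefined
    }
}'''

result = dmr_to_dict(SAMPLE)["result"]
print(result["number-of-entries"])
```

Once the output is a plain dict, it can be logged or fed to whatever monitoring is already in place, which may be enough until JMX is set up.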
The second issue is that we have 3 nodes in the cluster, and without setting the owners to
at least 2 we were getting all sorts of strange things happening: the session expiry page
kept appearing, as well as the expected missing sessions when a request comes in to
a different server. That was why I was thinking that the replicated cache might be the way to
go, so that each node has all the session info etc.
What I wasn't sure of was if there was a reason it was specifically a distributed
cache, and if I could/should change it. I'll have a go at changing it and see what
happens! :)
Matt
-----Original Message-----
From: Marek Posolda [mailto:mposolda@redhat.com]
Sent: Tuesday, 5 September 2017 4:20 PM
To: Matt Evans <mevans(a)aconex.com>; Meissa M'baye Sakho
<msakho(a)redhat.com>
Cc: keycloak-user(a)lists.jboss.org
Subject: Re: [keycloak-user] Keycloak node cannot join cluster, initial state transfer
timed out
On 05/09/17 01:48, Matt Evans wrote:
> Yes I've been digging into the infinispan docs :) You're
> right, from what I gather, the default timeout for the initial state transfer is 4
> minutes, I would have thought that would have to be a lot of sessions to transfer for it
> to take longer than 4 mins. Now looking at how to view statistics on the caches to monitor
> this stuff.
There is something available through JMX. You can connect with jconsole
and see some statistics. Statistics may need to be enabled for the infinispan caches (again,
see the docs for details). There may be other ways to monitor this, but this one is likely the
easiest to start with.
> I was wondering why the standalone-ha caches are using distributed caches and are
> configured with 1 owner, is this because it assumes session affinity for connections from
> the load balancer? Does it make more sense if the load balancers are not using session
> affinity for the caches to be replicated caches rather than distributed caches?
Distributed with 1 owner is there to save memory. And yes, there is some session
affinity support in latest master. You can try to add 2 or more owners or use a replicated
cache if you need failover (e.g. with just 1 owner, after a node is killed or restarted its user
sessions are lost and users need to re-authenticate). However,
state transfer will probably take even more time if you increase the number of owners or
reconfigure the cache to be replicated. You can try and see.
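To make the memory trade-off between owners and replication concrete, here is a back-of-the-envelope sketch. It assumes entries are spread evenly across nodes, which real hashing only approximates, and the session count is a made-up figure.

```python
def entries_per_node(total_entries: int, nodes: int, owners: int) -> float:
    """Average number of entries each node holds in a distributed cache.

    Each entry is stored on `owners` nodes (capped at the cluster size),
    so the cluster holds total_entries * copies, shared by `nodes`.
    """
    copies = min(owners, nodes)
    return total_entries * copies / nodes


def replicated_entries_per_node(total_entries: int) -> int:
    """In a replicated cache every node holds every entry."""
    return total_entries


sessions = 90_000  # hypothetical number of user sessions
for owners in (1, 2, 3):
    print(owners, entries_per_node(sessions, nodes=3, owners=owners))
print("replicated", replicated_entries_per_node(sessions))
```

With 3 nodes, 1 owner leaves each node holding about a third of the sessions, 2 owners about two thirds, and 3 owners the same as a replicated cache. A joining node has to receive roughly its per-node share, which is why more owners tends to mean a longer initial state transfer.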
Marek
Matt
-----Original Message-----
From: Marek Posolda [mailto:mposolda@redhat.com]
Sent: Tuesday, 5 September 2017 1:44 AM
To: Matt Evans <mevans(a)aconex.com>; Meissa M'baye Sakho
<msakho(a)redhat.com>
Cc: keycloak-user(a)lists.jboss.org
Subject: Re: [keycloak-user] Keycloak node cannot join cluster, initial state transfer
timed out
I think that you were right. Your cache is too big; it likely contains many user
sessions, so the initial state transfer took quite a long time. Maybe over the weekend most
people logged out, so the state transfer was able to finish in time...
It's possible to increase the timeout for the state transfer (I think it's 240
seconds by default, but I'm not 100% sure). It would be good to check the infinispan documentation
and the documentation for the wildfly infinispan subsystem, which should provide more details.
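As a very rough way to reason about whether a transfer can fit inside the timeout, the sketch below divides an assumed payload by an assumed throughput. All three inputs are hypothetical, the 240 second figure is the unverified default mentioned above, and real state transfer also depends on serialization, chunking and cluster load, so treat this as an order-of-magnitude check only.

```python
def transfer_seconds(entries: int, bytes_per_entry: int, bytes_per_second: float) -> float:
    """Naive estimate: total payload divided by sustained throughput."""
    return entries * bytes_per_entry / bytes_per_second


TIMEOUT_SECONDS = 240  # default suggested in the thread, not verified

# All three inputs are hypothetical; measure your own before relying on this.
estimate = transfer_seconds(
    entries=500_000,
    bytes_per_entry=4_096,
    bytes_per_second=5_000_000,
)
print(f"estimated {estimate:.0f}s, fits in timeout: {estimate < TIMEOUT_SECONDS}")
```

With these made-up numbers the estimate comes to about 410 seconds, well over the timeout, which would be consistent with a join that only succeeds once the cache has shrunk.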
Marek
On 04/09/17 04:40, Matt Evans wrote:
> Strangely, it seems to have fixed itself over the weekend. I came to look at it this
> morning and the new node successfully retrieved the initial state data. I've not made
> any changes to configuration etc.
>
> I'd still like to know why it was happening and how to prevent it though.
>
> Matt
>
>
> -----Original Message-----
> From: keycloak-user-bounces(a)lists.jboss.org
> [mailto:keycloak-user-bounces@lists.jboss.org] On Behalf Of Matt Evans
> Sent: Saturday, 2 September 2017 7:47 AM
> To: Meissa M'baye Sakho <msakho(a)redhat.com>
> Cc: keycloak-user(a)lists.jboss.org
> Subject: Re: [keycloak-user] Keycloak node cannot join cluster,
> initial state transfer timed out
>
> No, I just start up keycloak and run standalone ha. There's no mention
> of that property in the keycloak docs about clustering
>
> Matt
>
> ________________________________
> From: Meissa M'baye Sakho <msakho(a)redhat.com>
> Sent: Saturday, September 2, 2017 12:53:35 AM
> To: Matt Evans
> Cc: keycloak-user(a)lists.jboss.org
> Subject: Re: [keycloak-user] Keycloak node cannot join cluster,
> initial state transfer timed out
>
> Matt,
> How did you add your new node?
> Have you defined the jboss.node.name property in your new node?
> Meissa
>
> On Fri, Sep 1, 2017 at 6:31 AM, Matt Evans <mevans@aconex.com> wrote:
> We're running keycloak clustered with standalone-ha.xml, and it's been
> working fine.
>
> We changed the 'owners' of the distributed caches for session, loginFailures
> etc to 2 so that it will distribute those caches across the 2 nodes in the cluster.
>
> Now, when I remove a node and add a new node, the new node fails to start some of the
> services, due to:
>
> org.infinispan.commons.CacheException: Initial state transfer timed
> out for cache sessions on xxxx
>
> Is this because it's actually taking too long to fetch the initial cache data
> from the other node? Is it due to the size of the cache, or some other issue?
>
> What can I do to address this so that I can add the node back into the cluster?
>
> I'm not experienced at all in infinispan or jgroups, so any pointers on how to
> query the servers to see what's in the caches, and how to see what's actually happening,
> will be appreciated!
>
> Thanks
>
> Matt
> _______________________________________________
> keycloak-user mailing list
> keycloak-user@lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/keycloak-user