[keycloak-user] Keycloak node cannot join cluster, initial state transfer timed out

Tue Sep 5 03:20:30 EDT 2017

Thanks Marek. I think I'm really working on two issues here. As you say, the initial state transfer grows, and if, for example, I were to add a new node to the cluster to scale for load, it needs to be able to copy the state. I'm definitely looking at monitoring the caches. I found that I can use jboss-cli to read info about each cache (and turn statistics on and off, although the docs say that statistics adversely affect performance, so they shouldn't be on all the time). One thing I did notice is that, since we had the problem and I found the jboss-cli command, each time I've called it the number of entries for the session cache has been 0. Admittedly I've only checked a few times, but I would have thought there would be some entries. I'll look at JMX also, as the jboss-cli output isn't in a standard format, which makes it less than straightforward to parse.
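
On the parsing point: the jboss-cli output is JSON-like text that uses `=>` between keys and values, so a small conversion may be enough to read it from a script. Below is a minimal Python sketch, assuming only simple values (strings, numbers, booleans, `undefined`, and longs with an `L` suffix). The function name `cli_output_to_dict` is hypothetical, and the sample text and its attribute names are illustrative rather than captured from a real server.

```python
import json
import re

# Matches, in order of preference: a quoted string (left untouched),
# the "=>" separator, the bare word undefined, or a long literal like 1234L.
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|=>|\bundefined\b|\b(\d+)L\b')


def cli_output_to_dict(text):
    """Convert simple jboss-cli style output into a Python dict."""
    def swap(match):
        token = match.group(0)
        if token.startswith('"'):
            return token           # keep string contents as they are
        if token == "=>":
            return ":"             # JSON key/value separator
        if token == "undefined":
            return "null"          # JSON has no "undefined"
        return match.group(1)      # long literal: drop the L suffix

    return json.loads(_TOKEN.sub(swap, text))


# Illustrative sample only; a real server may report different attributes.
sample = '''
{
    "outcome" => "success",
    "result" => {
        "number-of-entries" => 0,
        "elapsed-time" => 1234L,
        "hit-ratio" => undefined
    }
}
'''

stats = cli_output_to_dict(sample)["result"]
print(stats["number-of-entries"])
```

Other value types that the CLI can print (for example expressions or byte arrays) are not handled here and would need extra cases.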

The second issue is that we have 3 nodes in the cluster, and without setting the owners to at least 2 we were seeing all sorts of strange things happen: the session expiry page kept appearing, and, as expected, sessions went missing whenever a request came in to a server other than the one holding the session. That was why I was thinking that a replicated cache might be the way to go, so that each node has all the session info etc.

What I wasn't sure of was whether there was a reason it was specifically a distributed cache, and whether I could/should change it. I'll have a go at changing it and see what happens! :)

Matt

-----Original Message-----
From: Marek Posolda [mailto:mposolda at redhat.com] 
Sent: Tuesday, 5 September 2017 4:20 PM
To: Matt Evans <mevans at aconex.com>; Meissa M'baye Sakho <msakho at redhat.com>
Cc: keycloak-user at lists.jboss.org
Subject: Re: [keycloak-user] Keycloak node cannot join cluster, initial state transfer timed out

On 05/09/17 01:48, Matt Evans wrote:
> Yes, I've been digging into the Infinispan docs :) You're right: from what I gather, the default timeout for the initial state transfer is 4 minutes. I would have thought it would take a lot of sessions for the transfer to take longer than 4 minutes. I'm now looking at how to view statistics on the caches to monitor this.
There is something available through JMX. You can connect with jconsole and see some statistics. Statistics may need to be enabled for the Infinispan caches (again, see the docs for details). There may be other ways to monitor this, but this one is likely the easiest to start with.
>
> I was wondering why the standalone-ha caches are distributed caches configured with 1 owner. Is this because it assumes session affinity for connections from the load balancer? If the load balancers are not using session affinity, does it make more sense for the caches to be replicated rather than distributed?
Distributed with 1 owner is there to save memory. And yes, there is some session affinity support in latest master. You can try adding 2 or more owners, or use a replicated cache, if you need failover (e.g. with just 1 owner, after a node is killed or restarted its user sessions are lost and users need to re-authenticate). However, state transfer will probably take even more time if you increase the number of owners or reconfigure the cache to be replicated. You can try and see.
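
As a rough illustration of the memory and state-transfer trade-off described above, the following Python sketch estimates how many entries each node ends up holding. It assumes entries are spread perfectly evenly across nodes, which real hashing only approximates. The function name `entries_per_node` and the numbers in the example are hypothetical.

```python
def entries_per_node(total_entries, nodes, owners=None):
    """Approximate number of cache entries stored on each node.

    owners=None models a replicated cache, where every node holds
    every entry. Otherwise each entry is kept on `owners` nodes
    (capped at the cluster size) and spread evenly.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if owners is None:
        return total_entries
    copies = min(owners, nodes)
    return total_entries * copies / nodes


# Hypothetical example: 90,000 sessions on a 3-node cluster.
for label, owners in (("owners=1", 1), ("owners=2", 2), ("replicated", None)):
    print(label, entries_per_node(90_000, 3, owners))
```

Under these assumptions, a node joining the cluster has to receive roughly that many entries during its initial state transfer, which is why more owners or a replicated cache can lengthen the transfer.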

Marek
>
> Matt
>
>
> -----Original Message-----
> From: Marek Posolda [mailto:mposolda at redhat.com]
> Sent: Tuesday, 5 September 2017 1:44 AM
> To: Matt Evans <mevans at aconex.com>; Meissa M'baye Sakho <msakho at redhat.com>
> Cc: keycloak-user at lists.jboss.org
> Subject: Re: [keycloak-user] Keycloak node cannot join cluster, initial state transfer timed out
>
> I think that you were right. Your cache is too big; it likely contains many user sessions, so the initial state transfer took quite a long time. Maybe during the weekend most people were logged out, hence the state transfer was able to finish in time...
>
> It's possible to increase the timeout for the state transfer (I think it's 240 seconds by default, but I'm not 100% sure). It would be good to check the Infinispan documentation and the documentation for the WildFly Infinispan subsystem, which should provide more details.
>
> Marek
>
> On 04/09/17 04:40, Matt Evans wrote:
>> Strangely, it seems to have fixed itself over the weekend. I came to look at it this morning and the new node successfully retrieved the initial state data. I've not made any changes to configuration etc.
>>
>> I'd still like to know why it was happening and how to prevent it though.
>>
>> Matt
>>
>>
>> -----Original Message-----
>> From: keycloak-user-bounces at lists.jboss.org
>> [mailto:keycloak-user-bounces at lists.jboss.org] On Behalf Of Matt Evans
>> Sent: Saturday, 2 September 2017 7:47 AM
>> To: Meissa M'baye Sakho <msakho at redhat.com>
>> Cc: keycloak-user at lists.jboss.org
>> Subject: Re: [keycloak-user] Keycloak node cannot join cluster,
>> initial state transfer timed out
>>
>> No, I just start up keycloak and run standalone ha. There's no mention
>> of that property in the keycloak docs about clustering
>>
>> Matt
>>
>> ________________________________
>> From: Meissa M'baye Sakho <msakho at redhat.com>
>> Sent: Saturday, September 2, 2017 12:53:35 AM
>> To: Matt Evans
>> Cc: keycloak-user at lists.jboss.org
>> Subject: Re: [keycloak-user] Keycloak node cannot join cluster,
>> initial state transfer timed out
>>
>> Matt,
>> How did you add your new node?
>> Have you defined the jboss.node.name property in your new node?
>> Meissa
>>
>> On Fri, Sep 1, 2017 at 6:31 AM, Matt Evans <mevans at aconex.com> wrote:
>> We're running keycloak clustered with standalone-ha.xml, and it's been working fine.
>>
>> We changed the 'owners' of the distributed caches for session, loginFailures etc to 2 so that it will distribute those caches across the 2 nodes in the cluster.
>>
>> Now, when I remove a node and add a new node, the new node fails to start some of the services, due to:
>>
>> org.infinispan.commons.CacheException: Initial state transfer timed
>> out for cache sessions on xxxx
>>
>> Is this because it's actually taking too long to fetch the initial cache data from the other node? Is it due to the size of the cache, or some other issue?
>>
>> What can I do to address this so that I can add the node back into the cluster?
>>
>> I'm not experienced at all in Infinispan or JGroups, so any pointers on how to query the servers to see what's in the caches, and how to see what's actually happening, will be appreciated!
>>
>> Thanks
>>
>> Matt
>> _______________________________________________
>> keycloak-user mailing list
>> keycloak-user at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/keycloak-user
>