[keycloak-user] Standalone HA tokens not immediately shared among nodes

Marek Posolda mposolda at redhat.com
Wed Sep 26 05:11:11 EDT 2018


Yes, this is by design.

The UserLookupProvider.getUserById method always needs to be properly
implemented, as shown in our quickstarts. For example:
https://github.com/keycloak/keycloak-quickstarts/tree/latest/user-storage-simple
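
To illustrate, here is a minimal sketch of such a provider. The
ExternalUserService, ExternalUser and MyUserAdapter types are placeholders
for whatever backs your storage provider; the quickstart above contains the
complete, working version. The important part is that getUserById resolves
the user from the external store (via the external id embedded in the
Keycloak id) instead of returning null:

    import org.keycloak.component.ComponentModel;
    import org.keycloak.models.KeycloakSession;
    import org.keycloak.models.RealmModel;
    import org.keycloak.models.UserModel;
    import org.keycloak.storage.StorageId;
    import org.keycloak.storage.UserStorageProvider;
    import org.keycloak.storage.user.UserLookupProvider;

    public class MyUserStorageProvider implements UserStorageProvider, UserLookupProvider {

        private final KeycloakSession session;
        private final ComponentModel model;
        private final ExternalUserService external; // placeholder client for the remote user store

        public MyUserStorageProvider(KeycloakSession session, ComponentModel model,
                                     ExternalUserService external) {
            this.session = session;
            this.model = model;
            this.external = external;
        }

        @Override
        public UserModel getUserById(String id, RealmModel realm) {
            // Ids of federated users have the form "f:<component id>:<external id>".
            // StorageId extracts the external part so the external store can be queried directly.
            String externalId = new StorageId(id).getExternalId();
            ExternalUser user = external.findById(externalId); // placeholder lookup
            return user == null ? null : new MyUserAdapter(session, realm, model, user);
        }

        // getUserByUsername, getUserByEmail, close() and the remaining UserStorageProvider
        // methods are omitted here; see the quickstart for a full implementation.
    }

getUserById is what Keycloak calls when it only has the user id (e.g. from a
token or a cached session), so returning null there breaks exactly the
refresh-token case discussed below.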

I agree that needing to look up the user on multiple cluster nodes isn't
always great. One possibility to avoid this is to use sticky sessions,
which we support. However, if you use an adapter other than keycloak.js,
sticky sessions can't be fully guaranteed, as some requests, like the
"refresh-token" request, are backchannel requests. They are triggered
outside of the browser and can't participate in the original browser
sticky session, so it can happen that they land on another cluster node.
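
To make the backchannel part concrete, here is a rough sketch (an
illustration, not code taken from any adapter) of the kind of refresh-token
call an adapter sends directly to the token endpoint; the host, realm,
client id and secret are placeholders. Because the request goes server to
server, the load balancer never sees the browser's sticky-session cookie
and may route it to any node:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class RefreshTokenCall {

        // Backchannel refresh-token request, sent application-to-Keycloak with no browser involved.
        public static String refresh(String refreshToken) throws Exception {
            String form = "grant_type=refresh_token"
                    + "&client_id=" + URLEncoder.encode("my-app", StandardCharsets.UTF_8)
                    + "&client_secret=" + URLEncoder.encode("my-secret", StandardCharsets.UTF_8)
                    + "&refresh_token=" + URLEncoder.encode(refreshToken, StandardCharsets.UTF_8);

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://keycloak.example.com/auth/realms/myrealm/protocol/openid-connect/token"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(form))
                    .build();

            // Whichever cluster node receives this request has to resolve the user,
            // which is where the getUserById lookup on the "other" node comes from.
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }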

Making caches like "users" replicated would have other consequences, and
in most cases it probably wouldn't help performance anyway, but rather the
opposite (e.g. you no longer need to look up the user on multiple nodes,
but every time a user is cached, the cache entry needs to be replicated
to all the nodes and then invalidated on all the nodes whenever it is
updated, etc.). ATM I don't think that we will support it (but nothing is
set in stone...).

Marek

On 25/09/18 20:55, D V wrote:
> Thanks for the responses, folks. The issue now isn't the inability to 
> set all caches to replicated. It's that a get-user-by-id is called 
> whenever a node has to process an authentication via a refresh token 
> that wasn't also issued by that same node. See the last paragraph of 
> http://lists.jboss.org/pipermail/keycloak-user/2018-September/015549.html 
> . The results are cached, but only on the original issuing node. I was 
> expecting the user-by-id information to be shared between keycloak 
> nodes to avoid external service calls, but perhaps this is by design? 
> If so, could you explain why?
>
> On Tue, Sep 25, 2018 at 9:14 AM Marek Posolda <mposolda at redhat.com> wrote:
>
>     Some more info about our caches:
>     https://www.keycloak.org/docs/latest/server_installation/index.html#cache-configuration
>
>     Not sure if this info should be updated and some more things
>     clarified?
>
>     Marek
>
>     On 25/09/18 15:12, Marek Posolda wrote:
>>     Sorry, I did not read the whole thread.
>>
>>     Just a quick note that the caches "realms", "users", "keys" and
>>     "authorization" are supposed to be local caches. The pattern we're
>>     using ATM is that every cluster node caches its data (realms, users
>>     etc) locally. In case some objects are updated (e.g. a realm or
>>     user), there is a separate cache, "work", which makes sure to notify
>>     the other cluster nodes (or even nodes on all the other DCs), so all
>>     the nodes can invalidate the particular cached object from their
>>     caches.
>>
>>     Caches "realms", "users", "keys" and "authorization" are not
>>     meant to be replicated/distributed, but local. So this
>>     NotSerializableException doesn't look like a bug to me.
>>
>>     Marek
>>
>>     On 25/09/18 12:06, Sebastian Laskawiec wrote:
>>>     Thanks a lot for checking this.
>>>
>>>     This seems like a bug to me, so I filed
>>>     https://issues.jboss.org/browse/KEYCLOAK-8415. Unfortunately, we
>>>     are preparing for some urgent work on the product side and I
>>>     can't promise you when we will be able to look into this. I
>>>     highly encourage you to contribute a fix if you are in a hurry or
>>>     just subscribe to the ticket and wait till we find a free slot
>>>     to get it fixed.
>>>
>>>     Thanks,
>>>     Sebastian
>>>
>>>     On Thu, Sep 20, 2018 at 4:27 PM D V <dv at glyphy.com> wrote:
>>>
>>>         OK. So, with all caches being replicated, there's an error
>>>         on startup:
>>>
>>>         2018-09-20 14:03:38,307 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (ServerService Thread Pool -- 62) ISPN000073: Unexpected error while replicating:
>>>         org.infinispan.commons.marshall.NotSerializableException: org.keycloak.models.PasswordPolicy$Builder
>>>         Caused by: an exception which occurred:
>>>         in field org.keycloak.models.PasswordPolicy.builder
>>>         in object org.keycloak.models.PasswordPolicy@6ab5350d
>>>         in field org.keycloak.models.cache.infinispan.entities.CachedRealm.passwordPolicy
>>>         in object org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
>>>         in object org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
>>>         in object org.infinispan.commands.write.PutKeyValueCommand@fec4dc5e
>>>         in object org.infinispan.commands.remote.SingleRpcCommand@3f2e5d1a
>>>
>>>         If I make the realms cache local but leave the rest
>>>         replicated, I observe the same behaviour: the node that
>>>         didn't issue the original set of refresh/access tokens does
>>>         a getUserById lookup, which in my case results in a network
>>>         call against a remote service.
>>>
>>>         I noticed there are caches running that aren't mentioned in
>>>         the config, like userRevisions. These are local, and adding
>>>         them to the config as replicated doesn't actually make them
>>>         replicated.
>>>
>>>         On Thu, Sep 20, 2018 at 7:36 AM Sebastian Laskawiec
>>>         <slaskawi at redhat.com> wrote:
>>>
>>>             Could you please try to unify the caches? Please replace
>>>             all local-cache and distributed-cache with replicated-cache.
>>>
>>>             Even though using distributed caches instead of replicated
>>>             ones shouldn't be the cause, I think those local caches
>>>             might be causing the behaviour you're describing.
>>>
>>>             On Wed, Sep 19, 2018 at 3:21 PM D V <dv at glyphy.com> wrote:
>>>
>>>                 Makes sense re: replicated caches. Here's my
>>>                 infinispan subsystem config right now:
>>>
>>>                 <subsystem xmlns="urn:jboss:domain:infinispan:4.0">
>>>                   <cache-container name="keycloak" jndi-name="infinispan/Keycloak" statistics-enabled="true">
>>>                     <transport lock-timeout="60000"/>
>>>                     <local-cache name="realms" statistics-enabled="true">
>>>                       <eviction max-entries="10000" strategy="LRU"/>
>>>                     </local-cache>
>>>                     <local-cache name="users" statistics-enabled="true">
>>>                       <eviction max-entries="10000" strategy="LRU"/>
>>>                     </local-cache>
>>>
>>>                     <!--
>>>                     These two need to be replicated or the node that didn't issue the initial refresh token
>>>                     will return "invalid_grant" errors when attempting to auth with that refresh token.
>>>                     -->
>>>                     <replicated-cache name="sessions" statistics-enabled="true"/>
>>>                     <replicated-cache name="clientSessions" statistics-enabled="true"/>
>>>
>>>                     <distributed-cache name="authenticationSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>                     <distributed-cache name="offlineSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>                     <distributed-cache name="offlineClientSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>                     <distributed-cache name="loginFailures" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>                     <local-cache name="authorization" statistics-enabled="true">
>>>                       <eviction max-entries="10000" strategy="LRU"/>
>>>                     </local-cache>
>>>                     <replicated-cache name="work" mode="SYNC" statistics-enabled="true"/>
>>>                     <local-cache name="keys" statistics-enabled="true">
>>>                       <eviction max-entries="1000" strategy="LRU"/>
>>>                       <expiration max-idle="3600000"/>
>>>                     </local-cache>
>>>                     <distributed-cache name="actionTokens" mode="SYNC" owners="2" statistics-enabled="true">
>>>                       <eviction max-entries="-1" strategy="NONE"/>
>>>                       <expiration max-idle="-1" interval="300000"/>
>>>                     </distributed-cache>
>>>                   </cache-container>
>>>                   <cache-container name="server" aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server">
>>>                     <transport lock-timeout="60000"/>
>>>                     <replicated-cache name="default">
>>>                       <transaction mode="BATCH"/>
>>>                     </replicated-cache>
>>>                   </cache-container>
>>>                   <cache-container name="web" default-cache="dist" module="org.wildfly.clustering.web.infinispan">
>>>                     <transport lock-timeout="60000"/>
>>>                     <distributed-cache name="dist">
>>>                       <locking isolation="REPEATABLE_READ"/>
>>>                       <transaction mode="BATCH"/>
>>>                       <file-store/>
>>>                     </distributed-cache>
>>>                   </cache-container>
>>>                   <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
>>>                     <transport lock-timeout="60000"/>
>>>                     <distributed-cache name="dist">
>>>                       <locking isolation="REPEATABLE_READ"/>
>>>                       <transaction mode="BATCH"/>
>>>                       <file-store/>
>>>                     </distributed-cache>
>>>                   </cache-container>
>>>                   <cache-container name="hibernate" default-cache="local-query" module="org.hibernate.infinispan">
>>>                     <transport lock-timeout="60000"/>
>>>                     <local-cache name="local-query">
>>>                       <eviction strategy="LRU" max-entries="10000"/>
>>>                       <expiration max-idle="100000"/>
>>>                     </local-cache>
>>>                     <invalidation-cache name="entity">
>>>                       <transaction mode="NON_XA"/>
>>>                       <eviction strategy="LRU" max-entries="10000"/>
>>>                       <expiration max-idle="100000"/>
>>>                     </invalidation-cache>
>>>                     <replicated-cache name="timestamps" mode="ASYNC"/>
>>>                   </cache-container>
>>>                 </subsystem>
>>>
>>>                 The scenario I'm testing:
>>>                 1. Auth with grant_type=password on node1.
>>>                 2. Shut down node1.
>>>                 3. Auth with grant_type=refresh_token on node2.
>>>
>>>                 When clientSessions is not replicated (distributed,
>>>                 with owners=1, as in the distribution's
>>>                 standalone-ha.xml), I get this on node2:
>>>                 {
>>>                     "error": "invalid_grant",
>>>                     "error_description": "Session doesn't have
>>>                 required client"
>>>                 }
>>>
>>>                 When sessions is not replicated:
>>>                 {
>>>                     "error": "invalid_grant",
>>>                     "error_description": "Session not active"
>>>                 }
>>>
>>>                 On Wed, Sep 19, 2018 at 6:56 AM Sebastian Laskawiec
>>>                 <slaskawi at redhat.com> wrote:
>>>
>>>                     Thanks for letting us know DV!
>>>
>>>                     Setting the number of owners equal to the
>>>                     cluster size doesn't make any sense. You might
>>>                     use a replicated cache in that scenario (which
>>>                     works the same way apart from some Infinispan
>>>                     internal behavior, which can be ignored in your
>>>                     case). Could you please paste your Infinispan
>>>                     configuration? Maybe there's some hint there...
>>>
>>>                     Thanks,
>>>                     Seb
>>>
>>>                     On Tue, Sep 18, 2018 at 11:02 PM D V
>>>                     <dv at glyphy.com> wrote:
>>>
>>>                         The issue was resolved in a somewhat
>>>                         unexpected way. I had a custom
>>>                         org.keycloak.storage.UserStorageProviderFactory
>>>                         SPI registered that returned providers
>>>                         implementing org.keycloak.storage.user.UserLookupProvider,
>>>                         but the org.keycloak.storage.user.UserLookupProvider#getUserById
>>>                         method wasn't really implemented: I just had
>>>                         it return null. It wasn't obvious to me that
>>>                         it was required (or under what
>>>                         circumstances). Once I implemented it, the
>>>                         experiments in my original message passed. I
>>>                         did have to set owners to 2 for the
>>>                         "sessions" and "clientSessions" distributed
>>>                         cache infinispan configs.
>>>
>>>                         One thing I noticed is that node2 (the one
>>>                         that doesn't get hit on the initial password
>>>                         auth) has to do a lookup via getUserById the
>>>                         first time it handles a
>>>                         grant_type=refresh_token auth. Is the data
>>>                         it needs not shared across the cluster? It
>>>                         seems to be cached only locally on the node.
>>>                         Just as a test I tried to set all configured
>>>                         non-local caches to be replicated and it
>>>                         didn't help. Any thoughts about this?
>>>
>>>                         Thanks,
>>>                         DV
>>>
>>
>


