[keycloak-user] Standalone HA tokens not immediately shared among nodes

D V dv at glyphy.com
Thu Sep 20 10:26:57 EDT 2018


OK. So, with all caches being replicated, there's an error on startup:

2018-09-20 14:03:38,307 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl]
(ServerService Thread Pool -- 62) ISPN000073: Unexpected error while
replicating: org.infinispan.commons.marshall.NotSerializableException:
org.keycloak.models.PasswordPolicy$Builder
Caused by: an exception which occurred:
in field org.keycloak.models.PasswordPolicy.builder
in object org.keycloak.models.PasswordPolicy@6ab5350d
in field org.keycloak.models.cache.infinispan.entities.CachedRealm.passwordPolicy
in object org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
in object org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
in object org.infinispan.commands.write.PutKeyValueCommand@fec4dc5e
in object org.infinispan.commands.remote.SingleRpcCommand@3f2e5d1a

If I make the realms cache local but leave the rest replicated, I observe
the same behaviour: the node that didn't issue the original set of
refresh/access tokens does a getUserById lookup, which in my case results
in a network call against a remote service.
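
For reference, that lookup comes in through the user storage SPI. A minimal
sketch of a provider implementing it, following the pattern from the Keycloak
user-storage quickstart, looks roughly like this (RemoteUserStorageProvider
and RemoteDirectoryClient are placeholder names, not my actual code):

import org.keycloak.component.ComponentModel;
import org.keycloak.models.KeycloakSession;
import org.keycloak.models.RealmModel;
import org.keycloak.models.UserModel;
import org.keycloak.storage.StorageId;
import org.keycloak.storage.UserStorageProvider;
import org.keycloak.storage.adapter.AbstractUserAdapter;
import org.keycloak.storage.user.UserLookupProvider;

public class RemoteUserStorageProvider implements UserStorageProvider, UserLookupProvider {

    /** Placeholder for the remote user service client; returns null for unknown users. */
    public interface RemoteDirectoryClient {
        String findUsernameById(String externalId);
        String findIdByUsername(String username);
    }

    private final KeycloakSession session;
    private final ComponentModel model;
    private final RemoteDirectoryClient remote;

    public RemoteUserStorageProvider(KeycloakSession session, ComponentModel model,
                                     RemoteDirectoryClient remote) {
        this.session = session;
        this.model = model;
        this.remote = remote;
    }

    // Keycloak calls this with its federated id format ("f:<component-id>:<external-id>"),
    // e.g. when a node that didn't issue the tokens has to rehydrate the user for a
    // refresh_token grant. This is the network call mentioned above.
    @Override
    public UserModel getUserById(String id, RealmModel realm) {
        String externalId = StorageId.externalId(id);
        String username = remote.findUsernameById(externalId);
        return username == null ? null : createAdapter(realm, username);
    }

    @Override
    public UserModel getUserByUsername(String username, RealmModel realm) {
        String externalId = remote.findIdByUsername(username);
        return externalId == null ? null : createAdapter(realm, username);
    }

    @Override
    public UserModel getUserByEmail(String email, RealmModel realm) {
        // Lookups by e-mail are not supported by the remote service in this sketch.
        return null;
    }

    private UserModel createAdapter(RealmModel realm, String username) {
        // Read-only adapter; a real provider would map more attributes here.
        return new AbstractUserAdapter(session, realm, model) {
            @Override
            public String getUsername() {
                return username;
            }
        };
    }

    @Override
    public void close() {
        // Nothing to clean up in this sketch.
    }
}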

I noticed there are caches running that aren't mentioned in the config,
like userRevisions. These are local, and adding them to the config as
replicated doesn't actually make them replicated.

On Thu, Sep 20, 2018 at 7:36 AM Sebastian Laskawiec <slaskawi at redhat.com>
wrote:

> Could you please try to unify the caches? Please replace all local-cache
> and distributed-cache with replicated-cache.
>
> Even though using distributed caches rather than replicated ones shouldn't
> be the cause, I think those local caches might be causing the behavior
> you're describing.
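>
> For example, the realms definition would just switch cache type while
> keeping the same attributes (a sketch of what I mean, based on your config
> below rather than anything authoritative):
>
>     <replicated-cache name="realms" statistics-enabled="true">
>         <eviction max-entries="10000" strategy="LRU"/>
>     </replicated-cache>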
>
> On Wed, Sep 19, 2018 at 3:21 PM D V <dv at glyphy.com> wrote:
>
>> Makes sense re: replicated caches. Here's my infinispan subsystem config
>> right now:
>>
>>         <subsystem xmlns="urn:jboss:domain:infinispan:4.0">
>>             <cache-container name="keycloak" jndi-name="infinispan/Keycloak" statistics-enabled="true">
>>                 <transport lock-timeout="60000"/>
>>                 <local-cache name="realms" statistics-enabled="true">
>>                     <eviction max-entries="10000" strategy="LRU"/>
>>                 </local-cache>
>>                 <local-cache name="users" statistics-enabled="true">
>>                     <eviction max-entries="10000" strategy="LRU"/>
>>                 </local-cache>
>>
>>                 <!--
>>                 These two need to be replicated or the node that didn't issue the initial
>>                 refresh token will return "invalid_grant" errors when attempting to auth
>>                 with that refresh token.
>>                 -->
>>                 <replicated-cache name="sessions" statistics-enabled="true"/>
>>                 <replicated-cache name="clientSessions" statistics-enabled="true"/>
>>
>>                 <distributed-cache name="authenticationSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>                 <distributed-cache name="offlineSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>                 <distributed-cache name="offlineClientSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>                 <distributed-cache name="loginFailures" mode="SYNC" owners="1" statistics-enabled="true"/>
>>                 <local-cache name="authorization" statistics-enabled="true">
>>                     <eviction max-entries="10000" strategy="LRU"/>
>>                 </local-cache>
>>                 <replicated-cache name="work" mode="SYNC" statistics-enabled="true"/>
>>                 <local-cache name="keys" statistics-enabled="true">
>>                     <eviction max-entries="1000" strategy="LRU"/>
>>                     <expiration max-idle="3600000"/>
>>                 </local-cache>
>>                 <distributed-cache name="actionTokens" mode="SYNC" owners="2" statistics-enabled="true">
>>                     <eviction max-entries="-1" strategy="NONE"/>
>>                     <expiration max-idle="-1" interval="300000"/>
>>                 </distributed-cache>
>>             </cache-container>
>>             <cache-container name="server" aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server">
>>                 <transport lock-timeout="60000"/>
>>                 <replicated-cache name="default">
>>                     <transaction mode="BATCH"/>
>>                 </replicated-cache>
>>             </cache-container>
>>             <cache-container name="web" default-cache="dist" module="org.wildfly.clustering.web.infinispan">
>>                 <transport lock-timeout="60000"/>
>>                 <distributed-cache name="dist">
>>                     <locking isolation="REPEATABLE_READ"/>
>>                     <transaction mode="BATCH"/>
>>                     <file-store/>
>>                 </distributed-cache>
>>             </cache-container>
>>             <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
>>                 <transport lock-timeout="60000"/>
>>                 <distributed-cache name="dist">
>>                     <locking isolation="REPEATABLE_READ"/>
>>                     <transaction mode="BATCH"/>
>>                     <file-store/>
>>                 </distributed-cache>
>>             </cache-container>
>>             <cache-container name="hibernate" default-cache="local-query" module="org.hibernate.infinispan">
>>                 <transport lock-timeout="60000"/>
>>                 <local-cache name="local-query">
>>                     <eviction strategy="LRU" max-entries="10000"/>
>>                     <expiration max-idle="100000"/>
>>                 </local-cache>
>>                 <invalidation-cache name="entity">
>>                     <transaction mode="NON_XA"/>
>>                     <eviction strategy="LRU" max-entries="10000"/>
>>                     <expiration max-idle="100000"/>
>>                 </invalidation-cache>
>>                 <replicated-cache name="timestamps" mode="ASYNC"/>
>>             </cache-container>
>>         </subsystem>
>>
>> The scenario I'm testing:
>> 1. Auth with grant_type=password on node1.
>> 2. Shut down node1.
>> 3. Auth with grant_type=refresh_token on node2.
>>
>> When clientSessions is not replicated (distributed, with owners=1, as in
>> the distribution's standalone-ha.xml), I get this on node2:
>> {
>>     "error": "invalid_grant",
>>     "error_description": "Session doesn't have required client"
>> }
>>
>> When sessions is not replicated:
>> {
>>     "error": "invalid_grant",
>>     "error_description": "Session not active"
>> }
>>
>> On Wed, Sep 19, 2018 at 6:56 AM Sebastian Laskawiec <slaskawi at redhat.com>
>> wrote:
>>
>>> Thanks for letting us know, DV!
>>>
>>> Setting the number of owners equal to the cluster size doesn't make any
>>> sense. You might as well use a replicated cache in that scenario (which
>>> works the same way apart from some Infinispan-internal behavior that can
>>> be ignored in your case). Could you please paste your Infinispan
>>> configuration? Maybe there's some hint there...
>>>
>>> Thanks,
>>> Seb
>>>
>>> On Tue, Sep 18, 2018 at 11:02 PM D V <dv at glyphy.com> wrote:
>>>
>>>> The issue was resolved in a somewhat unexpected way. I had a custom
>>>> org.keycloak.storage.UserStorageProviderFactory SPI registered that returned
>>>> providers implementing org.keycloak.storage.user.UserLookupProvider, but the
>>>> org.keycloak.storage.user.UserLookupProvider#getUserById method wasn't fully
>>>> implemented. I just had it return null. It wasn't obvious to me that it was
>>>> required (or under what circumstances). Once I implemented it, the
>>>> experiments in my original message passed. I did have to set owners to 2 for
>>>> the "sessions" and "clientSessions" distributed cache Infinispan configs.
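>>>>
>>>> For reference, that change amounts to something along these lines in the
>>>> "keycloak" cache-container (a sketch of the owners=2 variant, not my exact
>>>> config):
>>>>
>>>>     <distributed-cache name="sessions" mode="SYNC" owners="2" statistics-enabled="true"/>
>>>>     <distributed-cache name="clientSessions" mode="SYNC" owners="2" statistics-enabled="true"/>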
>>>>
>>>> One thing I noticed is that node2 (the one that doesn't get hit on the
>>>> initial password auth) has to do a lookup via getUserById the first time it
>>>> handles a grant_type=refresh_token auth. Is the data it needs not shared
>>>> across the cluster? It seems to be cached only locally on the node. Just as
>>>> a test I tried to set all configured non-local caches to be replicated and
>>>> it didn't help. Any thoughts about this?
>>>>
>>>> Thanks,
>>>> DV
>>>>

