[keycloak-user] Standalone HA tokens not immediately shared among nodes
Marek Posolda
mposolda at redhat.com
Tue Sep 25 09:14:24 EDT 2018
Some more info about our caches:
https://www.keycloak.org/docs/latest/server_installation/index.html#cache-configuration
Not sure if this info should be updated and some more things
clarified?
Marek
On 25/09/18 15:12, Marek Posolda wrote:
> Sorry, I did not read the whole thread.
>
> Just a quick note: the caches "realms", "users", "keys" and
> "authorization" are supposed to be local caches. The pattern we're
> using ATM is that every cluster node caches its data (realms, users,
> etc.) locally. In case some objects are updated (e.g. a realm or
> user), there is a separate cache, "work", which makes sure to notify
> the other cluster nodes (or even nodes on all the other DCs), so all
> the nodes can invalidate the particular cached object from their caches.
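>
> Just to illustrate the pattern in plain Infinispan terms, something
> like the sketch below (made-up class and cache names, not our actual
> implementation - we build cluster events on top of the "work" cache):
>
>     import org.infinispan.Cache;
>     import org.infinispan.notifications.Listener;
>     import org.infinispan.notifications.cachelistener.annotation.CacheEntryCreated;
>     import org.infinispan.notifications.cachelistener.event.CacheEntryCreatedEvent;
>
>     // The "work" cache is replicated, so a put() on one node is applied on
>     // every node. Each node registers a local listener on it and evicts the
>     // changed object from its own local "realms"/"users" cache.
>     @Listener
>     public class InvalidationListener {
>
>         private final Cache<String, Object> localCache;
>
>         public InvalidationListener(Cache<String, Object> localCache) {
>             this.localCache = localCache;
>         }
>
>         @CacheEntryCreated
>         public void onInvalidationEvent(CacheEntryCreatedEvent<String, String> event) {
>             if (event.isPre()) {
>                 return; // act only after the write is committed
>             }
>             // the event value carries the id of the updated object; drop it
>             // locally so the next read goes back to the database
>             localCache.remove(event.getValue());
>         }
>     }
>
>     // registered once per node, e.g.:
>     //     workCache.addListener(new InvalidationListener(realmsCache));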
>
> Caches "realms", "users", "keys" and "authorization" are not meant to
> be replicated/distributed, but local. So this NotSerializableException
> doesn't look like a bug to me.
>
> Marek
>
> On 25/09/18 12:06, Sebastian Laskawiec wrote:
>> Thanks a lot for checking this.
>>
>> This seems like a bug to me, so I filed
>> https://issues.jboss.org/browse/KEYCLOAK-8415. Unfortunately, we are
>> preparing for some urgent work on the product side and I can't
>> promise when we will be able to look into this. I highly encourage
>> you to contribute a fix if you are in a hurry, or just subscribe to
>> the ticket and wait till we find a free slot to get it fixed.
>>
>> Thanks,
>> Sebastian
>>
>> On Thu, Sep 20, 2018 at 4:27 PM D V <dv at glyphy.com> wrote:
>>
>> OK. So, with all caches being replicated, there's an error on
>> startup:
>>
>> 2018-09-20 14:03:38,307 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (ServerService Thread Pool -- 62) ISPN000073: Unexpected error while replicating: org.infinispan.commons.marshall.NotSerializableException: org.keycloak.models.PasswordPolicy$Builder
>> Caused by: an exception which occurred:
>> in field org.keycloak.models.PasswordPolicy.builder
>> in object org.keycloak.models.PasswordPolicy@6ab5350d
>> in field org.keycloak.models.cache.infinispan.entities.CachedRealm.passwordPolicy
>> in object org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
>> in object org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
>> in object org.infinispan.commands.write.PutKeyValueCommand@fec4dc5e
>> in object org.infinispan.commands.remote.SingleRpcCommand@3f2e5d1a
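>>
>> In case it helps, the same failure can be reproduced with plain Java
>> serialization. A made-up minimal example (not the actual Keycloak
>> classes) fails the same way, because the cached object holds a field
>> whose type isn't Serializable:
>>
>>     import java.io.ByteArrayOutputStream;
>>     import java.io.IOException;
>>     import java.io.ObjectOutputStream;
>>     import java.io.Serializable;
>>
>>     public class NotSerializableDemo {
>>
>>         // stands in for PasswordPolicy: Serializable itself, but holding
>>         // a reference to a nested type that is not
>>         static class Policy implements Serializable {
>>             static class Builder { } // not Serializable
>>             private final Builder builder = new Builder();
>>         }
>>
>>         public static void main(String[] args) throws IOException {
>>             ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
>>             // throws java.io.NotSerializableException: NotSerializableDemo$Policy$Builder
>>             out.writeObject(new Policy());
>>         }
>>     }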
>>
>> If I make the realms cache local but leave the rest replicated, I
>> observe the same behaviour: the node that didn't issue the
>> original set of refresh/access tokens does a getUserById lookup,
>> which in my case results in a network call against a remote service.
>>
>> I noticed there are caches running that aren't mentioned in the
>> config, like userRevisions. These are local, and adding them to the
>> config as replicated doesn't actually make them replicated.
>>
>> On Thu, Sep 20, 2018 at 7:36 AM Sebastian Laskawiec <slaskawi at redhat.com> wrote:
>>
>> Could you please try to unify the caches? Please replace all
>> local-cache and distributed-cache entries with replicated-cache.
>>
>> Even though using distributed caches instead of replicated ones
>> should be the cause, I think those local caches might also be
>> contributing to the behavior you're describing.
>>
>> On Wed, Sep 19, 2018 at 3:21 PM D V <dv at glyphy.com> wrote:
>>
>> Makes sense re: replicated caches. Here's my infinispan
>> subsystem config right now:
>>
>> <subsystem xmlns="urn:jboss:domain:infinispan:4.0">
>>     <cache-container name="keycloak" jndi-name="infinispan/Keycloak" statistics-enabled="true">
>>         <transport lock-timeout="60000"/>
>>         <local-cache name="realms" statistics-enabled="true">
>>             <eviction max-entries="10000" strategy="LRU"/>
>>         </local-cache>
>>         <local-cache name="users" statistics-enabled="true">
>>             <eviction max-entries="10000" strategy="LRU"/>
>>         </local-cache>
>>
>>         <!--
>>             These two need to be replicated or the node that didn't
>>             issue the initial refresh token will return "invalid_grant"
>>             errors when attempting to auth with that refresh token.
>>         -->
>>         <replicated-cache name="sessions" statistics-enabled="true"/>
>>         <replicated-cache name="clientSessions" statistics-enabled="true"/>
>>
>>         <distributed-cache name="authenticationSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>         <distributed-cache name="offlineSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>         <distributed-cache name="offlineClientSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>         <distributed-cache name="loginFailures" mode="SYNC" owners="1" statistics-enabled="true"/>
>>         <local-cache name="authorization" statistics-enabled="true">
>>             <eviction max-entries="10000" strategy="LRU"/>
>>         </local-cache>
>>         <replicated-cache name="work" mode="SYNC" statistics-enabled="true"/>
>>         <local-cache name="keys" statistics-enabled="true">
>>             <eviction max-entries="1000" strategy="LRU"/>
>>             <expiration max-idle="3600000"/>
>>         </local-cache>
>>         <distributed-cache name="actionTokens" mode="SYNC" owners="2" statistics-enabled="true">
>>             <eviction max-entries="-1" strategy="NONE"/>
>>             <expiration max-idle="-1" interval="300000"/>
>>         </distributed-cache>
>>     </cache-container>
>>     <cache-container name="server" aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server">
>>         <transport lock-timeout="60000"/>
>>         <replicated-cache name="default">
>>             <transaction mode="BATCH"/>
>>         </replicated-cache>
>>     </cache-container>
>>     <cache-container name="web" default-cache="dist" module="org.wildfly.clustering.web.infinispan">
>>         <transport lock-timeout="60000"/>
>>         <distributed-cache name="dist">
>>             <locking isolation="REPEATABLE_READ"/>
>>             <transaction mode="BATCH"/>
>>             <file-store/>
>>         </distributed-cache>
>>     </cache-container>
>>     <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
>>         <transport lock-timeout="60000"/>
>>         <distributed-cache name="dist">
>>             <locking isolation="REPEATABLE_READ"/>
>>             <transaction mode="BATCH"/>
>>             <file-store/>
>>         </distributed-cache>
>>     </cache-container>
>>     <cache-container name="hibernate" default-cache="local-query" module="org.hibernate.infinispan">
>>         <transport lock-timeout="60000"/>
>>         <local-cache name="local-query">
>>             <eviction strategy="LRU" max-entries="10000"/>
>>             <expiration max-idle="100000"/>
>>         </local-cache>
>>         <invalidation-cache name="entity">
>>             <transaction mode="NON_XA"/>
>>             <eviction strategy="LRU" max-entries="10000"/>
>>             <expiration max-idle="100000"/>
>>         </invalidation-cache>
>>         <replicated-cache name="timestamps" mode="ASYNC"/>
>>     </cache-container>
>> </subsystem>
>>
>> The scenario I'm testing (rough code sketch below):
>> 1. Auth with grant_type=password on node1.
>> 2. Shut down node1.
>> 3. Auth with grant_type=refresh_token on node2.
>>
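>> In case it's clearer, this is roughly what I run against the token
>> endpoint. Realm name, client and credentials are placeholders, and
>> the refresh-token extraction is deliberately crude:
>>
>>     import java.io.OutputStream;
>>     import java.net.HttpURLConnection;
>>     import java.net.URL;
>>     import java.nio.charset.StandardCharsets;
>>     import java.util.Scanner;
>>
>>     public class RefreshAcrossNodes {
>>
>>         // POST a form-encoded body to a node's OIDC token endpoint and
>>         // return the raw JSON response (success or error)
>>         static String token(String baseUrl, String form) throws Exception {
>>             URL url = new URL(baseUrl + "/auth/realms/demo/protocol/openid-connect/token");
>>             HttpURLConnection con = (HttpURLConnection) url.openConnection();
>>             con.setRequestMethod("POST");
>>             con.setDoOutput(true);
>>             con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
>>             try (OutputStream out = con.getOutputStream()) {
>>                 out.write(form.getBytes(StandardCharsets.UTF_8));
>>             }
>>             try (Scanner s = new Scanner(con.getResponseCode() < 400
>>                     ? con.getInputStream() : con.getErrorStream(), "UTF-8")) {
>>                 return s.useDelimiter("\\A").next();
>>             }
>>         }
>>
>>         public static void main(String[] args) throws Exception {
>>             // step 1: password grant against node1
>>             String json = token("http://node1:8080",
>>                     "grant_type=password&client_id=test-client&username=alice&password=secret");
>>             // crude extraction of the "refresh_token" field from the JSON
>>             String refreshToken = json.replaceAll(".*\"refresh_token\":\"([^\"]+)\".*", "$1");
>>
>>             // step 2: shut down node1, then
>>             // step 3: refresh against node2 -- this is where invalid_grant shows up
>>             System.out.println(token("http://node2:8080",
>>                     "grant_type=refresh_token&client_id=test-client&refresh_token=" + refreshToken));
>>         }
>>     }
>>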
>> When clientSessions is not replicated (distributed, with owners=1,
>> as in the distribution's standalone-ha.xml), I get this on node2:
>> {
>>     "error": "invalid_grant",
>>     "error_description": "Session doesn't have required client"
>> }
>>
>> When sessions is not replicated:
>> {
>>     "error": "invalid_grant",
>>     "error_description": "Session not active"
>> }
>>
>> On Wed, Sep 19, 2018 at 6:56 AM Sebastian Laskawiec <slaskawi at redhat.com> wrote:
>>
>> Thanks for letting us know, DV!
>>
>> Setting the number of owners equal to the cluster size doesn't make
>> any sense. You might as well use a replicated cache in that scenario
>> (which works the same way apart from some Infinispan-internal
>> behavior, which can be ignored in your case). Could you please paste
>> your Infinispan configuration? Maybe there's some hint there...
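>>
>> To illustrate with embedded-Infinispan configuration (a hypothetical
>> snippet, not the WildFly subsystem XML): a distributed cache with
>> owners equal to the cluster size keeps a copy of every entry on every
>> node, which is what a replicated cache does by design:
>>
>>     import org.infinispan.configuration.cache.CacheMode;
>>     import org.infinispan.configuration.cache.Configuration;
>>     import org.infinispan.configuration.cache.ConfigurationBuilder;
>>
>>     public class OwnersVsReplicated {
>>         public static void main(String[] args) {
>>             // distributed cache with owners == cluster size (say, 3 nodes):
>>             // every node ends up owning every entry anyway
>>             Configuration dist = new ConfigurationBuilder()
>>                     .clustering().cacheMode(CacheMode.DIST_SYNC)
>>                     .hash().numOwners(3)
>>                     .build();
>>
>>             // replicated cache: every node owns every entry by design
>>             Configuration repl = new ConfigurationBuilder()
>>                     .clustering().cacheMode(CacheMode.REPL_SYNC)
>>                     .build();
>>         }
>>     }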
>>
>> Thanks,
>> Seb
>>
>> On Tue, Sep 18, 2018 at 11:02 PM D V <dv at glyphy.com> wrote:
>>
>> The issue was resolved in a somewhat unexpected way. I had a custom
>> org.keycloak.storage.UserStorageProviderFactory SPI registered that
>> returned providers implementing
>> org.keycloak.storage.user.UserLookupProvider, but the
>> org.keycloak.storage.user.UserLookupProvider#getUserById method
>> wasn't fully implemented. I just had it return null. It wasn't
>> obvious to me that it was required (or under what circumstances).
>> Once I implemented it, the experiments in my original message
>> passed. I did have to set owners to 2 for the "sessions" and
>> "clientSessions" distributed-cache Infinispan configs.
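>>
>> For reference, a trimmed-down sketch of what I ended up with.
>> RemoteUserClient and RemoteUser are stand-ins for my actual
>> remote-service integration; only the Keycloak types are real:
>>
>>     import org.keycloak.component.ComponentModel;
>>     import org.keycloak.models.KeycloakSession;
>>     import org.keycloak.models.RealmModel;
>>     import org.keycloak.models.UserModel;
>>     import org.keycloak.storage.StorageId;
>>     import org.keycloak.storage.adapter.AbstractUserAdapter;
>>     import org.keycloak.storage.user.UserLookupProvider;
>>
>>     public class RemoteUserLookupProvider implements UserLookupProvider {
>>
>>         private final KeycloakSession session;
>>         private final ComponentModel model;
>>         private final RemoteUserClient client; // hypothetical remote-service client
>>
>>         public RemoteUserLookupProvider(KeycloakSession session, ComponentModel model,
>>                                         RemoteUserClient client) {
>>             this.session = session;
>>             this.model = model;
>>             this.client = client;
>>         }
>>
>>         @Override
>>         public UserModel getUserById(String id, RealmModel realm) {
>>             // Keycloak passes its storage id ("f:<component-id>:<external-id>");
>>             // strip the prefix before querying the external store. Returning
>>             // null here is what broke the refresh flow on the other node.
>>             String externalId = StorageId.externalId(id);
>>             RemoteUser user = client.findById(externalId);
>>             return user == null ? null : adapt(realm, user);
>>         }
>>
>>         @Override
>>         public UserModel getUserByUsername(String username, RealmModel realm) {
>>             RemoteUser user = client.findByUsername(username);
>>             return user == null ? null : adapt(realm, user);
>>         }
>>
>>         @Override
>>         public UserModel getUserByEmail(String email, RealmModel realm) {
>>             RemoteUser user = client.findByEmail(email);
>>             return user == null ? null : adapt(realm, user);
>>         }
>>
>>         // wrap the remote record in a read-only Keycloak user adapter
>>         private UserModel adapt(RealmModel realm, final RemoteUser user) {
>>             return new AbstractUserAdapter(session, realm, model) {
>>                 @Override
>>                 public String getUsername() {
>>                     return user.getUsername();
>>                 }
>>             };
>>         }
>>
>>         // stand-ins for the remote service integration
>>         public interface RemoteUserClient {
>>             RemoteUser findById(String externalId);
>>             RemoteUser findByUsername(String username);
>>             RemoteUser findByEmail(String email);
>>         }
>>
>>         public interface RemoteUser {
>>             String getUsername();
>>         }
>>     }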
>>
>> One thing I noticed is that node2 (the one that
>> doesn't get hit on the initial password auth) has
>> to do a lookup via getUserById the first time it
>> handles a grant_type=refresh_token auth. Is the
>> data it needs not shared across the cluster? It
>> seems to be cached only locally on the node. Just
>> as a test I tried to set all configured non-local
>> caches to be replicated and it didn't help. Any
>> thoughts about this?
>>
>> Thanks,
>> DV
>>
>