Thanks for the responses, folks. The issue now isn't the inability to set
all caches to replicated. It's that get-user-by-id is called whenever a
node has to process an authentication via a refresh token that wasn't
issued by that same node. See the last paragraph of my earlier message
quoted below.
The results are cached, but only on the original issuing node. I was
expecting the user-by-id information to be shared between Keycloak nodes to
avoid external service calls, but perhaps this is by design? If so, could
you explain why?
On Tue, Sep 25, 2018 at 9:14 AM Marek Posolda <mposolda(a)redhat.com> wrote:
Some more info about our caches:
https://www.keycloak.org/docs/latest/server_installation/index.html#cache...
Not sure if this info should be updated and some more things
clarified?
Marek
On 25/09/18 15:12, Marek Posolda wrote:
Sorry, I did not read the whole thread.
Just a quick note: the caches "realms", "users", "keys", and
"authorization" are supposed to be local caches. The pattern we're using
ATM is that every cluster node caches its data (realms, users, etc.)
locally. In case some objects are updated (e.g. a realm or user), there
is a separate cache, "work", which makes sure to notify the other cluster
nodes (or even nodes on all the other DCs), so all the nodes can invalidate
the particular cached object from their caches.
Caches "realms", "users", "keys", and "authorization" are not meant to be
replicated/distributed, but local. So this NotSerializableException doesn't
look like a bug to me.
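For reference, that local-plus-invalidation layout corresponds to a config along these lines (an illustrative fragment modeled on the distribution's standalone-ha.xml, not a complete configuration):

```xml
<!-- Each node keeps "realms"/"users" locally; the replicated "work" cache
     carries invalidation messages so the other nodes can evict stale
     entries. (Illustrative fragment, not a full cache-container config.) -->
<cache-container name="keycloak" jndi-name="infinispan/Keycloak">
    <transport lock-timeout="60000"/>
    <local-cache name="realms">
        <eviction max-entries="10000" strategy="LRU"/>
    </local-cache>
    <local-cache name="users">
        <eviction max-entries="10000" strategy="LRU"/>
    </local-cache>
    <replicated-cache name="work" mode="SYNC"/>
</cache-container>
```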
Marek
On 25/09/18 12:06, Sebastian Laskawiec wrote:
Thanks a lot for checking this.
This seems like a bug to me, so I filed
https://issues.jboss.org/browse/KEYCLOAK-8415. Unfortunately, we are
preparing for some urgent work on the product side and I can't promise you
when we will be able to look into this. I highly encourage you to
contribute a fix if you are in a hurry, or just subscribe to the ticket and
wait till we find a free slot to get it fixed.
Thanks,
Sebastian
On Thu, Sep 20, 2018 at 4:27 PM D V <dv(a)glyphy.com> wrote:
> OK. So, with all caches being replicated, there's an error on startup:
>
> 2018-09-20 14:03:38,307 ERROR
> [org.infinispan.remoting.rpc.RpcManagerImpl] (ServerService Thread Pool --
> 62) ISPN000073: Unexpected error while replicating:
> org.infinispan.commons.marshall.NotSerializableException:
> org.keycloak.models.PasswordPolicy$Builder
> Caused by: an exception which occurred:
> in field org.keycloak.models.PasswordPolicy.builder
> in object org.keycloak.models.PasswordPolicy@6ab5350d
> in field
> org.keycloak.models.cache.infinispan.entities.CachedRealm.passwordPolicy
> in object
> org.keycloak.models.cache.infinispan.entities.CachedRealm@7864be21
> in object org.infinispan.commands.write.PutKeyValueCommand@fec4dc5e
> in object org.infinispan.commands.remote.SingleRpcCommand@3f2e5d1a
>
> If I make the realms cache local but leave the rest replicated, I observe
> the same behaviour: the node that didn't issue the original set of
> refresh/access tokens does a getUserById lookup, which in my case results
> in a network call against a remote service.
>
> I noticed there are caches running that aren't mentioned in the config,
> like userRevisions. These are local, and adding them to the config as
> replicated doesn't actually make them replicated.
>
> On Thu, Sep 20, 2018 at 7:36 AM Sebastian Laskawiec <slaskawi(a)redhat.com>
> wrote:
>
>> Could you please try to unify the caches? Please replace all local-cache
>> and distributed-cache entries with replicated-cache.
>>
>> Even though using distributed caches instead of replicated ones shouldn't
>> be the cause, I think those local caches might cause the behaviour you're
>> describing.
>>
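Concretely, the suggested unification amounts to rewriting each cache definition along these lines (a hypothetical before/after sketch using the "realms" cache from the config posted later in this thread):

```xml
<!-- Before: node-local cache, entries visible only to the owning node -->
<local-cache name="realms" statistics-enabled="true">
    <eviction max-entries="10000" strategy="LRU"/>
</local-cache>

<!-- After: same cache, with every entry copied to every cluster member -->
<replicated-cache name="realms" mode="SYNC" statistics-enabled="true"/>
```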
>> On Wed, Sep 19, 2018 at 3:21 PM D V <dv(a)glyphy.com> wrote:
>>
>>> Makes sense re: replicated caches. Here's my infinispan subsystem
>>> config right now:
>>>
>>> <subsystem xmlns="urn:jboss:domain:infinispan:4.0">
>>>     <cache-container name="keycloak" jndi-name="infinispan/Keycloak" statistics-enabled="true">
>>>         <transport lock-timeout="60000"/>
>>>         <local-cache name="realms" statistics-enabled="true">
>>>             <eviction max-entries="10000" strategy="LRU"/>
>>>         </local-cache>
>>>         <local-cache name="users" statistics-enabled="true">
>>>             <eviction max-entries="10000" strategy="LRU"/>
>>>         </local-cache>
>>>
>>>         <!--
>>>             These two need to be replicated or the node that didn't issue
>>>             the initial refresh token will return "invalid_grant" errors
>>>             when attempting to auth with that refresh token.
>>>         -->
>>>         <replicated-cache name="sessions" statistics-enabled="true"/>
>>>         <replicated-cache name="clientSessions" statistics-enabled="true"/>
>>>
>>>         <distributed-cache name="authenticationSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>         <distributed-cache name="offlineSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>         <distributed-cache name="offlineClientSessions" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>         <distributed-cache name="loginFailures" mode="SYNC" owners="1" statistics-enabled="true"/>
>>>         <local-cache name="authorization" statistics-enabled="true">
>>>             <eviction max-entries="10000" strategy="LRU"/>
>>>         </local-cache>
>>>         <replicated-cache name="work" mode="SYNC" statistics-enabled="true"/>
>>>         <local-cache name="keys" statistics-enabled="true">
>>>             <eviction max-entries="1000" strategy="LRU"/>
>>>             <expiration max-idle="3600000"/>
>>>         </local-cache>
>>>         <distributed-cache name="actionTokens" mode="SYNC" owners="2" statistics-enabled="true">
>>>             <eviction max-entries="-1" strategy="NONE"/>
>>>             <expiration max-idle="-1" interval="300000"/>
>>>         </distributed-cache>
>>>     </cache-container>
>>>     <cache-container name="server" aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server">
>>>         <transport lock-timeout="60000"/>
>>>         <replicated-cache name="default">
>>>             <transaction mode="BATCH"/>
>>>         </replicated-cache>
>>>     </cache-container>
>>>     <cache-container name="web" default-cache="dist" module="org.wildfly.clustering.web.infinispan">
>>>         <transport lock-timeout="60000"/>
>>>         <distributed-cache name="dist">
>>>             <locking isolation="REPEATABLE_READ"/>
>>>             <transaction mode="BATCH"/>
>>>             <file-store/>
>>>         </distributed-cache>
>>>     </cache-container>
>>>     <cache-container name="ejb" aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan">
>>>         <transport lock-timeout="60000"/>
>>>         <distributed-cache name="dist">
>>>             <locking isolation="REPEATABLE_READ"/>
>>>             <transaction mode="BATCH"/>
>>>             <file-store/>
>>>         </distributed-cache>
>>>     </cache-container>
>>>     <cache-container name="hibernate" default-cache="local-query" module="org.hibernate.infinispan">
>>>         <transport lock-timeout="60000"/>
>>>         <local-cache name="local-query">
>>>             <eviction strategy="LRU" max-entries="10000"/>
>>>             <expiration max-idle="100000"/>
>>>         </local-cache>
>>>         <invalidation-cache name="entity">
>>>             <transaction mode="NON_XA"/>
>>>             <eviction strategy="LRU" max-entries="10000"/>
>>>             <expiration max-idle="100000"/>
>>>         </invalidation-cache>
>>>         <replicated-cache name="timestamps" mode="ASYNC"/>
>>>     </cache-container>
>>> </subsystem>
>>>
>>> The scenario I'm testing:
>>> 1. Auth with grant_type=password on node1.
>>> 2. Shut down node1.
>>> 3. Auth with grant_type=refresh_token on node2.
>>>
>>> When clientSessions is not replicated (distributed, with owners=1, as
>>> in the distribution's standalone-ha.xml), I get this on node2:
>>> {
>>>     "error": "invalid_grant",
>>>     "error_description": "Session doesn't have required client"
>>> }
>>>
>>> When sessions is not replicated:
>>> {
>>>     "error": "invalid_grant",
>>>     "error_description": "Session not active"
>>> }
>>>
>>> On Wed, Sep 19, 2018 at 6:56 AM Sebastian Laskawiec <
>>> slaskawi(a)redhat.com> wrote:
>>>
>>>> Thanks for letting us know DV!
>>>>
>>>> Setting the number of owners equal to the cluster size doesn't make
>>>> any sense. You might use a replicated cache in that scenario (which
>>>> works the same way apart from some Infinispan-internal behavior, which
>>>> can be ignored in your case). Could you please paste your Infinispan
>>>> configuration? Maybe there's some hint there...
>>>>
>>>> Thanks,
>>>> Seb
>>>>
>>>> On Tue, Sep 18, 2018 at 11:02 PM D V <dv(a)glyphy.com> wrote:
>>>>
>>>>> The issue was resolved in a somewhat unexpected way. I had a custom
>>>>> org.keycloak.storage.UserStorageProviderFactory SPI registered that
>>>>> returned providers implementing
>>>>> org.keycloak.storage.user.UserLookupProvider, but the
>>>>> org.keycloak.storage.user.UserLookupProvider#getUserById method wasn't
>>>>> fully implemented: I just had it return null. It wasn't obvious to me
>>>>> that it was required (or under what circumstances). Once I implemented
>>>>> it, the experiments in my original message passed. I did have to set
>>>>> owners to 2 for the "sessions" and "clientSessions" distributed cache
>>>>> Infinispan configs.
>>>>>
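For reference, the owners change described above amounts to something like the following (a sketch based on the distribution's standalone-ha.xml defaults, with owners bumped from 1 to 2):

```xml
<!-- sessions/clientSessions kept as distributed caches, but with two
     owners so a second node always holds a copy of each entry and can
     validate refresh tokens after the issuing node goes down. -->
<distributed-cache name="sessions" mode="SYNC" owners="2" statistics-enabled="true"/>
<distributed-cache name="clientSessions" mode="SYNC" owners="2" statistics-enabled="true"/>
```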
>>>>> One thing I noticed is that node2 (the one that doesn't get hit on
>>>>> the initial password auth) has to do a lookup via getUserById the
>>>>> first time it handles a grant_type=refresh_token auth. Is the data it
>>>>> needs not shared across the cluster? It seems to be cached only
>>>>> locally on the node. Just as a test I tried to set all configured
>>>>> non-local caches to be replicated and it didn't help. Any thoughts
>>>>> about this?
>>>>>
>>>>> Thanks,
>>>>> DV
>>>>>