[keycloak-dev] Issues with clustered invalidation caches

Marek Posolda mposolda at redhat.com
Fri Nov 4 03:55:28 EDT 2016


Thanks for the tips. More inline

On 03/11/16 18:10, Bill Burke wrote:
> +1 to this.  We are already hacking Infinispan horribly and doing nasty
> workarounds to get the behavior we want.
>
> Some notes:
>
> * You have to be sure to invalidate indexes (email->user,
> username->user) correctly
>
> * We cache relationships in some scenarios, e.g. the list of clients in
> a realm.  Your proposal makes things so much easier for caching
> relationships, as we know the difference between a realm delete, a
> realm update, and a client create, update, and removal
>
> * The cache is currently not consulted at all for queries that return
> multiple users (getUsers(), searchForUser(), etc.).  We invalidate the
> realm on these queries as we don't know if the caller is going to
> perform any updates on those users.
>
> * Role and group removal is a tricky thing as role/group mappings are
> cached with the user.  You still have the iteration problem discussed
> earlier and you have to decide which is more efficient: evicting all
> users in the realm, or iterating over all users that might have this
> role/group mapping.  There are also client scope mappings, client mappers,
> and broker mappers which may also reference the role/group.
Yep, right now we just evict all users when any role/client/group
is removed. A JIRA already exists for better handling of that:
https://issues.jboss.org/browse/KEYCLOAK-3860 .

Instead of evicting all users (which is what we do now) or iterating over
all users, I was thinking about a third solution: don't evict any users
straight away, but keep them in the cache. So if role 123 was removed,
users with this roleId will still be in the cache. However, when an
operation on the user is called (e.g. user.getRoleMappings()), handle it in
the UserAdapter itself and just ignore the removed role. For example,
user.getRoleMappings() would look like:

Set<RoleModel> roles = new HashSet<>();
for (String roleId : roleIds) {
    RoleModel role = session.realms().getRole(roleId);
    if (role == null) {
        // Assume role 123 was deleted, so just ignore it
        continue;
    }
    roles.add(role);
}
return roles;

The same applies for clients, where we need to handle consents, client role
mappings and maybe more...

The thing is, there are some other issues with this solution, like:
- A potential risk of regressions (NPEs etc.)
- The fact that we don't cache unsuccessful DB lookups right now. For
example, when I query the DB for role "123" and the DB returns null, we
don't store anything in the cache ATM. So the information that "role 123
doesn't exist" is not in the cache.
That's not good for this solution: if we keep users with outdated role
mappings in the cache, there is a chance that the DB will be queried
thousands of times for the non-existing role "123". So IMO we will need to
somehow cache failed DB lookups too, if we want to go with this
solution...
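
For illustration, a minimal sketch of such negative caching (the class and
the String-based lookup are hypothetical stand-ins, not the actual
Keycloak API):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Failed lookups are cached as Optional.empty(), so repeated queries
// for a deleted role are answered from the cache instead of the DB.
class NegativeCachingRoleLookup {

    private final Map<String, Optional<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLookup;

    NegativeCachingRoleLookup(Function<String, String> dbLookup) {
        this.dbLookup = dbLookup;
    }

    String getRole(String roleId) {
        // computeIfAbsent stores the result even when the DB returned null
        Optional<String> cached = cache.computeIfAbsent(roleId,
                id -> Optional.ofNullable(dbLookup.apply(id)));
        return cached.orElse(null); // empty Optional == "role known to be deleted"
    }
}

The empty Optional acts as a tombstone; it would of course have to be
evicted if the role is ever re-created.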

Anyway, it's probably ok to postpone
https://issues.jboss.org/browse/KEYCLOAK-3860 and do it later. It looks
like a fair amount of work with potential regressions, and it's not
directly related to the clustering and invalidation issues seen by the
customer, so it's fine to postpone if we don't have time to do it now.

>
> * If a user storage provider is removed, this is also a tricky scenario
> and you may be better off evicting all users instead of iterating over
> all cached users to find all that were loaded by this provider.
+1. It seems that removing a UserStorage provider will be a very rare
scenario, so evicting all users is probably fine. That's what we are
doing already.
>
> * Client removal is tricky because you also have to remove all user role
> mappings, scope mappings, that are cached.
AFAIK we already have most of the predicates you wrote for invalidating
all the "dependent" objects. However, I will double-check... In
particular, I am not sure if removing a role also evicts broker mappers
and protocolMappers right now...
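
For illustration, a sketch of what such an eviction predicate looks like
(the class name and the CachedObject interface are hypothetical, not the
actual Keycloak classes):

import java.util.Map;
import java.util.function.Predicate;

// Matches every cached entry that references the removed role, so all
// "dependent" objects (role mappings, scope mappings, broker mappers,
// protocolMappers, ...) can be evicted in one filtered pass over the cache.
class RoleRemovedPredicate implements Predicate<Map.Entry<String, CachedObject>> {

    private final String roleId;

    RoleRemovedPredicate(String roleId) {
        this.roleId = roleId;
    }

    @Override
    public boolean test(Map.Entry<String, CachedObject> entry) {
        return entry.getValue().references(roleId);
    }
}

interface CachedObject {
    // true if this cached object holds a reference to the given role id
    boolean references(String roleId);
}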

Marek
>
> There may be other cases I'm missing.
>
>
> On 11/3/16 7:06 AM, Marek Posolda wrote:
>> I was looking at the cache issue reported by the customer. I found its
>> cause and a couple of other related issues:
>>
>> KEYCLOAK-3857 - Bad performance with clustered invalidation cache when
>> updating object
>> KEYCLOAK-3858 - Removing model object send lots of invalidation messages
>> across cluster
>> KEYCLOAK-3859 - Lots of userCache invalidation messages when
>> invalidating realm
>> KEYCLOAK-3860 - All realm users are invalidated from cache when removing
>> some realm object
>>
>>
>> In short, our cache works fine in local mode. But in a cluster, there
>> are issues with the invalidation caches. We don't have issues with
>> stale entries, but this comes at the cost of various performance
>> issues, like:
>>
>> - There are lots of invalidation messages sent across the cluster
>>
>> - Eviction on the node that received the invalidation event is also very
>> inefficient. For example, evicting a realm with 1000 roles needs to call
>> 1000 predicates, each of which iterates the cache, so the cache is
>> iterated 1000 times.
>>
>> - The invalidation cache doesn't allow us to distinguish the context in
>> which the object was invalidated. For example, when I update realm
>> settings on node1, I need to invalidate just the CachedRealm object, but
>> not all the other objects dependent on the realm. However, the
>> invalidation event received on node2 doesn't know whether I invalidated
>> CachedRealm because of a realm update or because of a realm removal. So
>> to be safe, it assumes removal, which evicts all realm objects! See
>> https://issues.jboss.org/browse/KEYCLOAK-3857 for details.
>>
>> - Finally, we have the workaround with the "invalidation.key" objects in
>> our invalidation caches. This is currently needed because when
>> invalidating an object on node1, the invalidation event is NOT received
>> on node2 unless the object is present there. Hence the workaround with
>> the "invalidation.key" records, just to avoid this limitation of the
>> invalidation cache.
>>
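
As an aside, to illustrate why the predicate-based eviction above is so
costly (a sketch only: the cache is modeled as a plain Map and
"references" is a hypothetical method), the current approach runs one
full-cache scan per invalidated object, while a combined pass would visit
every entry exactly once:

import java.util.Map;
import java.util.Set;

class EvictionCostSketch {

    interface CachedEntry {
        // true if this cached object holds a reference to the given id
        boolean references(String id);
    }

    // Current approach: one full-cache scan per invalidated role,
    // i.e. a realm with 1000 roles triggers 1000 scans.
    static void evictPerObject(Map<String, CachedEntry> cache, Set<String> roleIds) {
        for (String roleId : roleIds) {
            cache.values().removeIf(entry -> entry.references(roleId));
        }
    }

    // Combined pass: every cache entry is visited exactly once.
    static void evictSinglePass(Map<String, CachedEntry> cache, Set<String> roleIds) {
        cache.values().removeIf(entry -> roleIds.stream().anyMatch(entry::references));
    }
}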
>>
>> To solve all these issues, I propose:
>> - Instead of relying on invalidation caches, we will send a notification
>> across the cluster describing what happened (e.g. the message "realm XY
>> was updated"). All the nodes will receive this notification, evict their
>> locally cached objects accordingly, and bump their revisions locally.
>> This would be much more stable and performant, and will allow us to
>> remove some workarounds.
>>
>> Some details:
>>
>> - The caches "realms" and "users" won't be "invalidation" caches
>> anymore; they will be "local" caches.
>>
>> - When any object needs to be removed from the cache for some reason
>> (e.g. updating a realm), a notification message will be sent from node1
>> to all other cluster nodes. We will use a replicated cache for that.
>> Node1 will send a notification message like "realm XY was updated".
>>
>> - The other cluster nodes will receive this message and will locally
>> trigger evictions of all the objects dependent on the particular realm.
>> In the case of a realm update, it's just the CachedRealm object itself;
>> in the case of a realm removal, it's all realm objects, etc.
>>
>> - Note that the message will also contain the context: "realm XY was
>> updated" or "realm XY was removed", not just "invalidate realm XY". This
>> allows much more flexibility and in particular avoids issues like
>> https://issues.jboss.org/browse/KEYCLOAK-3857 .
>>
>> - We already have a replicated cache called "work", which we use to
>> notify other cluster nodes about various events, so we will just use
>> this one. There's no need to add another replicated cache; we will
>> probably just need to configure LRU eviction for the existing one.
>>
>> - Also note that the messages will always be received. We won't need the
>> workaround with the "invalidation.key" objects anymore.
>>
>> - Also, we won't need recursive evictions (which have very poor
>> performance; see https://issues.jboss.org/browse/KEYCLOAK-3857 ),
>> because the receiving node will know exactly what happened. It will
>> remove objects just the same way as the "sender" node.
>>
>> - Finally, the amount of traffic sent across the cluster will be much lower.
>>
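
For illustration, a minimal sketch of this notification mechanism,
assuming an embedded Infinispan replicated cache plays the role of the
"work" cache (the listener class and the notification format are
hypothetical, not the final Keycloak code; workCache is assumed to be a
Cache<String, String>):

import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachelistener.annotation.CacheEntryCreated;
import org.infinispan.notifications.cachelistener.event.CacheEntryCreatedEvent;

// Receiving side, registered once per node via
// workCache.addListener(new WorkCacheListener()). Because the cache is
// replicated, the written entry (and hence this event) shows up on every
// node; each node then evicts only from its own local caches.
@Listener
public class WorkCacheListener {

    @CacheEntryCreated
    public void onNotification(CacheEntryCreatedEvent<String, String> event) {
        if (event.isPre()) {
            return; // react only after the entry is actually stored
        }
        String notification = event.getValue(); // e.g. "realm XY was updated"
        // Purely local evictions, no further cluster traffic from here:
        // "realm XY was updated" -> evict just the CachedRealm object
        // "realm XY was removed" -> evict all objects of realm XY
    }
}

// Sending side (node1):
// workCache.put(java.util.UUID.randomUUID().toString(), "realm XY was updated");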
>> This sounds like a big step, but IMO it's not that bad :) Note that we
>> already have all the predicates in place for individual objects. The
>> only change will be about sending/receiving the notifications across the
>> cluster. I think I will be able to prototype something by tomorrow to
>> double-check that this approach works, and then finish it sometime in
>> the middle of next week. WDYT?
>>
>> Marek
>>