----- Original Message -----
> From: "Nicolas Ocquidant" <nocquidant(a)gmail.com>
> To: mposolda(a)redhat.com
> Cc: slaskawi(a)redhat.com, wburns(a)redhat.com, keycloak-user(a)lists.jboss.org
> Sent: Tuesday, November 20, 2018 8:41:58 AM
> Subject: Re: [keycloak-user] OOM at startup with 1 million sessions
>
> Actually, the OOM does not come right after remoteCache#size, and maybe it
> consumes up to 3GB because I have enough heap space (my laptop has 16GB
> RAM)...
>
> HeapDumpOnOutOfMemory from the ISPN process shows that the heap is full of
> byte[3][] objects (4 threads):
>
> https://gist.github.com/nocquidant/4e84eb8d704b56221a9865cb63951734
Okay, I did some digging, and it looks like the fetch size is ignored when running the
select against Postgres. The cause is that Postgres doesn't do paging unless auto-commit
is disabled on the connection [1]. This causes all the entries to be in memory at once.
This will be fixed in [2].
Also, a side note for Keycloak: I would recommend adding some backpressure support at [3]
- if the puts are too slow, you could end up with a large number of entries in memory at
the same time.
[1]
https://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
[2]
https://issues.jboss.org/browse/ISPN-9752
[3]
https://github.com/keycloak/keycloak/blob/4.6.0.Final/model/infinispan/sr...
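For reference, here is a minimal JDBC sketch of the behaviour [1] describes (this is illustrative, not the Infinispan store code): the Postgres driver only streams results with a cursor when auto-commit is off, the fetch size is positive, and the ResultSet is forward-only; otherwise the whole result set is materialized in memory at once. The `streamRows` and `cursorModeActive` names below are made up for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class CursorFetchSketch {

    // Stream rows in batches instead of materializing the whole result set.
    // Per the Postgres JDBC docs, cursor-based fetching only kicks in when
    // auto-commit is off, the fetch size is > 0, and the ResultSet is
    // TYPE_FORWARD_ONLY -- otherwise the driver loads everything at once.
    static void streamRows(Connection conn, String sql, int fetchSize) throws SQLException {
        conn.setAutoCommit(false); // required, or Postgres ignores the fetch size
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(fetchSize); // rows pulled per round-trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process one row; only ~fetchSize rows are held in memory
                }
            }
        }
    }

    // The three conditions above, expressed as a checkable predicate.
    static boolean cursorModeActive(boolean autoCommit, int fetchSize, boolean forwardOnly) {
        return !autoCommit && fetchSize > 0 && forwardOnly;
    }
}
```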
Thanks for that point. Maybe we can improve this by calling
"cache.putAll" instead? It is a synchronous operation, so it should help with
the "backpressure", and at the same time it doesn't send a big number of
messages across the network. I mean, if I have a distributed cache with 2
owners, then calling "cache.putAll" with 100 records won't send 50
messages across the network to the second cluster node, but instead just
1 message with 50 records (the remaining 50 records will be saved on the local
cluster node, with no need to send them over the network)?
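As a rough illustration of that idea (plain Java, with a `Map` standing in for the remote cache - this is a sketch, not the actual Keycloak loader code): splitting the preload into fixed-size `putAll` batches gives one synchronous call per batch, so the loader cannot run ahead of the store, and the cluster sees roughly one message per batch instead of one per entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BatchedPutSketch {

    // Push entries in fixed-size batches via putAll. Each putAll is synchronous,
    // so the caller naturally waits for the previous batch to be stored
    // (backpressure), and only one batch is buffered at a time.
    // "cache" is a plain Map here as a stand-in for the real remote cache.
    static <K, V> int putAllInBatches(Map<K, V> cache, Map<K, V> entries, int batchSize) {
        Map<K, V> batch = new LinkedHashMap<>();
        int batches = 0;
        for (Map.Entry<K, V> e : entries.entrySet()) {
            batch.put(e.getKey(), e.getValue());
            if (batch.size() == batchSize) {
                cache.putAll(batch);   // one synchronous call for the whole batch
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {        // flush the trailing partial batch
            cache.putAll(batch);
            batches++;
        }
        return batches;
    }
}
```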
Another possibility is to remove the need for preloading entirely - this
could be possible if we have a way to skip saving sessions on the KC side
altogether and have them saved just on the remote JDG server. Not yet sure if
this is possible... Looking forward to next week to discuss with you :)
Marek
> According to me, OOM occurs in the following method (KC side), when getting
> from the future:
>
> protected void startLoadingImpl(InitializerState state, SessionLoader.LoaderContext ctx) {
>     // Assume each worker has same processor's count
>     int processors = Runtime.getRuntime().availableProcessors();
>
>     ExecutorService localExecutor = Executors.newCachedThreadPool();
>     Transport transport = workCache.getCacheManager().getTransport();
>     boolean distributed = transport != null;
>     ExecutorService executorService = distributed ? new DefaultExecutorService(workCache, localExecutor) : localExecutor;
>
>     int errors = 0;
>
>     try {
>         while (!state.isFinished()) {
>             int nodesCount = transport==null ? 1 : transport.getMembers().size();
>             int distributedWorkersCount = processors * nodesCount;
>
>             log.debugf("Starting next iteration with %d workers", distributedWorkersCount);
>
>             List<Integer> segments = state.getUnfinishedSegments(distributedWorkersCount);
>
>             if (log.isTraceEnabled()) {
>                 log.trace("unfinished segments for this iteration: " + segments);
>             }
>
>             List<Future<WorkerResult>> futures = new LinkedList<>();
>             for (Integer segment : segments) {
>                 SessionInitializerWorker worker = new SessionInitializerWorker();
>                 worker.setWorkerEnvironment(segment, ctx, sessionLoader);
>                 if (!distributed) {
>                     worker.setEnvironment(workCache, null);
>                 }
>
>                 Future<WorkerResult> future = executorService.submit(worker);
>                 futures.add(future);
>             }
>
>             for (Future<WorkerResult> future : futures) {
>                 try {
>                     WorkerResult result = future.get(); <--- HERE
>
>                     if (result.getSuccess()) {
>                         int computedSegment = result.getSegment();
>                         state.markSegmentFinished(computedSegment);
>                     } else {
>                         if (log.isTraceEnabled()) {
>                             log.tracef("Segment %d failed to compute", result.getSegment());
>                         }
>                     }
>                 } catch (InterruptedException ie) {
>                     errors++;
>                     log.error("Interruped exception when computed future. Errors: " + errors, ie);
>                 } catch (ExecutionException ee) {
>                     errors++;
>                     log.error("ExecutionException when computed future. Errors: " + errors, ee);
>                 }
>             }
>
>             if (errors >= maxErrors) {
>                 throw new RuntimeException("Maximum count of worker errors occured. Limit was " + maxErrors + ". See server.log for details");
>             }
>
>             saveStateToCache(state);
>
>             log.debugf("New initializer state pushed. The state is: %s", state);
>         }
>
>         // Loader callback after the task is finished
>         this.sessionLoader.afterAllSessionsLoaded(this);
>
>     } finally {
>         if (distributed) {
>             executorService.shutdown();
>         }
>         localExecutor.shutdown();
>     }
> }
>
>
>
>
> Le mar. 20 nov. 2018 à 14:11, Marek Posolda <mposolda(a)redhat.com> a écrit :
>
>> On 20/11/2018 13:42, Sebastian Laskawiec wrote:
>>
>> If I'm not mistaken (+William Burns <wburns(a)redhat.com> is an expert
>> here), invoking RemoteCache#size() method requires to pull all entries from
>> the store into the memory. If so, that's why it blows up.
To confirm: Infinispan iteration only pulls a batch at a time. In 9.4 we are using an
rxjava2 Publisher, which batches 128 entries at a time. Unfortunately, setting the fetch
size on the JDBC connection alone was not enough to make the JDBC driver batch, though :P
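To illustrate the memory behaviour being described (plain Java - a sketch of the idea, not the actual rxjava2 `Publisher` machinery): consuming an iterator in fixed-size chunks keeps at most one chunk of entries in memory at a time, regardless of how many entries the underlying store holds.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchingIteratorSketch {

    static final int BATCH_SIZE = 128; // matches the batch size quoted above

    // Drain an iterator one fixed-size chunk at a time. Only the current chunk
    // is held in memory, mirroring how a batching Publisher caps the number of
    // in-flight entries.
    static <T> int drainInBatches(Iterator<T> it, Consumer<List<T>> sink) {
        int chunks = 0;
        List<T> chunk = new ArrayList<>(BATCH_SIZE);
        while (it.hasNext()) {
            chunk.add(it.next());
            if (chunk.size() == BATCH_SIZE) {
                sink.accept(chunk);            // hand off a full chunk
                chunk = new ArrayList<>(BATCH_SIZE);
                chunks++;
            }
        }
        if (!chunk.isEmpty()) {                // hand off the trailing partial chunk
            sink.accept(chunk);
            chunks++;
        }
        return chunks;
    }
}
```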
>> The only way that comes into my mind is to get rid of RemoteCache#size()
>> calls from Keycloak (+Marek Posolda <mposolda(a)redhat.com> WDYT, is it
>> doable?).
>>
>> Hmm... I think that remoteCache#size doesn't need to load the entries into
>> memory? I see that the HotRod protocol has a "size" operation, and that one
>> seems to be called under the covers when you call RemoteCache#size (on the
>> client side, the Java class is
>> org.infinispan.client.hotrod.impl.operations.SizeOperation).
>>
>> I am not sure how much memory is needed to hold 1 million user
>> sessions. We plan to have some tests for this, but haven't yet tested it
>> with cross-dc enabled. Do you have both JDG and Keycloak on the same
>> machine? Maybe it will help to increase or decrease the value of "Xmx" for
>> both JDG and Keycloak. AFAIK, if it's too big, it means that garbage
>> collection doesn't run very often while new records keep being added to
>> memory, and the JDK process doesn't have enough space to handle it. But I'm
>> not 100% sure if that can help.
>>
>> Marek
>>
>>
>> I'm just wondering if adding more memory to the ISPN process is an option
>> in your case?
>>
>> On Tue, Nov 20, 2018 at 11:12 AM Nicolas Ocquidant <nocquidant(a)gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> My UC is to keep one year of sessions; this is what I have done to
>>> simulate it:
>>>
>>> 1. I use JDBC to store 1 million session objects, 2.8KB (in memory) each
>>> 2. I start one Infinispan node with passivation=false and shared=true, and
>>> Xmx=8G
>>> 3. I start one Keycloak node configured with a remote-cache and Xmx=4G
>>>
>>> Note that I use a remote-store in KC as it is the only way to set
>>> passivation=false and shared=true (see
>>> http://lists.jboss.org/pipermail/keycloak-user/2018-November/016180.html
>>> ).
>>>
>>> No problem in step 2; the ISPN process is less than 300MB in memory. But
>>> after step 3, the ISPN process goes up to around 6GB.
>>>
>>> See below for the traces, but basically I get OOM.
>>>
>>> Using the debugger, I can see that getting the size of the cache is a
>>> really slow operation, and it bumps memory to 3GB in the ISPN process.
>>>
>>> From
>>> org.keycloak.models.sessions.infinispan.initializer.InfinispanCacheInitializer:
>>>
>>> RemoteCacheSessionsLoader.computeLoaderContext(KeycloakSession session) {
>>>     // ...
>>>     int sessionsTotal = remoteCache.size(); <--- HERE
>>>     // ...
>>> }
>>>
>>> And then (OOM here):
>>>
>>> InfinispanCacheInitializer.startLoadingImpl(InitializerState state,
>>>         SessionLoader.LoaderContext ctx) {
>>>     // ...
>>>     for (Future<WorkerResult> future : futures) {
>>>         // Called 4x (i.e. the number of segments), but the 1st one does
>>>         // not terminate: OOM
>>>         // -> org.infinispan.distexec.DefaultExecutorService$LocalDistributedTaskPart
>>>         WorkerResult result = future.get(); <--- Very slow and bumps mem to 8GB
>>>         // ...
>>>     }
>>>
>>> So, with 1M sessions in the store, I cannot get KC/ISPN to start. And I
>>> am far from my goal, which is to keep one year of sessions (estimated at
>>> ~52M sessions)...
>>>
>>> Is it something I can't achieve with KC/ISPN?
>>> Any help appreciated.
>>>
>>> Thanks
>>> nick
>>>
>>> Note, versions I used are:
>>>
>>> * ISPN 9.4.1 Wildfly
>>> * KC 4.6.0 Wildfly
>>>
>>> -- ISPN process
>>>
>>> 10:42:22,969 ERROR [stderr] (Periodic Recovery) Exception in thread
>>> "Periodic Recovery" java.lang.OutOfMemoryError: Java heap space
>>> 10:42:22,977 ERROR [stderr] (Periodic Recovery) at
>>> sun.text.resources.fr.FormatData_fr.getContents(FormatData_fr.java:86)
>>> 10:42:22,975 ERROR
>>> [org.infinispan.persistence.jdbc.connectionfactory.ManagedConnectionFactory]
>>> (pool-9-thread-1) ISPN008018: Sql failure retrieving connection from
>>> datasource: java.sql.SQLException: javax.resource.ResourceException:
>>> IJ000456: Unchecked throwable in ManagedConnection.getConnection()
>>> cl=org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@3e3890ff
>>> [state=NORMAL managed
>>> connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@3f033728
>>> connection handles=0 lastReturned=1542706942971 lastValidated=1542706738387
>>> lastCheckedOut=1542706916019 trackByTx=false
>>> pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@2772dcff
>>> mcp=SemaphoreConcurrentLinkedQueueManagedConnectionPool@5e82b87d
>>> [pool=InfinispanDS]
>>> xaResource=LocalXAResourceImpl@308d2c98[connectionListener=3e3890ff
>>> connectionManager=1691f7d6 warned=false currentXid=null
>>> productName=PostgreSQL productVersion=10.5
>>> jndiName=java:jboss/datasources/InfinispanDS] txSync=null]
>>>     at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:146)
>>>     at org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
>>>     at org.infinispan.persistence.jdbc.connectionfactory.ManagedConnectionFactory.getConnection(ManagedConnectionFactory.java:83)
>>>     at org.infinispan.persistence.jdbc.stringbased.JdbcStringBasedStore.purge(JdbcStringBasedStore.java:461)
>>>     at org.infinispan.persistence.manager.PersistenceManagerImpl.lambda$purgeExpired$6(PersistenceManagerImpl.java:459)
>>>     at java.util.ArrayList.forEach(ArrayList.java:1257)
>>>     at org.infinispan.persistence.manager.PersistenceManagerImpl.purgeExpired(PersistenceManagerImpl.java:462)
>>>     at org.infinispan.expiration.impl.ClusterExpirationManager.processExpiration(ClusterExpirationManager.java:119)
>>>     at org.infinispan.expiration.impl.ExpirationManagerImpl$ScheduledTask.run(ExpirationManagerImpl.java:245)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: javax.resource.ResourceException: IJ000456: Unchecked throwable
>>> in ManagedConnection.getConnection()
>>> cl=org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@3e3890ff
>>> [state=NORMAL managed
>>> connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@3f033728
>>> connection handles=0 lastReturned=1542706942971 lastValidated=1542706738387
>>> lastCheckedOut=1542706916019 trackByTx=false
>>> pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@2772dcff
>>> mcp=SemaphoreConcurrentLinkedQueueManagedConnectionPool@5e82b87d
>>> [pool=InfinispanDS]
>>> xaResource=LocalXAResourceImpl@308d2c98[connectionListener=3e3890ff
>>> connectionManager=1691f7d6 warned=false currentXid=null
>>> productName=PostgreSQL productVersion=10.5
>>> jndiName=java:jboss/datasources/InfinispanDS] txSync=null]
>>>     at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:811)
>>>     at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:138)
>>>     ... 15 more
>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>
>>>
>>> -- KC process
>>>
>>> 10:42:23,216 WARN [org.infinispan.client.hotrod.impl.protocol.Codec21]
>>> (Thread-0) ISPN004005: Error received from the server:
>>> java.lang.RuntimeException: java.sql.SQLException: Error
>>> java.sql.SQLException: Error
>>> java.lang.OutOfMemoryError: Java heap space
>>> 10:42:23,359 WARN
>>> [org.keycloak.models.sessions.infinispan.remotestore.RemoteCacheSessionsLoader]
>>> (pool-16-thread-4) Error loading sessions from remote cache 'sessions' for
>>> segment '3':
>>> org.infinispan.client.hotrod.exceptions.HotRodClientException: Request for
>>> messageId=55 returned server error (status=0x85):
>>> java.lang.RuntimeException: java.sql.SQLException: Error
>>> java.sql.SQLException: Error
>>> java.lang.OutOfMemoryError: Java heap space
>>>     at org.infinispan.client.hotrod.impl.protocol.Codec20.checkForErrorsInResponseStatus(Codec20.java:333)
>>>     at org.infinispan.client.hotrod.impl.protocol.Codec20.readHeader(Codec20.java:179)
>>>     at org.infinispan.client.hotrod.impl.transport.netty.HeaderDecoder.decode(HeaderDecoder.java:138)
>>>     at org.infinispan.client.hotrod.impl.transport.netty.HintedReplayingDecoder.callDecode(HintedReplayingDecoder.java:98)
>>>     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>     at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>     at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
>>>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:647)
>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:582)
>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
>>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:461)
>>>     at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> _______________________________________________
>>> keycloak-user mailing list
>>> keycloak-user(a)lists.jboss.org
>>>
>>> https://lists.jboss.org/mailman/listinfo/keycloak-user
>>>
>>