[keycloak-user] OOM at startup with 1 million sessions

Marek Posolda mposolda at redhat.com
Tue Nov 20 15:29:25 EST 2018


On 20/11/2018 16:33, William Burns wrote:
>
> ----- Original Message -----
>> From: "Nicolas Ocquidant" <nocquidant at gmail.com>
>> To: mposolda at redhat.com
>> Cc: slaskawi at redhat.com, wburns at redhat.com, keycloak-user at lists.jboss.org
>> Sent: Tuesday, November 20, 2018 8:41:58 AM
>> Subject: Re: [keycloak-user] OOM at startup with 1 million sessions
>>
>> Actually, the OOM does not come right after remoteCache#size - maybe that
>> call gets away with consuming up to 3GB only because I have enough heap
>> space (my laptop has 16GB of RAM)...
>>
>> HeapDumpOnOutOfMemory from the ISPN process shows that the heap is full of
>> byte[3][] objects (4 threads):
>> https://gist.github.com/nocquidant/4e84eb8d704b56221a9865cb63951734
> Okay, I did some digging and it looks like the fetch size is ignored when running the select against Postgres. The cause is that Postgres doesn't do paging unless auto-commit is disabled on the connection [1]. This causes all the entries to be in memory at once. This will be fixed in [2].
>
> Also, a side note for Keycloak: I would recommend adding some backpressure support at [3] - if the puts are too slow, you could end up with a large number of entries in memory at the same time.
>
> [1] https://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
> [2] https://issues.jboss.org/browse/ISPN-9752
> [3] https://github.com/keycloak/keycloak/blob/4.6.0.Final/model/infinispan/src/main/java/org/keycloak/models/sessions/infinispan/remotestore/RemoteCacheSessionsLoader.java#L111
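
For anyone else following along: the driver behaviour Will describes is the 
standard PostgreSQL JDBC pattern - the driver only streams results through a 
cursor when auto-commit is off and a fetch size is set, otherwise it 
materializes the whole result set. A minimal standalone sketch of that 
pattern (the connection URL, credentials and table name are made up, and this 
is not the actual JdbcStringBasedStore code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PostgresCursorExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/infinispan", "ispn", "secret")) {

                // Without this, the PostgreSQL driver ignores the fetch size and
                // loads the whole result set into memory at once.
                conn.setAutoCommit(false);

                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, data FROM sessions")) {
                    // With auto-commit off, the driver opens a cursor and fetches
                    // 128 rows per round trip instead of everything.
                    ps.setFetchSize(128);

                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // process one row at a time...
                        }
                    }
                }
            }
        }
    }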

Thanks for pointing that out. Maybe we can improve this by calling 
"cache.putAll" instead? It is a synchronous operation, so it should help 
with the "backpressure", and at the same time it doesn't send a big 
number of messages across the network. I mean, if I have a distributed 
cache with 2 owners, then calling "cache.putAll" with 100 records won't 
send 50 separate messages over the network to the second cluster node, 
but just 1 message with 50 records (the remaining 50 records are saved 
on the local cluster node and don't need to be sent over the network)?
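
Something like the sketch below is what I have in mind - just an 
illustration against the embedded Cache API, not tested, and the batch 
size constant is made up:

    import java.util.HashMap;
    import java.util.Map;
    import org.infinispan.Cache;

    public class BatchingSessionWriter {

        // Made-up value for illustration; the real batch size would need tuning.
        private static final int SESSIONS_BATCH_SIZE = 100;

        private final Cache<String, Object> workCache;
        private final Map<String, Object> pending = new HashMap<>();

        public BatchingSessionWriter(Cache<String, Object> workCache) {
            this.workCache = workCache;
        }

        // Instead of one cache.put() per loaded session (one message each),
        // collect sessions and flush them with a single synchronous putAll().
        // The blocking putAll() also gives natural backpressure: the loader
        // does not read the next batch from the remote cache until this one
        // has been stored.
        public void add(String sessionId, Object sessionEntity) {
            pending.put(sessionId, sessionEntity);
            if (pending.size() >= SESSIONS_BATCH_SIZE) {
                flush();
            }
        }

        public void flush() {
            if (!pending.isEmpty()) {
                // batched RPCs to the other owners instead of one message per entry
                workCache.putAll(pending);
                pending.clear();
            }
        }
    }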

Another possibility is to remove the need for preloading entirely - this 
could be possible if we had a way to skip saving sessions on the KC side 
altogether and keep them only on the remote JDG server. Not yet sure if 
this is possible... Looking forward to discussing it with you next week :)

Marek

>
>> As far as I can tell, the OOM occurs in the following method (KC side), when
>> getting the result from the future:
>>
>>     protected void startLoadingImpl(InitializerState state, SessionLoader.LoaderContext ctx) {
>>         // Assume each worker has the same processor count
>>         int processors = Runtime.getRuntime().availableProcessors();
>>
>>         ExecutorService localExecutor = Executors.newCachedThreadPool();
>>         Transport transport = workCache.getCacheManager().getTransport();
>>         boolean distributed = transport != null;
>>         ExecutorService executorService = distributed ? new DefaultExecutorService(workCache, localExecutor) : localExecutor;
>>
>>         int errors = 0;
>>
>>         try {
>>             while (!state.isFinished()) {
>>                 int nodesCount = transport == null ? 1 : transport.getMembers().size();
>>                 int distributedWorkersCount = processors * nodesCount;
>>
>>                 log.debugf("Starting next iteration with %d workers", distributedWorkersCount);
>>
>>                 List<Integer> segments = state.getUnfinishedSegments(distributedWorkersCount);
>>
>>                 if (log.isTraceEnabled()) {
>>                     log.trace("unfinished segments for this iteration: " + segments);
>>                 }
>>
>>                 List<Future<WorkerResult>> futures = new LinkedList<>();
>>                 for (Integer segment : segments) {
>>                     SessionInitializerWorker worker = new SessionInitializerWorker();
>>                     worker.setWorkerEnvironment(segment, ctx, sessionLoader);
>>                     if (!distributed) {
>>                         worker.setEnvironment(workCache, null);
>>                     }
>>
>>                     Future<WorkerResult> future = executorService.submit(worker);
>>                     futures.add(future);
>>                 }
>>
>>                 for (Future<WorkerResult> future : futures) {
>>                     try {
>>                         WorkerResult result = future.get(); <--- HERE
>>
>>                         if (result.getSuccess()) {
>>                             int computedSegment = result.getSegment();
>>                             state.markSegmentFinished(computedSegment);
>>                         } else {
>>                             if (log.isTraceEnabled()) {
>>                                 log.tracef("Segment %d failed to compute", result.getSegment());
>>                             }
>>                         }
>>                     } catch (InterruptedException ie) {
>>                         errors++;
>>                         log.error("Interrupted exception when computing future. Errors: " + errors, ie);
>>                     } catch (ExecutionException ee) {
>>                         errors++;
>>                         log.error("ExecutionException when computing future. Errors: " + errors, ee);
>>                     }
>>                 }
>>
>>                 if (errors >= maxErrors) {
>>                     throw new RuntimeException("Maximum count of worker errors occurred. Limit was " + maxErrors + ". See server.log for details");
>>                 }
>>
>>                 saveStateToCache(state);
>>
>>                 log.debugf("New initializer state pushed. The state is: %s", state);
>>             }
>>
>>             // Loader callback after the task is finished
>>             this.sessionLoader.afterAllSessionsLoaded(this);
>>
>>         } finally {
>>             if (distributed) {
>>                 executorService.shutdown();
>>             }
>>             localExecutor.shutdown();
>>         }
>>     }
>>
>>
>>
>>
>> On Tue, Nov 20, 2018 at 2:11 PM, Marek Posolda <mposolda at redhat.com> wrote:
>>
>>> On 20/11/2018 13:42, Sebastian Laskawiec wrote:
>>>
>>> If I'm not mistaken (+William Burns <wburns at redhat.com> is an expert
>>> here), invoking the RemoteCache#size() method requires pulling all entries
>>> from the store into memory. If so, that's why it blows up.
> To confirm, Infinispan iteration only pulls a batch at a time. In 9.4 we are using an rxjava2 Publisher which batches 128 entries at a time. Unfortunately, setting the fetch size on the JDBC connection alone was not enough to make the JDBC driver batch, though :P
>
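
For reference, the client side of that iteration corresponds to the Hot Rod 
remote iterator, which only keeps roughly one batch on the client at a time. 
A rough sketch of that pattern (not the exact RemoteCacheSessionsLoader code):

    import java.util.Map;
    import org.infinispan.client.hotrod.RemoteCache;
    import org.infinispan.commons.util.CloseableIterator;

    public class RemoteIterationSketch {

        // Iterate a remote cache with the Hot Rod remote iterator so that only
        // about batchSize entries are held on the client at any time (assuming
        // the server pages properly, which is exactly what ISPN-9752 addresses
        // on the JDBC store side).
        static void iterate(RemoteCache<Object, Object> remoteCache, int batchSize) {
            try (CloseableIterator<Map.Entry<Object, Object>> it =
                     remoteCache.retrieveEntries(null, batchSize)) {
                while (it.hasNext()) {
                    Map.Entry<Object, Object> entry = it.next();
                    // process one session entry at a time...
                }
            }
        }
    }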
>>> The only way that comes to my mind is to get rid of the RemoteCache#size()
>>> calls in Keycloak (+Marek Posolda <mposolda at redhat.com> WDYT, is it
>>> doable?).
>>>
>>> Hmm... I think that remoteCache#size doesn't need to load the entries into
>>> memory? I see that the HotRod protocol has a "size" operation, and that one
>>> seems to be called under the covers when you call RemoteCache#size (on the
>>> client side, the Java class is
>>> org.infinispan.client.hotrod.impl.operations.SizeOperation).
>>>
>>> I am not sure how much memory is needed to hold 1 million user sessions. We
>>> plan to have some tests for this, but we haven't tested it yet with
>>> cross-DC enabled. Do you have both JDG and Keycloak on the same machine?
>>> Maybe it will help to increase or decrease the "Xmx" value for both JDG and
>>> Keycloak. AFAIK if it's too big, the garbage collector is not called very
>>> often, new records keep being added to memory, and the JVM process doesn't
>>> have enough space to handle it. But I'm not 100% sure that will help.
>>>
>>> Marek
>>>
>>>
>>> I'm just wondering if adding more memory to the ISPN process is an option
>>> in your case?
>>>
>>> On Tue, Nov 20, 2018 at 11:12 AM Nicolas Ocquidant <nocquidant at gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> My use case is to keep one year of sessions; this is what I have done to
>>>> simulate it:
>>>>
>>>> 1. I use a JDBC store holding 1 million session objects, 2.8KB (in memory) each
>>>> 2. I start one Infinispan node with passivation=false and shared=true, and Xmx=8G
>>>> 3. I start one Keycloak node configured with a remote-cache and Xmx=4G
>>>>
>>>> Note that I use a remote-store in KC as it is the only way to set
>>>> passivation=false and shared=true (see
>>>> http://lists.jboss.org/pipermail/keycloak-user/2018-November/016180.html).
>>>>
>>>> No problem in step 2; the ISPN process stays below 300MB of memory. But
>>>> after step 3, the ISPN process grows to around 6GB.
>>>>
>>>> See below for the traces, but basically I get OOM.
>>>>
>>>> Using the debugger, I can see that getting the size of the cache is a
>>>> really slow operation, and it bumps memory up to 3GB in the ISPN process.
>>>>
>>>> From org.keycloak.models.sessions.infinispan.initializer.InfinispanCacheInitializer:
>>>>
>>>> RemoteCacheSessionsLoader.computeLoaderContext(KeycloakSession session) {
>>>>     // ...
>>>>     int sessionsTotal = remoteCache.size(); <--- HERE
>>>>     // ...
>>>> }
>>>>
>>>> And then (OOM here):
>>>>
>>>> InfinispanCacheInitializer.startLoadingImpl(InitializerState state, SessionLoader.LoaderContext ctx) {
>>>>     // ...
>>>>     for (Future<WorkerResult> future : futures) {
>>>>         // Called 4x (i.e. the number of segments), but the 1st one does not terminate: OOM
>>>>         // -> org.infinispan.distexec.DefaultExecutorService$LocalDistributedTaskPart
>>>>         WorkerResult result = future.get(); <--- Very slow and bumps mem to 8GB
>>>>         // ...
>>>> }
>>>>
>>>> So, with 1M sessions in the store, I cannot get KC/ISPN to start. And I am
>>>> far from my goal of keeping one year of sessions (estimated at ~52M
>>>> sessions)...
>>>>
>>>> Is this something I can't achieve with KC/ISPN?
>>>> Any help appreciated.
>>>>
>>>> Thanks
>>>> nick
>>>>
>>>> Note, versions I used are:
>>>>
>>>> * ISPN 9.4.1 Wildfly
>>>> * KC 4.6.0 Wildfly
>>>>
>>>> -- ISPN process
>>>>
>>>> 10:42:22,969 ERROR [stderr] (Periodic Recovery) Exception in thread "Periodic Recovery" java.lang.OutOfMemoryError: Java heap space
>>>> 10:42:22,977 ERROR [stderr] (Periodic Recovery)         at sun.text.resources.fr.FormatData_fr.getContents(FormatData_fr.java:86)
>>>> 10:42:22,975 ERROR [org.infinispan.persistence.jdbc.connectionfactory.ManagedConnectionFactory] (pool-9-thread-1) ISPN008018: Sql failure retrieving connection from datasource: java.sql.SQLException: javax.resource.ResourceException: IJ000456: Unchecked throwable in ManagedConnection.getConnection() cl=org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@3e3890ff[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@3f033728 connection handles=0 lastReturned=1542706942971 lastValidated=1542706738387 lastCheckedOut=1542706916019 trackByTx=false pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@2772dcff mcp=SemaphoreConcurrentLinkedQueueManagedConnectionPool@5e82b87d[pool=InfinispanDS] xaResource=LocalXAResourceImpl@308d2c98[connectionListener=3e3890ff connectionManager=1691f7d6 warned=false currentXid=null productName=PostgreSQL productVersion=10.5 jndiName=java:jboss/datasources/InfinispanDS] txSync=null]
>>>>         at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:146)
>>>>         at org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
>>>>         at org.infinispan.persistence.jdbc.connectionfactory.ManagedConnectionFactory.getConnection(ManagedConnectionFactory.java:83)
>>>>         at org.infinispan.persistence.jdbc.stringbased.JdbcStringBasedStore.purge(JdbcStringBasedStore.java:461)
>>>>         at org.infinispan.persistence.manager.PersistenceManagerImpl.lambda$purgeExpired$6(PersistenceManagerImpl.java:459)
>>>>         at java.util.ArrayList.forEach(ArrayList.java:1257)
>>>>         at org.infinispan.persistence.manager.PersistenceManagerImpl.purgeExpired(PersistenceManagerImpl.java:462)
>>>>         at org.infinispan.expiration.impl.ClusterExpirationManager.processExpiration(ClusterExpirationManager.java:119)
>>>>         at org.infinispan.expiration.impl.ExpirationManagerImpl$ScheduledTask.run(ExpirationManagerImpl.java:245)
>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>         at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: javax.resource.ResourceException: IJ000456: Unchecked throwable in ManagedConnection.getConnection() cl=org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@3e3890ff[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@3f033728 connection handles=0 lastReturned=1542706942971 lastValidated=1542706738387 lastCheckedOut=1542706916019 trackByTx=false pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@2772dcff mcp=SemaphoreConcurrentLinkedQueueManagedConnectionPool@5e82b87d[pool=InfinispanDS] xaResource=LocalXAResourceImpl@308d2c98[connectionListener=3e3890ff connectionManager=1691f7d6 warned=false currentXid=null productName=PostgreSQL productVersion=10.5 jndiName=java:jboss/datasources/InfinispanDS] txSync=null]
>>>>         at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:811)
>>>>         at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:138)
>>>>         ... 15 more
>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>
>>>>
>>>> -- KC process
>>>>
>>>> 10:42:23,216 WARN  [org.infinispan.client.hotrod.impl.protocol.Codec21] (Thread-0) ISPN004005: Error received from the server: java.lang.RuntimeException: java.sql.SQLException: Error
>>>> java.sql.SQLException: Error
>>>> java.lang.OutOfMemoryError: Java heap space
>>>> 10:42:23,359 WARN  [org.keycloak.models.sessions.infinispan.remotestore.RemoteCacheSessionsLoader] (pool-16-thread-4) Error loading sessions from remote cache 'sessions' for segment '3':
>>>> org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=55 returned server error (status=0x85): java.lang.RuntimeException: java.sql.SQLException: Error
>>>> java.sql.SQLException: Error
>>>> java.lang.OutOfMemoryError: Java heap space
>>>>         at org.infinispan.client.hotrod.impl.protocol.Codec20.checkForErrorsInResponseStatus(Codec20.java:333)
>>>>         at org.infinispan.client.hotrod.impl.protocol.Codec20.readHeader(Codec20.java:179)
>>>>         at org.infinispan.client.hotrod.impl.transport.netty.HeaderDecoder.decode(HeaderDecoder.java:138)
>>>>         at org.infinispan.client.hotrod.impl.transport.netty.HintedReplayingDecoder.callDecode(HintedReplayingDecoder.java:98)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
>>>>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>>>>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:647)
>>>>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:582)
>>>>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
>>>>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:461)
>>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>         at java.lang.Thread.run(Thread.java:748)
>>>> _______________________________________________
>>>> keycloak-user mailing list
>>>> keycloak-user at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/keycloak-user
>>>>
>>>


