[JBoss JIRA] (WFLY-13514) Huge memory leak in WildFly
by Matěj Novotný (Jira)
[ https://issues.redhat.com/browse/WFLY-13514?page=com.atlassian.jira.plugi... ]
Matěj Novotný commented on WFLY-13514:
--------------------------------------
[~konbk] funny thing is, I cannot reproduce this manually. If I just try to GET that page from a browser (repeatedly, of course), it never goes OOM.
And if you monitor memory usage, it goes up the same way on each request, but once you perform a GC, it drops back down.
From Weld's perspective, we store request-scoped beans in the request's attributes. We do have some caches in place to make things faster, but ultimately it is the request that holds them.
Once our HTTP listener gets notified about the request ending, we proceed to bean destruction (calling pre-destroy callbacks), clear the caches, and detach the bean storage so that no further context manipulation gets propagated to the request.
But we don't actually remove the beans from the request at this point (we never invoke {{HttpServletRequest.removeAttribute(key)}}); we never did. I guess we assume the request is discarded by Undertow once it is over, which seems reasonable, but that is not what is happening with your reproducer.
At least from the heap dump I could see that it is still kept as an attribute under {{attributes io.undertow.servlet.spec.HttpServletRequestImpl @ 0xf878ef48}}, which in turn was in a session that was considered active. Maybe someone from the Undertow team ([~flavia.rainone]?) would know more about how requests are handled in this case and whether {{HttpServletRequestImpl}} should, under some circumstances, survive the end of a request?
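For illustration, a minimal sketch of the cleanup that is *not* happening today, assuming a hypothetical attribute key (Weld's real bean-store key is an internal implementation detail and may differ between versions); this is what an explicit removal at request end would look like:
{code:java}
import javax.servlet.ServletRequestEvent;
import javax.servlet.ServletRequestListener;
import javax.servlet.annotation.WebListener;

@WebListener
public class BeanStoreCleanupListener implements ServletRequestListener {

    // Placeholder key, not Weld's actual attribute name.
    private static final String BEAN_STORE_KEY = "org.example.weld.beanStore";

    @Override
    public void requestDestroyed(ServletRequestEvent sre) {
        // Weld detaches its bean store when the request ends but never calls
        // removeAttribute(); if the request object itself is retained (e.g.
        // by an active session), the store is retained along with it.
        sre.getServletRequest().removeAttribute(BEAN_STORE_KEY);
    }

    @Override
    public void requestInitialized(ServletRequestEvent sre) {
        // no-op
    }
}
{code}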
> Huge memory leak in WildFly
> ---------------------------
>
> Key: WFLY-13514
> URL: https://issues.redhat.com/browse/WFLY-13514
> Project: WildFly
> Issue Type: Bug
> Components: CDI / Weld, JSF
> Affects Versions: 18.0.1.Final, 19.1.0.Final, 20.0.0.Beta1
> Reporter: Konrad Bak
> Assignee: Matěj Novotný
> Priority: Critical
>
> https://github.com/konbk/bug-report-wildfly-oome
> The project shows a huge memory leak introduced in recent WildFly versions that can crash a real-world Java EE application in a matter of hours.
> When a JSF page contains any component bound to a page variable, all CDI beans (regardless of scope) used on that page stay in memory.
--
[JBoss JIRA] (WFLY-13357) (Regression) Execution of concurrent batch jobs containing partitioned steps causes deadlock
by James Perkins (Jira)
[ https://issues.redhat.com/browse/WFLY-13357?page=com.atlassian.jira.plugi... ]
James Perkins updated WFLY-13357:
---------------------------------
Priority: Blocker (was: Major)
> (Regression) Execution of concurrent batch jobs containing partitioned steps causes deadlock
> ------------------------------------------------------------------------------------------
>
> Key: WFLY-13357
> URL: https://issues.redhat.com/browse/WFLY-13357
> Project: WildFly
> Issue Type: Bug
> Components: Batch
> Affects Versions: 19.0.0.Final
> Reporter: Felix König
> Assignee: Cheng Fang
> Priority: Blocker
> Fix For: 20.0.0.Final
>
>
> Hello,
> the issue described in JBERET-180 seems to have reappeared. I am running WildFly 16 with jberet-1.3.3. Given the default batch-thread count of 10, I was able to produce a deadlock by starting 10 instances of a partitioned job simultaneously. No job finishes fast enough to complete before all 10 have been started. All 10 batch threads are stuck here:
> {code}
> "Batch Thread - 1@33537" prio=5 tid=0x109 nid=NA waiting
> java.lang.Thread.State: WAITING
> at jdk.internal.misc.Unsafe.park(Unknown Source:-1)
> at java.util.concurrent.locks.LockSupport.park(Unknown Source:-1)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source:-1)
> at java.util.concurrent.ArrayBlockingQueue.take(Unknown Source:-1)
> at org.jberet.runtime.runner.StepExecutionRunner.beginPartition(StepExecutionRunner.java:350)
> at org.jberet.runtime.runner.StepExecutionRunner.runBatchletOrChunk(StepExecutionRunner.java:222)
> at org.jberet.runtime.runner.StepExecutionRunner.run(StepExecutionRunner.java:144)
> at org.jberet.runtime.runner.CompositeExecutionRunner.runStep(CompositeExecutionRunner.java:164)
> at org.jberet.runtime.runner.CompositeExecutionRunner.runFromHeadOrRestartPoint(CompositeExecutionRunner.java:88)
> at org.jberet.runtime.runner.JobExecutionRunner.run(JobExecutionRunner.java:60)
> at org.wildfly.extension.batch.jberet.deployment.BatchEnvironmentService$WildFlyBatchEnvironment$1.run(BatchEnvironmentService.java:180)
> at org.wildfly.extension.requestcontroller.RequestController$QueuedTask$1.run(RequestController.java:494)
> at org.jberet.spi.JobExecutor$2.run(JobExecutor.java:149)
> at org.jberet.spi.JobExecutor$1.run(JobExecutor.java:99)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source:-1)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source:-1)
> at java.lang.Thread.run(Unknown Source:-1)
> at org.jboss.threads.JBossThread.run(JBossThread.java:485)
> {code}
> which is this line of code:
> {code:java}
> completedPartitionThreads.take();
> {code}
> Rarely, some threads instead get stuck at line 364, which is
> {code:java}
> final Serializable data = collectorDataQueue.take();
> {code}
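> A minimal sketch of how such a reproduction could look, assuming a partitioned job XML named {{partitioned-job}} deployed with the application (the job name and the EJB wrapper are placeholders, not taken from the report): submitting as many jobs as there are batch threads leaves no free thread for the partitions those jobs are waiting on.
> {code:java}
> import java.util.Properties;
> 
> import javax.batch.operations.JobOperator;
> import javax.batch.runtime.BatchRuntime;
> import javax.ejb.Stateless;
> 
> @Stateless
> public class DeadlockTrigger {
> 
>     // Starts one job per batch thread (default pool size 10). Each job's
>     // partitioned step then blocks in beginPartition() waiting for partition
>     // threads that are never scheduled, because the partitions are queued
>     // to the same thread pool that the jobs themselves occupy.
>     public void startJobs() {
>         JobOperator operator = BatchRuntime.getJobOperator();
>         for (int i = 0; i < 10; i++) {
>             operator.start("partitioned-job", new Properties());
>         }
>     }
> }
> {code}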
--
[JBoss JIRA] (WFLY-13357) (Regression) Execution of concurrent batch jobs containing partitioned steps causes deadlock
by James Perkins (Jira)
[ https://issues.redhat.com/browse/WFLY-13357?page=com.atlassian.jira.plugi... ]
James Perkins commented on WFLY-13357:
--------------------------------------
I'm changing this to a blocker for 20.0.0.Final since it's a regression.
> (Regression) Execution of concurrent batch jobs containing partitioned steps causes deadlock
> ------------------------------------------------------------------------------------------
>
> Key: WFLY-13357
> URL: https://issues.redhat.com/browse/WFLY-13357
> Project: WildFly
> Issue Type: Bug
> Components: Batch
> Affects Versions: 19.0.0.Final
> Reporter: Felix König
> Assignee: Cheng Fang
> Priority: Blocker
> Fix For: 20.0.0.Final
>
>
--
[JBoss JIRA] (JGRP-2470) JDBC_PING can face a split-brain issue when restarting a coordinator node
by Masafumi Miura (Jira)
[ https://issues.redhat.com/browse/JGRP-2470?page=com.atlassian.jira.plugin... ]
Masafumi Miura commented on JGRP-2470:
--------------------------------------
bq. What's the rationale for reverting JGRP-2199? If the old coord removes all information, the new one will re-insert it (actually multiple times, if configured)...
The issue (two singleton clusters) happens if the old coordinator removes all information from the table *after* the new coordinator has updated it. Since the updates by the old coordinator and the new coordinator run in parallel, I think there is no way to avoid this ordering issue with the current implementation.
Also, if I understand correctly, the new coordinator can re-insert the updated cluster members on a view change, but no view change happens (until MERGE3 heals the situation) because every node runs as a singleton cluster.
It seems that the periodic findMembers() triggered by MERGE3 can heal the singleton-cluster situation: findMembers() inserts the node's own information when the table is empty. But this only happens after the interval calculated by Math.max(min_interval, Util.random(max_interval) + max_interval/2), where min_interval is 10000 and max_interval is 30000 by default in JBoss EAP, and the actual merge happens after that. So it can take a long time to heal the singleton clusters.
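A quick back-of-the-envelope bound on that interval, as a sketch only, assuming {{Util.random(n)}} returns a value in [1..n]:
{code:java}
public class Merge3IntervalBounds {
    public static void main(String[] args) {
        long minInterval = 10_000;  // ms, EAP default per the comment above
        long maxInterval = 30_000;  // ms, EAP default per the comment above

        // Util.random(maxInterval) lands somewhere in [1..maxInterval], so
        // the computed check interval is roughly 15s to 45s, and the actual
        // merge only runs after it.
        long lowerBound = Math.max(minInterval, 1 + maxInterval / 2);
        long upperBound = Math.max(minInterval, maxInterval + maxInterval / 2);
        System.out.println("check interval in [" + lowerBound + ", " + upperBound + "] ms");
    }
}
{code}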
As I wrote in the private comment, I would like to propose the following two changes:
- Remove the call to removeAll(cluster_name) in JDBC_PING#stop()
- Change remove_old_coords_on_view_change (or remove_all_data_on_view_change) to true by default
I think the latter change can mitigate JGRP-2199.
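For concreteness, a sketch of what the proposed default would mean, expressed as programmatic configuration (the attribute name comes from FILE_PING/JDBC_PING as discussed above; the connection URL is a placeholder, and I am assuming the generic Protocol.setValue(...) of JGroups 4.x):
{code:java}
import org.jgroups.protocols.JDBC_PING;

public class ProposedDefaults {
    public static void main(String[] args) {
        // Equivalent to remove_old_coords_on_view_change="true" in the XML
        // stack: on a view change, entries written by old coordinators are
        // cleaned up by the current coordinator, instead of relying on the
        // old coordinator calling removeAll() in stop().
        JDBC_PING ping = new JDBC_PING();
        ping.setValue("connection_url", "jdbc:postgresql://db-host/jgroups"); // placeholder
        ping.setValue("remove_old_coords_on_view_change", true);
    }
}
{code}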
What do you think?
> JDBC_PING can face a split-brain issue when restarting a coordinator node
> -------------------------------------------------------------------------
>
> Key: JGRP-2470
> URL: https://issues.redhat.com/browse/JGRP-2470
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 4.1.9, 4.0.22
> Reporter: Masafumi Miura
> Assignee: Radoslav Husar
> Priority: Major
> Fix For: 4.2.5
>
>
> After [the change|https://github.com/belaban/JGroups/commit/215cdb6] for JGRP-2199, JDBC_PING deletes all entries from the table during the shutdown of the coordinator node.
> This behavior can cause a split-brain when restarting a coordinator node: in the scenario below, all entries are lost, so the restarting node cannot find any information about the existing nodes in the table and does not form a cluster with them.
> 0. node1 and node2 form a cluster; node1 is the coordinator.
> 1. A restart of node1 is triggered.
> 2. node1 removes its own entry from the table.
> 3. node2 becomes the new coordinator.
> 4. node2 updates its entry in the table.
> 5. node1 clears all entries from the table.
> 6. node1 starts again.
> 7. node1 does not join the existing cluster because there is no node information in the table.
> Note: if step 5 happens before step 4, the split-brain does not occur. However, since steps 4 and 5 run on different nodes, they can happen in parallel and their order is undefined. For example, if the shutdown of node1 takes a long time, it is very likely that this issue is hit.
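> A toy model of the race between steps 4 and 5 (the map stands in for the JDBC_PING table; node names and values are illustrative only):
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> 
> public class SplitBrainRace {
>     public static void main(String[] args) throws InterruptedException {
>         Map<String, String> table = new ConcurrentHashMap<>();
>         table.put("node2", "member");
> 
>         Thread step4 = new Thread(() -> table.put("node2", "coordinator"));
>         Thread step5 = new Thread(table::clear);
> 
>         step4.start();
>         step5.start();
>         step4.join();
>         step5.join();
> 
>         // If clear() ran last, the table is empty and the restarting node1
>         // finds no existing members (step 7); if put() ran last, node1 can
>         // still discover node2 and rejoins the existing cluster.
>         System.out.println("table after the race: " + table);
>     }
> }
> {code}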
--