[JBoss JIRA] (JGRP-1618) missingMessageReceived() never called in NAKACK resulting in memory leak
by Bela Ban (Jira)
[ https://issues.jboss.org/browse/JGRP-1618?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on JGRP-1618:
--------------------------------
I guess this can only be reproduced by running a lot of load (concurrent requests) on the system. Even then, the chances of reproducing this are slim.
Hmm, perhaps if you add a reordering protocol below {{NAKACK}} (e.g. {{SHUFFLE}}), you can reproduce it...
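For illustration, a fragment of a TCP-based stack with {{SHUFFLE}} inserted below {{NAKACK}} could look like the sketch below. Protocols are listed bottom-up in the XML, so {{SHUFFLE}} appears just before {{pbcast.NAKACK}}; the surrounding protocols and attribute values are assumptions based on a typical stack, not a tested configuration:
{noformat}
<config xmlns="urn:org:jgroups">
    <TCP bind_port="7800"/>
    <TCPPING initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}"/>
    <MERGE2/>
    <FD_SOCK/>
    <VERIFY_SUSPECT/>
    <!-- SHUFFLE randomly reorders messages flowing up the stack, simulating
         the out-of-order delivery that triggers retransmission requests -->
    <SHUFFLE up="true" down="false"/>
    <pbcast.NAKACK use_mcast_xmit="false"/>
    <UNICAST/>
    <pbcast.STABLE/>
    <pbcast.GMS/>
</config>
{noformat}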
> missingMessageReceived() never called in NAKACK resulting in memory leak
> -------------------------------------------------------------------------
>
> Key: JGRP-1618
> URL: https://issues.jboss.org/browse/JGRP-1618
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.8.1, 2.12.2
> Environment: Java 6
> Reporter: Harry Mark
> Assignee: Bela Ban
> Priority: Major
> Fix For: 3.2.9, 3.3
>
> Attachments: 1618-jgrp-patch.jar, NakReceiverWindow.java
>
>
> We are using JGroups 2.8.1 and encountered a memory leak that eventually exhausted CMS Old Gen memory. The heap dump revealed that the problem was in the xmit_stats ConcurrentHashMap of org.jgroups.protocols.pbcast.NAKACK.
> After much analysis, here is what we found: when the system is under load, messages can start arriving out of order. When the receiver receives a higher sequence number than expected, it asks the sender to retransmit the missing messages with the lower sequence numbers. The sender retransmits the missing message; however, a bug in NakReceiverWindow meant that the missing message was never purged from the Map that tracks missing messages (xmit_stats), because missingMessageReceived() was never invoked. Over time this Map grows and consumes more and more CMS Old Gen; the only way it shrank was when a server left the cluster and the missing messages recorded for that server were purged.
> In JMX, the MissingMsgsReceived attribute of jgroups:cluster=*,protocol=NAKACK,type=protocol was always zero, confirming that it never purged any received "missing messages".
> When I looked at the most recent GA version, 3.2.8, of NakReceiverWindow.java, it has the corrected logic that ensures missingMessageReceived() is called. I checked the most recent 2.x release, 2.12.2, and found it has the same flawed logic as 2.8.1. This bug may apply to other 2.x versions, but I did not check.
> Attached is the fixed NakReceiverWindow.java for 2.8.1.
> After applying the patch, the memory leak went away.
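For context, here is a minimal, hypothetical sketch of the corrected control flow, assuming a simplified NakReceiverWindow with an invented listener interface. It is not the actual JGroups source, but it shows why a retransmitted message that fills a gap must trigger missingMessageReceived() so the tracking entry can be purged:
{noformat}
// Simplified, hypothetical sketch; field names and the Listener interface
// are illustrative assumptions, not the JGroups 3.2.8 code.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReceiverWindowSketch {

    /** Callback mirroring the missingMessageReceived() hook discussed above. */
    interface Listener {
        void missingMessageReceived(long seqno, Object sender);
    }

    private final ConcurrentMap<Long, Object> received = new ConcurrentHashMap<>();
    private final Listener listener;
    private final Object sender;
    private long highestReceived;   // highest sequence number seen so far

    ReceiverWindowSketch(Listener listener, Object sender) {
        this.listener = listener;
        this.sender = sender;
    }

    /** Adds a message with the given seqno; returns false for duplicates. */
    public synchronized boolean add(long seqno, Object msg) {
        if (received.putIfAbsent(seqno, msg) != null)
            return false;                 // duplicate delivery, ignore
        if (seqno > highestReceived) {
            highestReceived = seqno;      // in-order message, or one that opens a gap
        } else {
            // A lower seqno that is not a duplicate is a retransmission
            // filling a previously reported gap. In the buggy 2.8.1/2.12.2
            // logic this callback was never reached, so NAKACK's xmit_stats
            // entry for the seqno was never removed.
            listener.missingMessageReceived(seqno, sender);
        }
        return true;
    }
}
{noformat}
With this flow in place, the MissingMsgsReceived JMX attribute mentioned above should start incrementing under load, and xmit_stats entries should be purged as gaps are filled.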
[JBoss JIRA] (WFLY-12718) Clustering: replicated-cache sampling errors
by Tommasso Borgato (Jira)
[ https://issues.jboss.org/browse/WFLY-12718?page=com.atlassian.jira.plugin... ]
Tommasso Borgato edited comment on WFLY-12718 at 11/5/19 6:06 AM:
------------------------------------------------------------------
[~rhusar], [~pferraro] maybe this can help: WFLY-12754 is the same kind of error but in different scenario
was (Author: tommaso-borgato):
the kind of error is the same but in different scenarios
> Clustering: replicated-cache sampling errors
> --------------------------------------------
>
> Key: WFLY-12718
> URL: https://issues.jboss.org/browse/WFLY-12718
> Project: WildFly
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 18.0.0.Final
> Reporter: Tommasso Borgato
> Assignee: Paul Ferraro
> Priority: Blocker
>
> The issue is about sampling errors with a replicated cache in fail-over tests.
> WildFly is started in clustered mode, using a replicated cache to replicate HTTP session data across cluster nodes; all 4 nodes in the cluster are initialized with the following CLI script:
> {noformat}
> embed-server --server-config=standalone-ha.xml
> /subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
> /subsystem=infinispan/cache-container=web/replicated-cache=testRepl:add()
> /subsystem=infinispan/cache-container=web/replicated-cache=testRepl/component=locking:write-attribute(name=isolation, value=REPEATABLE_READ)
> /subsystem=infinispan/cache-container=web/replicated-cache=testRepl/component=transaction:write-attribute(name=mode, value=BATCH)
> /subsystem=infinispan/cache-container=web/replicated-cache=testRepl/store=file:add()
> /subsystem=infinispan/cache-container=web:write-attribute(name=default-cache, value=testRepl)
> {noformat}
> The test is run with wildfly-18.0.0.Final.zip; the same tests run with wildfly-17.0.1.Final.zip do not show any problem, hence this looks like a regression.
> As usual, we test that the serial value stored in the replicated cache is incremented on every call: when this is not the case, we count a sampling error.
> Here are the runs that exhibit this issue:
> - **22.82% Fail Rate with WildFly-18** [eap-7.x-clustering-http-session-shutdown-repl#23|https://eap-qe-jenkins.r...]
> - **0% Fail Rate with WildFly-17** [eap-7.x-clustering-http-session-shutdown-repl#24|https://eap-qe-jenkins.r...]
> We also repeated the tests to make sure the failure is reproducible:
> - **22.75% Fail Rate with WildFly-18** [eap-7.x-clustering-http-session-shutdown-repl#26|https://eap-qe-jenkins.r...]
> - **0% Fail Rate with WildFly-17** [eap-7.x-clustering-http-session-shutdown-repl#25|https://eap-qe-jenkins.r...]
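To make the sampling check concrete, here is a hypothetical sketch of the client-side loop the description above implies; the endpoint URL and plain-number response format are invented for illustration, and the real test harness differs:
{noformat}
// Hypothetical client-side sampler: each response is expected to carry a
// serial read from the replicated HTTP session; if the serial did not
// increase by exactly one, we count a sampling error.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SerialSampler {
    public static void main(String[] args) throws Exception {
        long expected = -1;                 // unknown before the first request
        long errors = 0, requests = 0;
        for (int i = 0; i < 1000; i++) {
            HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8080/clusterbench/serial").openConnection(); // invented endpoint
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                long serial = Long.parseLong(in.readLine().trim());
                requests++;
                if (expected >= 0 && serial != expected)
                    errors++;               // session state lost or stale: a sampling error
                expected = serial + 1;      // the next call must observe the increment
            }
        }
        System.out.printf("fail rate: %.2f%%%n", 100.0 * errors / requests);
    }
}
{noformat}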
[JBoss JIRA] (WFLY-12718) Clustering: replicated-cache sampling errors
by Tommasso Borgato (Jira)
[ https://issues.jboss.org/browse/WFLY-12718?page=com.atlassian.jira.plugin... ]
Tommasso Borgato edited comment on WFLY-12718 at 11/5/19 6:06 AM:
------------------------------------------------------------------
[~rhusar], [~pferraro] maybe this can help: WFLY-12754 is the same kind of error but in a different scenario
was (Author: tommaso-borgato):
[~rhusar], [~pferraro] maybe this can help: WFLY-12754 is the same kind of error but in different scenario
[JBoss JIRA] (WFLY-12718) Clustering: replicated-cache sampling errors
by Tommasso Borgato (Jira)
[ https://issues.jboss.org/browse/WFLY-12718?page=com.atlassian.jira.plugin... ]
Tommasso Borgato commented on WFLY-12718:
-----------------------------------------
the kind of error is the same but in different scenarios