[
https://issues.jboss.org/browse/JGRP-1618?page=com.atlassian.jira.plugin....
]
Bela Ban commented on JGRP-1618:
--------------------------------
Hi Harry,
excellent analysis, you really did dig through the code, didn't you!
However, I'm afraid I don't support 2.8.x or 2.12.x any longer, see [1] for
reasons. I suggest you use this patch in a local build of 2.8.x and upgrade to a more
recent version at some later point in time.
I also don't see any implementations of missingMessageReceived() in master. Besides,
NakReceiverWindow has been replaced by Table and is only used in NAKACK. Note that NAKACK
has been superseded by NAKACK2; see my blog post on the differences.
Had this been a bug that's also present in master or the 3.2 branch, I'd have
fixed it, but xmit_stats was removed some time ago.
[1] https://community.jboss.org/docs/DOC-14454
missingMessageReceived() never called in NAKACK resulting in memory leak
------------------------------------------------------------------------
Key: JGRP-1618
URL: https://issues.jboss.org/browse/JGRP-1618
Project: JGroups
Issue Type: Bug
Affects Versions: 2.8.1, 2.12.2
Environment: Java 6
Reporter: Harry Mark
Assignee: Bela Ban
Fix For: 3.2.9, 3.3
Attachments: NakReceiverWindow.java
We are using JGroups 2.8.1 and encountered a memory leak where it eventually ran out of
CMS Old Gen memory. The heap dump revealed that the problem was in the xmit_stats
ConcurrentHashMap of org.jgroups.protocols.pbcast.NAKACK.
After much analysis here's what we found: when the system is under load, messages
can start arriving out of order. When the receiver receives a higher sequence number than
expected, it requests the sender retransmit the missing messages with the lower sequence
numbers. The sender sends the missing message; however, the bug in the NakReceiverWindow
meant that the missing message was never purged from the Map that tracks missing messages
(xmit_stats) because missingMessageReceived() was never invoked. Over time this Map grows
and starts using up CMS Old Gen; the only way it would get reduced was when a server left
the cluster and the missing messages were purged for that server.
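To illustrate the mechanism described above, here is a minimal sketch of the bookkeeping. It is not the JGroups NakReceiverWindow code: the class RetransmitTrackerSketch, its fields, and the notifyOnMissingReceived flag are hypothetical names made up for this example. Only xmit_stats and missingMessageReceived() come from the report. With the flag set to false, entries for retransmitted messages are never removed, which is the leak; with it set to true, they are purged on arrival.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the retransmission bookkeeping.
// NOT the real NakReceiverWindow/NAKACK implementation.
class RetransmitTrackerSketch {
    // Stand-in for NAKACK's xmit_stats: seqno -> time the retransmit was requested
    private final Map<Long, Long> xmitStats = new ConcurrentHashMap<Long, Long>();
    // false models the buggy 2.8.1/2.12.2 behaviour, true models the fix
    private final boolean notifyOnMissingReceived;
    private long highestSeen = 0;

    RetransmitTrackerSketch(boolean notifyOnMissingReceived) {
        this.notifyOnMissingReceived = notifyOnMissingReceived;
    }

    void receive(long seqno) {
        if (seqno > highestSeen + 1) {
            // Gap detected: record a retransmit request for every skipped seqno
            for (long missing = highestSeen + 1; missing < seqno; missing++) {
                xmitStats.put(missing, System.currentTimeMillis());
            }
        }
        if (seqno > highestSeen) {
            highestSeen = seqno;
        } else if (notifyOnMissingReceived) {
            // A previously missing (retransmitted) message arrived
            missingMessageReceived(seqno);
        }
    }

    // Purges the tracking entry; in the buggy versions this was never invoked
    private void missingMessageReceived(long seqno) {
        xmitStats.remove(seqno);
    }

    int pendingCount() {
        return xmitStats.size();
    }
}
```

Feeding the sequence 1, 4, 2, 3 into an instance created with false leaves two entries behind, while an instance created with true ends with none.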
In JMX, the MissingMsgsReceived attribute of
jgroups:cluster=*,protocol=NAKACK,type=protocol was always zero, confirming that it never
purged any received "missing messages".
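For anyone who wants to check the same attribute programmatically, the sketch below reads MissingMsgsReceived through the standard javax.management API. It assumes the code runs inside the same JVM as JGroups and that the NAKACK MBeans are registered with the platform MBean server; neither is stated in this report. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper; assumes NAKACK MBeans live on the platform MBean server.
class NakackJmxCheckSketch {
    // Returns MBean name -> current MissingMsgsReceived value (empty if none registered)
    static Map<String, Object> readMissingMsgsReceived() {
        try {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            // Object name pattern as quoted in the report
            ObjectName pattern =
                new ObjectName("jgroups:cluster=*,protocol=NAKACK,type=protocol");
            Map<String, Object> result = new LinkedHashMap<String, Object>();
            for (ObjectName name : mbs.queryNames(pattern, null)) {
                result.put(name.getCanonicalName(),
                           mbs.getAttribute(name, "MissingMsgsReceived"));
            }
            return result;
        } catch (JMException e) {
            throw new IllegalStateException("JMX lookup failed", e);
        }
    }
}
```

A value that stays at zero while the cluster is retransmitting under load would match the symptom described here.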
When I looked at the most recent GA version 3.2.8 of NakReceiverWindow.java, it has the
corrected logic that ensures that missingMessageReceived() is called. I checked the
most recent 2.x, which is 2.12.2, and found it also has the same bug in the logic as
2.8.1. This bug may apply to other 2.x but I did not check.
Attached is the fixed NakReceiverWindow.java for 2.8.1.
After applying the patch, the memory leak went away.