[jboss-jira] [JBoss JIRA] Commented: (JGRP-812) NakReceiverWindow: possible race condition

Bela Ban (JIRA) jira-events at lists.jboss.org
Tue Aug 19 10:53:25 EDT 2008


    [ https://jira.jboss.org/jira/browse/JGRP-812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12425686#action_12425686 ] 

Bela Ban commented on JGRP-812:
-------------------------------

Your analysis looks correct. A simple solution would be to NOT accept any more messages in NAKACK.handleMessage() when either the window for a sender is not found *or* 'leaving' is true. Currently, we only discard the message when *both* conditions are true.

I'll have to investigate this a bit more, as NRW.reset() can be called in multiple cases and we don't want to discard messages unnecessarily...
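
To make the distinction concrete, here is a minimal, self-contained sketch of the guard logic described above. It is an illustration only, not the actual NAKACK code: the class name GuardSketch and the names 'windows' and 'leaving' are stand-ins I am assuming for NAKACK's per-sender window map and its shutdown flag.

    // Hypothetical illustration of the guard change -- NOT the real NAKACK code.
    // 'windows' stands in for NAKACK's map of sender -> NakReceiverWindow,
    // 'leaving' for the flag set once the local member starts shutting down.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class GuardSketch {
        static final Map<String, Object> windows = new ConcurrentHashMap<String, Object>();
        static volatile boolean leaving = false;

        // current behavior: the message is discarded only when BOTH conditions hold
        static boolean acceptCurrent(String sender) {
            return !(windows.get(sender) == null && leaving);
        }

        // proposed behavior: the message is discarded when EITHER condition holds
        static boolean acceptProposed(String sender) {
            return windows.get(sender) != null && !leaving;
        }

        public static void main(String[] args) {
            // the JGRP-812 scenario: the departed member's window is gone,
            // but the local member itself is not leaving
            String departed = "10.129.1.241:9931";
            System.out.println("current  accepts straggler: " + acceptCurrent(departed));  // prints true
            System.out.println("proposed accepts straggler: " + acceptProposed(departed)); // prints false
        }
    }

With the stricter check, a straggler from a member whose window has already been removed is dropped up front instead of being handed to NakReceiverWindow.add().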

> NakReceiverWindow: possible race condition
> ------------------------------------------
>
>                 Key: JGRP-812
>                 URL: https://jira.jboss.org/jira/browse/JGRP-812
>             Project: JGroups
>          Issue Type: Bug
>            Reporter: Bela Ban
>            Assignee: Bela Ban
>             Fix For: 2.7
>
>
> [Randy Watler]
> I just ran into a really rare race condition, apparently in
> NakReceiverWindow, that occurred on a cluster member when another member
> was shut down. I believe what happened here is that the NakReceiverWindow
> for the node that left was reset, but a late message was then received
> from that node with an old sequence number greater than 200000000.
> The window then attempted to create that many NULL_MSG entries in the
> xmit_table and blew out the heap.
> Here is my somewhat circumstantial evidence for this claim. I managed to
> catch a thread adding a NULL_MSG to the xmit_table when the cluster
> should have been generally quiescent:
> "Incoming,cache,10.129.1.202:9931" prio=10 tid=0x000000005dac5800 nid=0x790d runnable [0x00000000412e5000..0x00000000412e5a80]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:883)
>         at org.jgroups.stack.NakReceiverWindow.add(NakReceiverWindow.java:294)
>         at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:802)
>         at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:672)
>         at org.jgroups.protocols.BARRIER.up(BARRIER.java:122)
>         at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:167)
>         at org.jgroups.protocols.FD.up(FD.java:322)
>         at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:300)
>         at org.jgroups.protocols.MERGE2.up(MERGE2.java:145)
>         at org.jgroups.protocols.Discovery.up(Discovery.java:246)
>         at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1535)
>         at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1479)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>         at java.lang.Thread.run(Thread.java:619)
> Here is the related NakReceiverWindow code:
>     287             // Case #4: we received a seqno higher than expected: add NULL_MSG values for missing messages, add to Retransmitter
>     288             if(seqno > next_to_add) {
>     289                 if(listener != null) {
>     290                     try {listener.messageGapDetected(next_to_add, seqno, msg.getSrc());} catch(Throwable t) {}
>     291                 }
>     292 
>     293                 for(long i=next_to_add; i < seqno; i++) { // add all msgs (missing or not) into table
>     294                     xmit_table.put(i, NULL_MSG);
>     295                 }
>     296                 xmit_table.put(seqno, msg);
>     297                 retransmitter.add(old_next, seqno -1);     // BUT: add only null messages to xmitter
>     298                 return true;
>     299             }
> By itself this certainly could be circumstantial, but this thread was
> captured while the JVM heap was being actively consumed and GC was
> taking over the machine resources. The following heap profile was
> captured along with the thread above:
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:      57613383     2765442384  java.util.concurrent.ConcurrentHashMap$HashEntry
>    2:      57578762     1381890288  java.lang.Long
>    3:          1824     1117553680  [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
>    4:        104423       11514120  [C
> Obviously, that is a big and growing ConcurrentHashMap keyed by Long...
> seems like it could be the xmit_table above. We have logging on for
> jgroups protocols. Here is the NAKACK messages log line BEFORE the heap
> started to be consumed:
> cache NAKACK msgs:      10.129.1.241:9931: [214528761 : 214531351 (214531351)]; 10.129.1.131:9931: [487104 : 487111 (487111)]; 10.129.1.203:9931: [130428 : 130429 (130429)]; 10.129.1.202:9931: [123991 : 123994 (123994) (size=3, missing=0, highest stability=123991)]; 10.129.1.132:9931: [487597 : 487601 (487601)]; 10.129.1.201:9931: [147919 : 147923 (147923)]; 10.129.1.133:9931: [487203 : 487209 (487209)]; 10.129.1.134:9931: [487666 : 487668 (487668)]
> and after:
> cache NAKACK msgs:      10.129.1.131:9931: [487126 : 487138 (487138)]; 10.129.1.203:9931: [130433 : 130434 (130434)]; 10.129.1.202:9931: [123997 : 123998 (123998) (size=1, missing=0, highest stability=123997)]; 10.129.1.132:9931: [487619 : 487636 (487636)]; 10.129.1.201:9931: [147936 : 147940 (147940)]; 10.129.1.133:9931: [487222 : 487244 (487244)]; 10.129.1.134:9931: [487678 : 487696 (487696)]
> So, it's pretty clear that the 10.129.1.241 node left the view and its
> last sequence number was approximately 214531351. That number is
> large enough to explain why I was able to catch this behavior, why the
> heap was being consumed, and why roughly 57M entries had already been
> added to some ConcurrentHashMap. Logging in GMS and NAKACK also
> verified the view change:
> 2008-08-12 03:52:51,088 DEBUG  [FD_SOCK pinger,cache,10.129.1.202:9931]: org.jgroups.protocols.FD_SOCK - peer closed socket normally
> 2008-08-12 03:52:51,184 DEBUG  [FD_SOCK pinger,cache,10.129.1.202:9931]: org.jgroups.protocols.FD_SOCK - pinger thread terminated
> 2008-08-12 03:52:51,189 DEBUG  [Incoming,cache,10.129.1.202:9931]: org.jgroups.protocols.pbcast.GMS - view=[10.129.1.131:9931|8] [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931]
> 2008-08-12 03:52:51,193 DEBUG  [Incoming,cache,10.129.1.202:9931]: org.jgroups.protocols.pbcast.GMS - [local_addr=10.129.1.202:9931] view is [10.129.1.131:9931|8] [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931]
> 2008-08-12 03:52:51,194 WARN  [Incoming,cache,10.129.1.202:9931]: org.jgroups.protocols.pbcast.NAKACK - 10.129.1.202:9931] discarded message from non-member 10.129.1.241:9931, my view is [10.129.1.131:9931|7] [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931, 10.129.1.241:9931]
> 2008-08-12 03:52:51,195 INFO  [Incoming,cache,10.129.1.202:9931]: com.getmedium.medium.cache.lateral.endpoint.LateralCacheEndpoint - Current membership for cache group: [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931]
> 2008-08-12 03:52:51,201 DEBUG  [Incoming,cache,10.129.1.202:9931]: org.jgroups.protocols.pbcast.GMS - sending VIEW_ACK to 10.129.1.131:9931
> 2008-08-12 03:52:51,201 DEBUG  [Timer,null,null]: org.jgroups.protocols.FD_SOCK - VIEW_CHANGE received: [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931]
> 2008-08-12 03:52:51,201 WARN  [Incoming,cache,10.129.1.202:9931]: org.jgroups.protocols.pbcast.NAKACK - 10.129.1.202:9931] discarded message from non-member 10.129.1.241:9931, my view is [10.129.1.131:9931|8] [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931]
> 2008-08-12 03:52:51,201 WARN  [Incoming,cache,10.129.1.202:9931]: org.jgroups.protocols.pbcast.NAKACK - 10.129.1.202:9931] discarded message from non-member 10.129.1.241:9931, my view is [10.129.1.131:9931|8] [10.129.1.131:9931, 10.129.1.132:9931, 10.129.1.133:9931, 10.129.1.134:9931, 10.129.1.203:9931, 10.129.1.201:9931, 10.129.1.202:9931]
> 2008-08-12 03:52:51,201 DEBUG  [FD_SOCK pinger,cache,10.129.1.202:9931]: org.jgroups.protocols.FD_SOCK - pinger_thread started
> Upon inspecting the code in NakReceiverWindow, it seems plausible to me
> that a message arrived in add() just after the view was adjusted and
> reset() was called. Given that the sequence numbers are fairly high in
> this case, it's easy to see why that tight loop in add() might attempt
> what it is trying to do here. Does this scenario seem possible as I
> present it? If so, is there a way to bound "case #4" and/or shut down
> add() after a reset?
> Unfortunately, this is the ONLY time I have seen this behavior in over 2
> years of working with JGroups, so I do not expect to be able to
> reproduce it again for another year. I just happened to catch it this
> time around in a state where I was able to diagnose it to some degree
> before the JVM went dark. Please let me know if I can help diagnose this
> further.
> Randy Watler
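
Regarding the question above about bounding "case #4" and/or shutting down add() after a reset(): one possible shape for such a guard, sketched against the snippet quoted above, is shown below. This is only a sketch, not a committed fix; 'running' (a flag that reset()/destroy() would clear) and 'max_gap' (a sanity bound on the seqno gap) are hypothetical fields that do not exist in the current NakReceiverWindow.

    // Hedged sketch only: 'running' and 'max_gap' are hypothetical fields;
    // the remaining identifiers come from the "case #4" snippet quoted above.
    // Case #4 in NakReceiverWindow.add(), with the added guard at the top:
    if(seqno > next_to_add) {
        long gap = seqno - next_to_add;
        if(!running || gap > max_gap) {
            // the window was reset/stopped, or the gap is implausibly large:
            // drop the message instead of filling xmit_table with NULL_MSG entries
            return false;
        }
        if(listener != null) {
            try {listener.messageGapDetected(next_to_add, seqno, msg.getSrc());} catch(Throwable t) {}
        }
        for(long i=next_to_add; i < seqno; i++) {  // add all msgs (missing or not) into table
            xmit_table.put(i, NULL_MSG);
        }
        xmit_table.put(seqno, msg);
        retransmitter.add(old_next, seqno - 1);    // BUT: add only null messages to xmitter
        return true;
    }

Either check alone would have prevented the observed heap blowup: the stale seqno from 10.129.1.241 would have been rejected rather than being expanded into some 200 million NULL_MSG table entries.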

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
