[jboss-jira] [JBoss JIRA] Commented: (JGRP-1265) Member can not join cluster after JVM high load

ronald yang (JIRA) jira-events at lists.jboss.org
Thu Jan 6 23:32:17 EST 2011


    [ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574001#comment-12574001 ] 

ronald yang commented on JGRP-1265:
-----------------------------------

I have seen something very similar to this.  I have a load test where every node in a cluster broadcasts to everyone else incessantly, with no wait time.  If I send a SIGSTOP to one process, wait a while, then resume it with a SIGCONT, there's a 50/50 chance it won't be allowed back in.  When that happens, one or more of the other nodes hit an OutOfMemoryError in their NakReceiverWindow (or thereabouts).  I apologize in advance for not producing a self-contained test case, but time hasn't permitted.
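The pause/resume part of this repro can be sketched with plain signals. This is only an illustration of the mechanism, using a dummy process in place of the clustered JVM; the sleep stand-in and the timings are placeholders, not the actual load test:

```shell
#!/bin/sh
# Sketch: freeze a process the way a SIGSTOP (or a long GC / jmap pause)
# would freeze a cluster member, then resume it.
# "sleep 300" is a placeholder for the member's JVM.

sleep 300 &
pid=$!

kill -STOP "$pid"                              # freeze the "member"
sleep 1
state=$(ps -o state= -p "$pid" | tr -d ' ')    # stopped: state begins with T

kill -CONT "$pid"                              # resume; a real member would now try to rejoin
sleep 1
state2=$(ps -o state= -p "$pid" | tr -d ' ')   # running/sleeping again

kill "$pid" 2>/dev/null
```

While the process is in state T, it answers no heartbeats at all, which is exactly the window in which the other members' failure detectors suspect and exclude it.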

> Member can not join cluster after JVM high load
> -----------------------------------------------
>
>                 Key: JGRP-1265
>                 URL: https://issues.jboss.org/browse/JGRP-1265
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.11
>         Environment: linux, kernel 2.6.18
>            Reporter: Victor N
>            Assignee: Bela Ban
>             Fix For: 2.12
>
>         Attachments: jgroups-tcp.xml
>
>
> In our production system I can see that a node disappears from the cluster when its server is heavily loaded. That much is expected, but the node never comes back to the cluster, even after its server returns to normal operation without load. I can easily reproduce the problem in two cases:
> 1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
> Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM - so the other members exclude this node from the view.
> 2) GC (garbage collection) - if the JVM is collecting garbage constantly (and can hardly do any other work)
> In both situations the stuck node never reappears in the cluster (even after 1 hour). Below are more details.
> We have 12 nodes in our cluster; the problematic node is "gate5".
> View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
> View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
> The coordinator (gate11) is sending GET_MBRS_REQ periodically - I see these requests arriving on gate5, but I do NOT see any response to them!
> All JGroups threads are alive, not dead (I took stack traces).
> Another strange thing is that the problematic gate5 sends messages to other nodes and even receives messages from SOME of them! How is that possible? I double-checked that ALL other nodes have view_id=870 (without gate5).
> The only assumption I have is a race condition which occurs (as always) under high load.
> In normal situations, such as a temporary network failure, everything works perfectly - gate5 rejoins the cluster.
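Re-adding an excluded member is the job of the merge protocol in the stack. As a sketch only (the actual attached jgroups-tcp.xml may well differ), a JGroups 2.x TCP stack would typically carry a MERGE2 element like the following; the interval values here are illustrative, not taken from the attachment:

```xml
<!-- Illustrative fragment of a JGroups 2.x TCP stack configuration.
     The min/max merge-check intervals (in ms) are example values only. -->
<MERGE2 min_interval="10000"
        max_interval="30000"/>
```

If gate5 never shows up again even after an hour, the merge either never triggers or repeatedly fails, which would be consistent with gate5 not answering the coordinator's GET_MBRS_REQ discovery requests described above.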

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
