[jboss-jira] [JBoss JIRA] Commented: (JGRP-1265) Member can not join cluster after JVM high load
Bela Ban (JIRA)
jira-events at lists.jboss.org
Tue Jan 4 04:46:18 EST 2011
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573133#comment-12573133 ]
Bela Ban commented on JGRP-1265:
--------------------------------
Same with generating CPU load: I generate load for 30 seconds, the node gets excluded from the cluster but later rejoins.
> Member can not join cluster after JVM high load
> -----------------------------------------------
>
> Key: JGRP-1265
> URL: https://issues.jboss.org/browse/JGRP-1265
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.11
> Environment: linux, kernel 2.6.18
> Reporter: Victor N
> Assignee: Bela Ban
> Fix For: 2.12
>
> Attachments: jgroups-tcp.xml
>
>
> In our production system I can see that a node disappears from the cluster if its server was heavily loaded. That is expected, but the node never comes back to the cluster, even after its server is working normally again, without load. I can easily reproduce the problem in 2 cases:
> 1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
> Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM, so the other members exclude this node from the view.
> 2) GC (garbage collection): if the JVM is doing GC constantly (and can do almost no other work)
> In both situations the stuck node never reappears in the cluster (even after 1 h). Below are more details.
> We have 12 nodes in our cluster; the problematic node is "gate5".
> View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
> View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
> The coordinator (gate11) is sending GET_MBRS_REQ periodically, and I see these requests arriving on gate5. But I do NOT see any response to them!
> All jgroups threads are alive, not dead (I took stack traces).
> Another strange thing is that the problematic gate5 sends messages to other nodes and even receives messages from SOME of them! How is this possible? I double-checked that ALL other nodes have view_id=870 (without gate5).
> The only assumption I have is a race condition that occurs (as always) under high load.
> In normal situations, such as a temporary network failure, everything works perfectly: gate5 rejoins the cluster.
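The view mismatch described in the report (gate5 still at view 869 and listing itself, the coordinator at view 870 without gate5) can be checked mechanically. The following is only a minimal sketch in plain Java: it does not use the JGroups API, and the class name ViewDiff and its helper methods are hypothetical names introduced for illustration. It assumes the member lists have been copied from the logs as bracketed, comma-separated strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JGroups: compares two member lists
// copied from log output to find members that one side still lists.
final class ViewDiff {

    // Parses a bracketed list such as "[a, b, c]" into member names.
    static List<String> parseMembers(String bracketed) {
        String inner = bracketed.trim();
        if (inner.startsWith("[")) {
            inner = inner.substring(1);
        }
        if (inner.endsWith("]")) {
            inner = inner.substring(0, inner.length() - 1);
        }
        List<String> members = new ArrayList<>();
        for (String part : inner.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                members.add(name);
            }
        }
        return members;
    }

    // Returns the members present in 'local' but absent from 'coordinator'.
    static List<String> onlyInLocal(List<String> local, List<String> coordinator) {
        List<String> result = new ArrayList<>(local);
        result.removeAll(coordinator);
        return result;
    }

    public static void main(String[] args) {
        // Member lists taken from the two views quoted in the report.
        List<String> gate5View = parseMembers(
            "[gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, "
            + "gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, "
            + "gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]");
        List<String> gate11View = parseMembers(
            "[gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, "
            + "gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, "
            + "gate8.mydomain, gate9.mydomain, gate14.mydomain]");

        System.out.println("Only in gate5's view: " + onlyInLocal(gate5View, gate11View));
    }
}
```

Run against the two views above, the only difference reported should be gate5 itself, which matches the description: every other member has moved on to the newer view, while gate5 still holds the older one.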
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira