[jboss-jira] [JBoss JIRA] (JGRP-1265) Member can not join cluster after JVM high load
kostd kostd (JIRA)
issues at jboss.org
Wed Feb 3 10:15:00 EST 2016
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158312#comment-13158312 ]
kostd kostd commented on JGRP-1265:
-----------------------------------
[~belaban], thank u.
I found your tcp stack (https://github.com/belaban/JGroups/blob/master/conf/tcp.xml), can see diff between my (tcp-for-l2), will use your stack.
Hope that MERGE3 helps
> Member can not join cluster after JVM high load
> -----------------------------------------------
>
> Key: JGRP-1265
> URL: https://issues.jboss.org/browse/JGRP-1265
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.11
> Environment: linux, kernel 2.6.18
> Reporter: Victor N
> Assignee: Bela Ban
> Fix For: 2.12
>
> Attachments: jgroups-tcp.xml
>
>
> In our production system I can see that a node desappers from the cluster if its server was heavily-loaded. It's OK, but the node never comes back to the cluster even after its server is working normally, without load. I can easily reproduce the problem in 2 cases:
> 1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
> Since we have 8-16 GB of RAM, this operation takes much time and blocks JVM - so other members exclude this node from View.
> 2) GC (garbage collection) - if JVM is doing GC constantly (and almost can not work)
> In both situations the stuck node never reappears in the cluster (even after 1 h). Below are more details.
> We have 12 nodes in our cluster, we problematic node is "gate5".
> View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
> View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
> The coordinator (gate11) is sending GET_MBRS_REQ periodically - I see them in gate5. But I do NOT see response to this request!
> All jgroups threads are alive, not dead (I took stack traces).
> Another strange thing is that the problematic gate5 sends messages to other nodes and even receives messages from SOME of them! How is it possible - I double-checked that ALL other nodes have view_id=870 (without gate5)?
> The only assumption I have is race-conditions which occurs (as always) under high load.
> In normal situations such as temporary network failure everything works perfectly - gate5 joins the cluster.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
More information about the jboss-jira
mailing list