Victor N updated JGRP-1265:
---------------------------
Summary: Member can not join cluster after JVM high load (was: Member
can not join cluster after heavy load)
Workaround Description: disconnect from JChannel and connect again. (was: Workaround:
disconnect from JChannel and connect again.)
Steps to Reproduce: Start several nodes (>3 I think), set Xmx for JVM to 8 or
even 16 Gbytes, then use jmap tool (mentioned above) to take a memory dump. You should set
config accordingly - so that other nodes will update their view while taking the dump. In
my tests the problematic node where I did tests was NOT coordinator. (was: Steps to
reproduce:
Start several nodes (>3 I think), set Xmx for JVM to 8 or even 16 Gbytes, then use jmap
tool (mentioned above) to take a memory dump. You should set config accordingly - so that
other nodes will update their view while taking the dump. In my tests the problematic node
where I did tests was NOT coordinator.)
Member can not join cluster after JVM high load
-----------------------------------------------
Key: JGRP-1265
URL: https://issues.jboss.org/browse/JGRP-1265
Project: JGroups
Issue Type: Bug
Affects Versions: 2.11
Environment: linux, kernel 2.6.18
Reporter: Victor N
Assignee: Bela Ban
In our production system I can see that a node disappears from the cluster if its server
was heavily loaded. That is expected, but the node never comes back to the cluster even
after its server returns to working normally, without load. I can easily reproduce the
problem in 2 cases:
1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM - so
other members exclude this node from the View.
2) GC (garbage collection) - if the JVM is doing GC constantly (and can barely do any work)
In both situations the stuck node never reappears in the cluster (even after 1 hour).
Below are more details.
We have 12 nodes in our cluster; the problematic node is "gate5".
View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain,
gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain,
gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain,
gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain,
gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
The coordinator (gate11) is sending GET_MBRS_REQ periodically - I see these requests
arriving on gate5. But I do NOT see any response to them!
All jgroups threads are alive, not dead (I took stack traces).
Another strange thing is that the problematic gate5 sends messages to other nodes and
even receives messages from SOME of them! How is this possible, given that I
double-checked that ALL other nodes have view_id=870 (without gate5)?
My only guess is a race condition which occurs (as always) under high load.
In normal situations, such as a temporary network failure, everything works perfectly and
gate5 rejoins the cluster.
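The workaround recorded above (disconnect from the JChannel and connect again) can be
sketched roughly as follows. This is only an illustrative sketch, not the reporter's
actual code: the config file name ("udp.xml") and cluster name ("mycluster") are
placeholders, not values from the real deployment.

```java
import org.jgroups.JChannel;

public class RejoinWorkaround {

    // Sketch of the workaround: force a stuck member back into the cluster
    // by leaving the current (stale) view and performing a fresh JOIN.
    public static void rejoin(JChannel channel, String clusterName) throws Exception {
        channel.disconnect();          // leave the cluster, dropping the old view
        channel.connect(clusterName);  // rejoin via a new JOIN request to the coordinator
    }

    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("udp.xml"); // placeholder protocol stack config
        ch.connect("mycluster");               // placeholder cluster name
        // ... member gets excluded from the view after a long JVM pause ...
        rejoin(ch, "mycluster");
        ch.close();
    }
}
```

The point of the sketch is that `disconnect()` followed by `connect()` restarts the
membership handshake from scratch, which is why it recovers a node that the normal
rediscovery path never brings back.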
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: