[
https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin....
]
kostd kostd commented on JGRP-1265:
-----------------------------------
We have seen this issue in our customer's production environment.
Affects version: 3.4.5
Node environment: Linux RHEL 2.6.32-431.1.2.el6.x86_64, WildFly 8.2.0.Final, Hibernate
4.3.7.Final, Infinispan 6.0.2.Final, JGroups 3.4.5.Final
Each node's JVM has a 40 GB heap. While a heap dump is being created on the coordinator
node (the issue is not reproduced after a full GC), the second node receives a truncated
cluster view message. After the dump is created, no new, fully rebuilt cluster view
message arrives, so we end up in a state where one node thinks both nodes are in the
cluster, while the other node thinks it is the only member.
In the logs from both nodes' hosts we can see that some ISPN000094 messages are missing:
{code}
N1 -- the first node of cluster, ip1
N2 -- the second one, ip2
heap on each node is 40Gb
dump creation takes about ~30s.
12:xx N1 started first and isCoordinator at startup moment
13:5x N2 begin starting
//both nodes can see each other:
13:59:42,354 INFO N1 [JGroupsTransport] (Incoming-16,shared=tcp-for-l2) ISPN000094:
Received new cluster view: [ip1[0]/hibernate|3] (2) [ip1[0]/hibernate, ip2[0]/hibernate]
13:59:42,422 INFO N2 [JGroupsTransport] (ServerService Thread Pool -- 57) ISPN000094:
Received new cluster view: [ip1[0]/hibernate|3] (2) [ip1[0]/hibernate, ip2[0]/hibernate]
14:07 N2 started
14:30 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
14:31:20,390 INFO N2 [JGroupsTransport] (Incoming-4,shared=tcp-for-l2) ISPN000094:
Received new cluster view: [ip2[0]/hibernate|4] (1) [ip2[0]/hibernate]
// hibernate|5 is missing!!! Why? Was the dump created during a state transfer?
15:01 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
15:01:21,928 INFO N2 [JGroupsTransport] (Incoming-6,shared=tcp-for-l2) ISPN000094:
Received new cluster view: [ip2[0]/hibernate|6] (1) [ip2[0]/hibernate]
// hibernate|7 is missing!!!
19:25 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
19:26:16,928 INFO N2 [JGroupsTransport] (Incoming-19,shared=tcp-for-l2) ISPN000094:
Received new cluster view: [ip2[0]/hibernate|8] (1) [ip2[0]/hibernate]
//hibernate|9 is missing!!!
19:42:31,221 INFO N1 [JGroupsTransport] (Incoming-3,shared=tcp-for-l2) ISPN000094:
Received new cluster view: [ip1[0]/hibernate|10] (1) [ip1[0]/hibernate]
{code}
{code:title=configuration}
-- hibernate-l2 cache:
<cache-container name="hibernate"
default-cache="local-query" module="org.hibernate">
<transport lock-timeout="60000"
stack="tcp-for-l2"/>
<local-cache name="local-query">
<transaction mode="NONE"
locking="OPTIMISTIC"/>
<eviction strategy="LIRS" max-entries="500000"/>
<expiration max-idle="3600000" lifespan="3600000"
interval="60000"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</local-cache>
<invalidation-cache name="entity"
mode="SYNC">
<transaction mode="NON_XA"
locking="OPTIMISTIC"/>
<eviction strategy="LIRS" max-entries="500000"/>
<expiration max-idle="3600000" lifespan="3600000"
interval="60000"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</invalidation-cache>
<replicated-cache name="timestamps"
mode="ASYNC">
<transaction mode="NONE"
locking="OPTIMISTIC"/>
<eviction strategy="NONE"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</replicated-cache>
</cache-container>
-- jgroups subsystem:
<subsystem xmlns="urn:jboss:domain:jgroups:2.0"
default-stack="udp">
<stack name="udp">
<transport type="UDP"
socket-binding="jgroups-udp"/>
<protocol type="PING"/>
<protocol type="MERGE3"/>
<protocol type="FD_SOCK"
socket-binding="jgroups-udp-fd"/>
<protocol type="FD_ALL"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="UFC"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
<stack name="tcp-for-l2">
<transport type="TCP"
socket-binding="jgroups-tcp"/>
<protocol type="TCPPING">
<property
name="initial_hosts">${argus.jgroups-l2.tcpping.initial_hosts}</property>
<property name="port_range">0</property>
</protocol>
<protocol type="MERGE2"/>
<protocol type="FD_SOCK"
socket-binding="jgroups-tcp-fd"/>
<protocol type="FD"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
</subsystem>
{code}
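One thing that stands out in the tcp-for-l2 stack is that the failure-detection protocols are left at their defaults, which tolerate far less silence than the ~30 s stop-the-world pause a 40 GB heap dump causes. As a sketch only (the property names are standard JGroups FD/VERIFY_SUSPECT properties, but the values below are assumptions that would need tuning to the actual pause length), raising the timeouts should keep a paused node from being suspected and excluded in the first place:
{code:title=failure-detection tuning (sketch)}
<stack name="tcp-for-l2">
    ...
    <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
    <!-- Assumption: tolerate roughly 60s of silence (3s timeout x 20 tries)
         so a node paused ~30s by a heap dump or full GC is not suspected. -->
    <protocol type="FD">
        <property name="timeout">3000</property>
        <property name="max_tries">20</property>
    </protocol>
    <!-- Give a suspected member longer to answer the verification ping. -->
    <protocol type="VERIFY_SUSPECT">
        <property name="timeout">5000</property>
    </protocol>
    ...
</stack>
{code}
This only delays suspicion rather than fixing the missing re-merge, but it would prevent a dump-length pause from splitting the view at all.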
Member can not join cluster after JVM high load
-----------------------------------------------
Key: JGRP-1265
URL:
https://issues.jboss.org/browse/JGRP-1265
Project: JGroups
Issue Type: Bug
Affects Versions: 2.11
Environment: linux, kernel 2.6.18
Reporter: Victor N
Assignee: Bela Ban
Fix For: 2.12
Attachments: jgroups-tcp.xml
In our production system I can see that a node disappears from the cluster if its server
is heavily loaded. That is expected, but the node never comes back to the cluster even
after its server returns to normal operation, without load. I can easily reproduce the
problem in two cases:
1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM, so the
other members exclude this node from the view.
2) GC (garbage collection) - if the JVM is doing GC constantly (and can hardly do any work)
In both situations the stuck node never reappears in the cluster (even after 1 hour).
More details below.
We have 12 nodes in our cluster; the problematic node is "gate5".
View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain,
gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain,
gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain,
gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain,
gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
The coordinator (gate11) is sending GET_MBRS_REQ periodically - I see the requests on
gate5. But I do NOT see any response to them!
All JGroups threads are alive, not dead (I took stack traces).
Another strange thing is that the problematic gate5 sends messages to other nodes and
even receives messages from SOME of them! How is this possible, given that I
double-checked that ALL other nodes have view_id=870 (without gate5)?
The only explanation I can think of is a race condition which occurs (as always) under
high load. In normal situations, such as a temporary network failure, everything works
perfectly - gate5 rejoins the cluster.
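Since gate5 is suspected and excluded rather than leaving cleanly, rejoining depends on the merge protocol (MERGE2 in the attached tcp stack) noticing the divergent views and installing a merged view. As a hedged sketch (min_interval and max_interval are real MERGE2 properties, but the values below are assumptions), lowering the merge discovery interval makes a split heal faster, which at least narrows down whether a merge is never attempted or is attempted and failing:
{code:title=MERGE2 tuning (sketch)}
<protocol type="MERGE2">
    <!-- Assumption: probe for diverging views every 10-30s instead of the
         much longer default upper bound, so a wrongly excluded member
         is merged back sooner. -->
    <property name="min_interval">10000</property>
    <property name="max_interval">30000</property>
</protocol>
{code}
If the node still never reappears even with frequent merge attempts, that would support the race-condition theory above rather than a merely slow merge.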
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)