[jboss-jira] [JBoss JIRA] (JGRP-1265) Member can not join cluster after JVM high load
kostd kostd (JIRA)
issues at jboss.org
Mon Feb 1 08:32:00 EST 2016
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157064#comment-13157064 ]
kostd kostd commented on JGRP-1265:
-----------------------------------
We have seen this issue in our customer's production environment.
Affects version: 3.4.5
Node environment: Linux RHEL 2.6.32-431.1.2.el6.x86_64, WildFly 8.2.0.Final, Hibernate 4.3.7.Final, Infinispan 6.0.2.Final, JGroups 3.4.5.Final
Each node's JVM has a 40 GB heap, and while a heap dump is being created on the coordinator node (the issue is not reproduced after a full GC), the second node receives a truncated cluster view message.
After the dump is created, no new, fully rebuilt cluster view message arrives, so we end up in a state where one node thinks both nodes are in the cluster, while the other thinks it is the only member.
In the logs from both nodes' hosts we can see that some ISPN000094 messages are missing:
{code}
N1 -- the first node of the cluster, ip1
N2 -- the second one, ip2
heap on each node is 40 GB
dump creation takes about 30 s
12:xx N1 started first and isCoordinator at startup moment
13:5x N2 begins starting
//both nodes can see each other:
13:59:42,354 INFO N1 [JGroupsTransport] (Incoming-16,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip1[0]/hibernate|3] (2) [ip1[0]/hibernate, ip2[0]/hibernate]
13:59:42,422 INFO N2 [JGroupsTransport] (ServerService Thread Pool -- 57) ISPN000094: Received new cluster view: [ip1[0]/hibernate|3] (2) [ip1[0]/hibernate, ip2[0]/hibernate]
14:07 N2 started
14:30 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
14:31:20,390 INFO N2 [JGroupsTransport] (Incoming-4,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip2[0]/hibernate|4] (1) [ip2[0]/hibernate]
// hibernate|5 is missing!!! Why? Was the dump created during state transfer?
15:01 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
15:01:21,928 INFO N2 [JGroupsTransport] (Incoming-6,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip2[0]/hibernate|6] (1) [ip2[0]/hibernate]
// hibernate|7 is missing!!!
19:25 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
19:26:16,928 INFO N2 [JGroupsTransport] (Incoming-19,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip2[0]/hibernate|8] (1) [ip2[0]/hibernate]
//hibernate|9 is missing!!!
19:42:31,221 INFO N1 [JGroupsTransport] (Incoming-3,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip1[0]/hibernate|10] (1) [ip1[0]/hibernate]
{code}
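A ~30 s stop-the-world pause (heap dump or full GC) is long enough for a heartbeat-based failure detector such as FD to suspect and exclude the paused node, which would produce exactly this kind of single-member view on the surviving side. A minimal illustrative model of that mechanism (not JGroups code; the timeout and max_tries values are assumptions for illustration, not this stack's actual FD settings):

```java
// Illustrative model: a heartbeat-based failure detector like JGroups FD
// suspects a peer after max_tries consecutive missed heartbeats, each
// awaited for `timeout` ms. A JVM pause longer than timeout * max_tries
// therefore gets the paused node suspected and removed from the view.
public class PauseSuspicionModel {

    /** True if a pause of the given length exceeds the detector's budget. */
    static boolean isSuspected(long pauseMillis, long timeoutMillis, int maxTries) {
        return pauseMillis > timeoutMillis * (long) maxTries;
    }

    public static void main(String[] args) {
        long dumpPause = 30_000; // ~30 s heap-dump pause observed on N1
        long timeout = 3_000;    // assumed FD timeout (ms), illustrative only
        int maxTries = 5;        // assumed FD max_tries, illustrative only
        System.out.println("suspected during dump: "
                + isSuspected(dumpPause, timeout, maxTries));
    }
}
```

Under these assumed settings the detection budget is 3 s × 5 = 15 s, so a 30 s pause gets the node suspected; the real thresholds depend on the FD/FD_ALL configuration actually in effect.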
{code:title=configuration}
<!-- hibernate-l2 cache -->
<cache-container name="hibernate" default-cache="local-query" module="org.hibernate">
<transport lock-timeout="60000" stack="tcp-for-l2"/>
<local-cache name="local-query">
<transaction mode="NONE" locking="OPTIMISTIC"/>
<eviction strategy="LIRS" max-entries="500000"/>
<expiration max-idle="3600000" lifespan="3600000" interval="60000"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</local-cache>
<invalidation-cache name="entity" mode="SYNC">
<transaction mode="NON_XA" locking="OPTIMISTIC"/>
<eviction strategy="LIRS" max-entries="500000"/>
<expiration max-idle="3600000" lifespan="3600000" interval="60000"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</invalidation-cache>
<replicated-cache name="timestamps" mode="ASYNC">
<transaction mode="NONE" locking="OPTIMISTIC"/>
<eviction strategy="NONE"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</replicated-cache>
</cache-container>
<!-- jgroups subsystem -->
<subsystem xmlns="urn:jboss:domain:jgroups:2.0" default-stack="udp">
<stack name="udp">
<transport type="UDP" socket-binding="jgroups-udp"/>
<protocol type="PING"/>
<protocol type="MERGE3"/>
<protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
<protocol type="FD_ALL"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="UFC"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
<stack name="tcp-for-l2">
<transport type="TCP" socket-binding="jgroups-tcp"/>
<protocol type="TCPPING">
<property name="initial_hosts">${argus.jgroups-l2.tcpping.initial_hosts}</property>
<property name="port_range">0</property>
</protocol>
<protocol type="MERGE2"/>
<protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
<protocol type="FD"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
</subsystem>
{code}
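If the goal is to keep a node in the view across a ~30 s stop-the-world pause, one option is to make the failure detector on the tcp-for-l2 stack more tolerant. A hypothetical adjustment (the property values below are illustrative, not tested recommendations for this deployment):

```xml
<!-- Hypothetical FD tuning for the tcp-for-l2 stack: allow roughly
     50 s of missed heartbeats (timeout * max_tries) before suspecting
     a peer, so a ~30 s heap-dump pause does not evict the node.
     Values are illustrative only. -->
<protocol type="FD">
    <property name="timeout">10000</property>
    <property name="max_tries">5</property>
</protocol>
```

The trade-off is slower detection of genuinely dead members, so any such values need to be balanced against the required failover time.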
> Member can not join cluster after JVM high load
> -----------------------------------------------
>
> Key: JGRP-1265
> URL: https://issues.jboss.org/browse/JGRP-1265
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.11
> Environment: linux, kernel 2.6.18
> Reporter: Victor N
> Assignee: Bela Ban
> Fix For: 2.12
>
> Attachments: jgroups-tcp.xml
>
>
> In our production system I can see that a node disappears from the cluster if its server was heavily loaded. That is OK, but the node never comes back to the cluster, even after its server returns to normal, unloaded operation. I can easily reproduce the problem in 2 cases:
> 1) by taking a heap dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
> Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM, so other members exclude this node from the view.
> 2) GC (garbage collection) - if the JVM is doing GC constantly (and can hardly do any work)
> In both situations the stuck node never reappears in the cluster (even after 1 hour). Below are more details.
> We have 12 nodes in our cluster; the problematic node is "gate5".
> View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
> View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
> The coordinator (gate11) sends GET_MBRS_REQ periodically - I see these requests on gate5, but I do NOT see any response to them!
> All JGroups threads are alive, not dead (I took stack traces).
> Another strange thing is that the problematic gate5 sends messages to other nodes and even receives messages from SOME of them. How is that possible, given that I double-checked that ALL other nodes have view_id=870 (without gate5)?
> The only explanation I have is a race condition which occurs (as always) under high load.
> In normal situations, such as a temporary network failure, everything works perfectly - gate5 rejoins the cluster.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)