[jboss-jira] [JBoss JIRA] (JGRP-1265) Member can not join cluster after JVM high load
kostd kostd (JIRA)
issues at jboss.org
Mon Feb 1 08:32:00 EST 2016
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157064#comment-13157064 ]
kostd kostd commented on JGRP-1265:
-----------------------------------
We have seen this issue in our customer's production environment.
Affects version: 3.4.5
Node environment: Linux RHEL 2.6.32-431.1.2.el6.x86_64, WildFly 8.2.0.Final, Hibernate 4.3.7.Final, Infinispan 6.0.2.Final, JGroups 3.4.5.Final
Each node's JVM has a 40 GB heap, and while a heap dump is being created on the coordinator node (the issue is not reproduced after a full GC), the second node receives a truncated cluster view message.
After the dump is created, no new, fully rebuilt cluster view message arrives, so we end up in a state where one node thinks both nodes are in the cluster, while the other thinks it is the only member.
In the logs from both nodes' hosts we can see that some ISPN000094 messages are missing:
{code}
N1 -- the first node of the cluster, ip1
N2 -- the second one, ip2
heap on each node is 40 GB
dump creation takes about 30 s
12:xx N1 started first and isCoordinator at startup moment
13:5x N2 begins starting
//both nodes can see each other:
13:59:42,354 INFO N1 [JGroupsTransport] (Incoming-16,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip1[0]/hibernate|3] (2) [ip1[0]/hibernate, ip2[0]/hibernate]
13:59:42,422 INFO N2 [JGroupsTransport] (ServerService Thread Pool -- 57) ISPN000094: Received new cluster view: [ip1[0]/hibernate|3] (2) [ip1[0]/hibernate, ip2[0]/hibernate]
14:07 N2 started
14:30 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
14:31:20,390 INFO N2 [JGroupsTransport] (Incoming-4,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip2[0]/hibernate|4] (1) [ip2[0]/hibernate]
// hibernate|5 is missing!!! Why? Was the dump created during state transfer?
15:01 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
15:01:21,928 INFO N2 [JGroupsTransport] (Incoming-6,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip2[0]/hibernate|6] (1) [ip2[0]/hibernate]
// hibernate|7 is missing!!!
19:25 heap-dump on N1!!!
//N2 receives a truncated cluster view while N1 is creating the heap dump
19:26:16,928 INFO N2 [JGroupsTransport] (Incoming-19,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip2[0]/hibernate|8] (1) [ip2[0]/hibernate]
//hibernate|9 is missing!!!
19:42:31,221 INFO N1 [JGroupsTransport] (Incoming-3,shared=tcp-for-l2) ISPN000094: Received new cluster view: [ip1[0]/hibernate|10] (1) [ip1[0]/hibernate]
{code}
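A ~30 s stop-the-world pause (heap dump or full GC) is long enough for a heartbeat-based failure detector such as FD to suspect and exclude the paused node, which would produce exactly this kind of single-member view on the surviving side. A minimal illustrative model of that mechanism (not JGroups code; the timeout and max_tries values are assumptions for illustration, not this stack's actual FD settings):

```java
// Illustrative model: a heartbeat-based failure detector like JGroups FD
// suspects a peer after max_tries consecutive missed heartbeats, each
// awaited for `timeout` ms. A JVM pause longer than timeout * max_tries
// therefore gets the paused node suspected and removed from the view.
public class PauseSuspicionModel {

    /** True if a pause of the given length exceeds the detector's budget. */
    static boolean isSuspected(long pauseMillis, long timeoutMillis, int maxTries) {
        return pauseMillis > timeoutMillis * (long) maxTries;
    }

    public static void main(String[] args) {
        long dumpPause = 30_000; // ~30 s heap-dump pause observed on N1
        long timeout = 3_000;    // assumed FD timeout (ms), illustrative only
        int maxTries = 5;        // assumed FD max_tries, illustrative only
        System.out.println("suspected during dump: "
                + isSuspected(dumpPause, timeout, maxTries));
    }
}
```

Under these assumed settings the detection budget is 3 s × 5 = 15 s, so a 30 s pause gets the node suspected; the real thresholds depend on the FD/FD_ALL configuration actually in effect.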
{code:title=configuration}
<!-- hibernate-l2 cache -->
<cache-container name="hibernate" default-cache="local-query" module="org.hibernate">
<transport lock-timeout="60000" stack="tcp-for-l2"/>
<local-cache name="local-query">
<transaction mode="NONE" locking="OPTIMISTIC"/>
<eviction strategy="LIRS" max-entries="500000"/>
<expiration max-idle="3600000" lifespan="3600000" interval="60000"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</local-cache>
<invalidation-cache name="entity" mode="SYNC">
<transaction mode="NON_XA" locking="OPTIMISTIC"/>
<eviction strategy="LIRS" max-entries="500000"/>
<expiration max-idle="3600000" lifespan="3600000" interval="60000"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</invalidation-cache>
<replicated-cache name="timestamps" mode="ASYNC">
<transaction mode="NONE" locking="OPTIMISTIC"/>
<eviction strategy="NONE"/>
<!-- TASK-64293 -->
<locking isolation="READ_COMMITTED"/>
</replicated-cache>
</cache-container>
<!-- jgroups subsystem -->
<subsystem xmlns="urn:jboss:domain:jgroups:2.0" default-stack="udp">
<stack name="udp">
<transport type="UDP" socket-binding="jgroups-udp"/>
<protocol type="PING"/>
<protocol type="MERGE3"/>
<protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
<protocol type="FD_ALL"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="UFC"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
<stack name="tcp-for-l2">
<transport type="TCP" socket-binding="jgroups-tcp"/>
<protocol type="TCPPING">
<property name="initial_hosts">${argus.jgroups-l2.tcpping.initial_hosts}</property>
<property name="port_range">0</property>
</protocol>
<protocol type="MERGE2"/>
<protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
<protocol type="FD"/>
<protocol type="VERIFY_SUSPECT"/>
<protocol type="pbcast.NAKACK2"/>
<protocol type="UNICAST3"/>
<protocol type="pbcast.STABLE"/>
<protocol type="pbcast.GMS"/>
<protocol type="MFC"/>
<protocol type="FRAG2"/>
<protocol type="RSVP"/>
</stack>
</subsystem>
{code}
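If the goal is to keep a node in the view across a ~30 s stop-the-world pause, one option is to make the failure detector on the tcp-for-l2 stack more tolerant. A hypothetical adjustment (the property values below are illustrative, not tested recommendations for this deployment):

```xml
<!-- Hypothetical FD tuning for the tcp-for-l2 stack: allow roughly
     50 s of missed heartbeats (timeout * max_tries) before suspecting
     a peer, so a ~30 s heap-dump pause does not evict the node.
     Values are illustrative only. -->
<protocol type="FD">
    <property name="timeout">10000</property>
    <property name="max_tries">5</property>
</protocol>
```

The trade-off is slower detection of genuinely dead members, so any such values need to be balanced against the required failover time.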
> Member can not join cluster after JVM high load
> -----------------------------------------------
>
> Key: JGRP-1265
> URL: https://issues.jboss.org/browse/JGRP-1265
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.11
> Environment: linux, kernel 2.6.18
> Reporter: Victor N
> Assignee: Bela Ban
> Fix For: 2.12
>
> Attachments: jgroups-tcp.xml
>
>
> In our production system I can see that a node disappears from the cluster if its server was heavily loaded. That is OK, but the node never comes back to the cluster, even after its server returns to normal, unloaded operation. I can easily reproduce the problem in 2 cases:
> 1) by taking a heap dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
> Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM, so other members exclude this node from the view.
> 2) GC (garbage collection) - if the JVM is doing GC constantly (and can hardly do any work)
> In both situations the stuck node never reappears in the cluster (even after 1 hour). Below are more details.
> We have 12 nodes in our cluster; the problematic node is "gate5".
> View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
> View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
> The coordinator (gate11) sends GET_MBRS_REQ periodically - I see these requests on gate5, but I do NOT see any response to them!
> All JGroups threads are alive, not dead (I took stack traces).
> Another strange thing is that the problematic gate5 sends messages to other nodes and even receives messages from SOME of them. How is that possible, given that I double-checked that ALL other nodes have view_id=870 (without gate5)?
> The only explanation I have is a race condition which occurs (as always) under high load.
> In normal situations, such as a temporary network failure, everything works perfectly - gate5 rejoins the cluster.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)