[JBoss JIRA] (JGRP-1265) Member can not join cluster after JVM high load
by kostd kostd (JIRA)
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.... ]
kostd kostd commented on JGRP-1265:
-----------------------------------
[~sreenivasgattu], we switched to MERGE3 and have not hit this issue in our production environment since. I think MERGE3 solves it!
> Member can not join cluster after JVM high load
> -----------------------------------------------
>
> Key: JGRP-1265
> URL: https://issues.jboss.org/browse/JGRP-1265
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.11
> Environment: linux, kernel 2.6.18
> Reporter: Victor N
> Assignee: Bela Ban
> Fix For: 2.12
>
> Attachments: jgroups-tcp.xml
>
>
> In our production system I can see that a node disappears from the cluster if its server is heavily loaded. That is OK, but the node never comes back to the cluster, even after its server is working normally again, without load. I can easily reproduce the problem in 2 cases:
> 1) by taking a memory dump on the node: jmap -dump:format=b,file=dump.hprof <pid>
> Since we have 8-16 GB of RAM, this operation takes a long time and blocks the JVM, so the other members exclude this node from the view.
> 2) GC (garbage collection): if the JVM is doing GC constantly (and can do almost no other work)
> In both situations the stuck node never reappears in the cluster (even after 1 h). Below are more details.
> We have 12 nodes in our cluster; the problematic node is "gate5".
> View on gate5: [gate11.mydomain|869] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain, gate5.mydomain]
> View on gate11 (coordinator): [gate11.mydomain|870] [gate11.mydomain, gate2.mydomain, gate6.mydomain, gate7.mydomain, gate12.mydomain, gate4.mydomain, gate3.mydomain, gate10.mydomain, gate8.mydomain, gate9.mydomain, gate14.mydomain]
> The coordinator (gate11) is sending GET_MBRS_REQ periodically, and I see these requests arrive on gate5. But I do NOT see any response to them!
> All jgroups threads are alive, not dead (I took stack traces).
> Another strange thing is that the problematic gate5 sends messages to other nodes and even receives messages from SOME of them! How is this possible? I double-checked that ALL other nodes have view_id=870 (without gate5).
> My only assumption is a race condition, which occurs (as always) under high load.
> In normal situations such as temporary network failure everything works perfectly - gate5 joins the cluster.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
10 years, 1 month
[JBoss JIRA] (JGRP-1265) Member can not join cluster after JVM high load
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on JGRP-1265:
--------------------------------
Yes, MERGE3 fixes all sorts of asymmetric partitions. If you've got a scenario where MERGE3 doesn't work, send me the individual views, so I can investigate, e.g.
{noformat}
A: A|2=AB
B: A|2=AB
C: A|1=ABC
{noformat}
Here, A and B excluded C and installed view {{A|2}}, but C still thinks it is part of view {{A|1}}.
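To make the example concrete, here is a minimal sketch of how the collected views could be compared. It is not part of JGroups: the class {{ViewDivergence}} and its methods are hypothetical, and the views are treated as plain strings in the notation used above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a JGroups class: groups members by the view each one reports.
final class ViewDivergence {

    /** Maps each distinct reported view string to the members that reported it. */
    static Map<String, List<String>> groupByView(Map<String, String> reportedViews) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> e : reportedViews.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    /** True if the members do not all report the same view. */
    static boolean isDiverged(Map<String, String> reportedViews) {
        return groupByView(reportedViews).size() > 1;
    }

    public static void main(String[] args) {
        // The three views from the example above, keyed by the member reporting them.
        Map<String, String> views = new LinkedHashMap<>();
        views.put("A", "A|2=AB");
        views.put("B", "A|2=AB");
        views.put("C", "A|1=ABC");
        System.out.println(groupByView(views));
        System.out.println("diverged: " + isDiverged(views));
    }
}
```

With the three views above, C ends up alone in the {{A|1}} group while A and B share {{A|2}}, which is the asymmetric situation described.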
Also, logs at TRACE level for org.jgroups.protocols.MERGE3 and org.jgroups.protocols.pbcast.GMS would be helpful.
I assume you use a recent version, not one from the stone ages, right? :-)
[JBoss JIRA] (JGRP-1265) Member can not join cluster after JVM high load
by Sreenivas Gattu (JIRA)
[ https://issues.jboss.org/browse/JGRP-1265?page=com.atlassian.jira.plugin.... ]
Sreenivas Gattu commented on JGRP-1265:
---------------------------------------
Does MERGE3 solve the issue of bringing excluded members back into the JGroups cluster? Or does it only help when two partitions have formed, each with its own coordinator?
We have a similar issue where a node gets thrown out of the JGroups membership after a 50+ sec GC pause. The node that underwent the pause still thinks the other members are in its view, but the other members no longer have it in theirs.
[JBoss JIRA] (JGRP-2025) Counter enhancements
by Bela Ban (JIRA)
Bela Ban created JGRP-2025:
------------------------------
Summary: Counter enhancements
Key: JGRP-2025
URL: https://issues.jboss.org/browse/JGRP-2025
Project: JGroups
Issue Type: Feature Request
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 4.1
* Add an option to _not_ use a backup coordinator and instead run a reconciliation round on a view change, in which the new coordinator asks everyone for their counter values. Perhaps use some simple timestamping mechanism to resolve conflicts.
* Add async operations, i.e. operations returning {{CompletableFuture}}s.
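A rough sketch of what the two items could look like, under stated assumptions: {{CounterSketch}}, {{Reply}}, {{reconcile}} and {{incrementAndGetAsync}} are hypothetical names, not existing JGroups API, and "simple timestamping" is read here as last-writer-wins on a per-member timestamp. The counter is local only; a real implementation would route the operation through the cluster.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these types exist in JGroups.
final class CounterSketch {

    /** One member's reply in the reconciliation round: its counter value and last-write timestamp. */
    static final class Reply {
        final long value;
        final long timestamp;

        Reply(long value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Last-writer-wins: the reply with the highest timestamp wins; ties go to the larger value. */
    static long reconcile(List<Reply> replies) {
        if (replies.isEmpty()) {
            throw new IllegalArgumentException("no replies to reconcile");
        }
        Reply best = replies.get(0);
        for (Reply r : replies) {
            if (r.timestamp > best.timestamp
                    || (r.timestamp == best.timestamp && r.value > best.value)) {
                best = r;
            }
        }
        return best.value;
    }

    private final AtomicLong value;

    CounterSketch(long initial) {
        this.value = new AtomicLong(initial);
    }

    /** Async variant of increment: the caller receives a future instead of blocking. */
    CompletableFuture<Long> incrementAndGetAsync() {
        return CompletableFuture.supplyAsync(value::incrementAndGet);
    }
}
```

A new coordinator would call {{reconcile}} on the replies it gathered after the view change and seed the counter with the result; callers could then chain on the returned future (for example with {{thenAccept}}) rather than wait for the round trip.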