[
https://jira.jboss.org/jira/browse/JGRP-177?page=com.atlassian.jira.plugi...
]
Victor N commented on JGRP-177:
-------------------------------
Bela,
I am not sure whether my problem is exactly the same or just similar.
I ran my test on 5 nodes (N1...N5) with a simple TCP config using TCPPING (based on
tcp.xml from the JGroups 2.7 sources). Everything worked for about 3 days, but then I
saw that only 4 nodes could see each other and receive messages from each other, and one
of the nodes (N2) had been excluded from their view.
I looked into the logs, and they are interesting:
view at N1,N3,N4,N5 is {N1,N3,N4,N5}
view at N2 is {N1,N2,N3,N4,N5} - all 5 nodes!
N2 did not receive viewAccepted and it continues sending messages to all other nodes (I
can see this in tcpdump), but those nodes know that N2 is not a member, so they respond
with "discarded message from non-member".
The situation has not changed for several hours: N2 has not received the updated view
and continues sending messages to all the nodes!
Why does N2 not receive the new view? Or why does it not react to the "discarded message
from non-member" error from the other nodes?
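To make the divergence above concrete, here is a minimal sketch that compares the view each node reports and flags the nodes that disagree with the majority. It is plain Java and uses no JGroups API; the class name `ViewDivergence`, the method `nodesWithStaleView`, and the hand-typed view sets are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of JGroups: finds nodes whose view differs
// from the view held by most nodes.
final class ViewDivergence {

    static Set<String> nodesWithStaleView(Map<String, Set<String>> viewByNode) {
        // Count how many nodes hold each distinct view.
        Map<Set<String>, Integer> votes = new HashMap<>();
        for (Set<String> view : viewByNode.values()) {
            votes.merge(view, 1, Integer::sum);
        }
        // Pick the most common view. On a tie the choice is arbitrary.
        Set<String> majority = null;
        int best = 0;
        for (Map.Entry<Set<String>, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                majority = e.getKey();
            }
        }
        // Every node holding a different view is reported.
        Set<String> stale = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : viewByNode.entrySet()) {
            if (!e.getValue().equals(majority)) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Views copied by hand from the logs described above.
        Set<String> four = new TreeSet<>(Arrays.asList("N1", "N3", "N4", "N5"));
        Set<String> five = new TreeSet<>(Arrays.asList("N1", "N2", "N3", "N4", "N5"));
        Map<String, Set<String>> views = new LinkedHashMap<>();
        views.put("N1", four);
        views.put("N2", five);
        views.put("N3", four);
        views.put("N4", four);
        views.put("N5", four);
        System.out.println(nodesWithStaleView(views));
    }
}
```

With the views listed above, the only node reported is N2, which matches what I see: N2 still holds the old 5-member view while the others have moved on.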
Join problem
------------
Key: JGRP-177
URL: https://jira.jboss.org/jira/browse/JGRP-177
Project: JGroups
Issue Type: Bug
Affects Versions: 2.2.8, 2.2.9, 2.2.9.1
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 2.3
Attachments: BaseJGroupsTestCase.java, jgroups.xml, JGroupsTestMain.java,
JGroupsTestRemote.java, test.zip
I run a test case that spawns 4 JGroups nodes in 4 separate Java processes. Several nodes
are then restarted at random and try to reconnect to the group.
The first node sends a ping and counts the responses received by each node.
After somewhere between 20 and 100 iterations, some nodes are unable to join the
group.
I use JGroups 2.2.9 with a TCP-based config (TCP / TCPPING / MERGE2 / FD or FD_SOCK /
VERIFY_SUSPECT / pbcast.NAKACK / pbcast.STABLE / VIEW_SYNC / pbcast.GMS).
EXAMPLE 1 with FD_SOCK:
Node 0
WARN [GMS] failed to collect all ACKs (1) for view [127.0.0.1:7700|32] after 20000ms, missing ACKs from [127.0.0.1:7701] (received=[127.0.0.1:7700])
Ping result: {127.0.0.1:7701=3, 127.0.0.1:7700=3, 127.0.0.1:7703=3}
Node 1
WARN [NAKACK] 127.0.0.1:7701] discarded message from non-member 127.0.0.1:7702
WARN [NAKACK] 127.0.0.1:7701] discarded message from non-member 127.0.0.1:7702
Node 2
WARN [NAKACK] 127.0.0.1:7702] discarded message from non-member 127.0.0.1:7700
ERROR [FD_SOCK] received null cache; retrying
ERROR [FD_SOCK] received null cache; retrying
ERROR [FD_SOCK] received null cache; retrying
Node 3
WARN [NAKACK] 127.0.0.1:7703] discarded message from non-member 127.0.0.1:7702
WARN [NAKACK] 127.0.0.1:7703] discarded message from non-member 127.0.0.1:7702
EXAMPLE 2 with FD timeout="2000" max_tries="4":
Node 0
Ping result: {127.0.0.1:7701=0, 127.0.0.1:7700=2, 127.0.0.1:7703=2}
Node 1
WARN [GMS] handleJoin(127.0.0.1:7701)() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
WARN [GMS] join(127.0.0.1:7701) failed (coord=127.0.0.1:7701), retrying
WARN [GMS] handleJoin(127.0.0.1:7701)() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
WARN [GMS] join(127.0.0.1:7701) failed (coord=127.0.0.1:7701), retrying
Node 2
No ERROR or WARN messages.
Node 3
WARN [GMS] join(127.0.0.1:7703) failed (coord=127.0.0.1:7701), retrying
WARN [GMS] join(127.0.0.1:7703) failed (coord=127.0.0.1:7700), retrying
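
The "Ping result" lines in the two examples can be checked mechanically. The following is a minimal sketch, in plain Java with no JGroups API, that parses such a line and lists the expected members that are either missing from it or have a count of zero. The class name `PingResultCheck`, its method names, and the expected member list (ports 7700 to 7703, taken from the node addresses in the logs) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JGroups or of the attached test case.
final class PingResultCheck {

    // Parses "Ping result: {addr=count, addr=count, ...}" into an ordered map.
    static Map<String, Integer> parse(String line) {
        int open = line.indexOf('{');
        int close = line.lastIndexOf('}');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("no {...} block in: " + line);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return counts;
        }
        for (String entry : body.split(",")) {
            String[] kv = entry.trim().split("=");
            if (kv.length != 2) {
                throw new IllegalArgumentException("bad entry: " + entry);
            }
            counts.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
        }
        return counts;
    }

    // Returns expected members that are absent from the result or have count 0.
    static List<String> unresponsive(Map<String, Integer> counts, List<String> expected) {
        List<String> out = new ArrayList<>();
        for (String member : expected) {
            Integer c = counts.get(member);
            if (c == null || c == 0) {
                out.add(member);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
            "127.0.0.1:7700", "127.0.0.1:7701", "127.0.0.1:7702", "127.0.0.1:7703");
        String example2 = "Ping result: {127.0.0.1:7701=0, 127.0.0.1:7700=2, 127.0.0.1:7703=2}";
        System.out.println(unresponsive(parse(example2), expected));
    }
}
```

For Example 1 this flags only 127.0.0.1:7702; for Example 2 it flags both 127.0.0.1:7701 (count 0) and 127.0.0.1:7702 (absent).
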
Is there something wrong with my JGroups config?