[
https://jira.jboss.org/jira/browse/JGRP-1179?page=com.atlassian.jira.plug...
]
Renaud Devarieux commented on JGRP-1179:
----------------------------------------
I have not been able to reproduce this issue using 2.10. I still need to check whether
the check on the logical and physical address of the GET_MBRS_RSP, which allows an
existing response to be overwritten, is really the cause of the improvement, but I am
confident it is.
However, I ran into other issues tied to PING/Discovery. They are closely related but
perhaps worth a separate issue. Your call, Bela.
Basically, Discovery sends up to n GET_MBRS_REQ to discover the members. Each
GET_MBRS_REQ triggers a round of GET_MBRS_RSP, which increases the initial_member count
toward the limit of the Promise that blocks the discovery. One round of GET_MBRS_RSP may
not be sufficient to discover all the members; the second round of RSP then completes the
count of the Promise. Depending on the order in which the RSP are received, however, the
Promise condition may be signalled before all the RSP are processed, and those
unprocessed RSP may come from a Coordinator elected between the two REQ. => trouble.
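To make the ordering problem above concrete, here is a minimal single-threaded sketch. It is not JGroups code: `DiscoveryRaceSketch`, `Rsp`, `collect` and `arrivals` are hypothetical names, and the "Promise" is reduced to a loop that stops as soon as the expected number of distinct senders is reached. It assumes a newer response from the same sender may overwrite an older one, and only shows how the arrival order of the second round decides whether the coordinator is seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DiscoveryRaceSketch {
    /** Hypothetical stand-in for a GET_MBRS_RSP: sender name plus coordinator flag. */
    static final class Rsp {
        final String sender;
        final boolean coord;
        Rsp(String sender, boolean coord) { this.sender = sender; this.coord = coord; }
    }

    /**
     * Processes responses in arrival order and stops as soon as `expected` distinct
     * senders are known, mimicking a waiter that is released once the count is reached.
     * Returns the latest response processed per sender up to that point.
     */
    static Map<String, Rsp> collect(List<Rsp> arrivals, int expected) {
        Map<String, Rsp> seen = new LinkedHashMap<>();
        for (Rsp rsp : arrivals) {
            seen.put(rsp.sender, rsp);      // a newer response overwrites an older one
            if (seen.size() >= expected) {
                break;                      // waiter released; later responses are never looked at
            }
        }
        return seen;
    }

    static boolean hasCoordinator(Map<String, Rsp> seen) {
        for (Rsp rsp : seen.values()) {
            if (rsp.coord) return true;
        }
        return false;
    }

    /** Builds the arrival sequence seen by D over two request rounds. */
    static List<Rsp> arrivals(boolean coordinatorFirst) {
        List<Rsp> list = new ArrayList<>();
        // Round 1: A, B and C answer; nobody is coordinator yet.
        list.add(new Rsp("A", false));
        list.add(new Rsp("B", false));
        list.add(new Rsp("C", false));
        // Round 2: A has become coordinator in the meantime; E answers for the first time.
        Rsp fromA = new Rsp("A", true);
        Rsp fromE = new Rsp("E", false);
        if (coordinatorFirst) {
            list.add(fromA);
            list.add(fromE);
        } else {
            list.add(fromE);
            list.add(fromA);
        }
        return list;
    }

    public static void main(String[] args) {
        // Coordinator response arrives before the count is completed.
        System.out.println(hasCoordinator(collect(arrivals(true), 4)));
        // E completes the count first; A's coordinator response is left unprocessed.
        System.out.println(hasCoordinator(collect(arrivals(false), 4)));
    }
}
```

In this sketch the only difference between the two runs is whether E's response or A's second response is processed first, which is the kind of ordering dependency described above.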
About TCPPING I am clueless; I haven't tried anything TCP with JGroups.
Incoming PingRsp is ignored despite being sent by a Coordinator.
----------------------------------------------------------------
Key: JGRP-1179
URL: https://jira.jboss.org/jira/browse/JGRP-1179
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.9, 2.6.14
Environment: Linux Red Hat Enterprise 5.0 kernel 2.6.18-8.el5 java 1.6.0_18
Reporter: Renaud Devarieux
Assignee: Bela Ban
Fix For: 2.10
I launch 5 nodes A B C D E in quick succession (nearly simultaneously), all using the
same protocol stack and one channel to communicate among themselves:
UDP(mcast_addr=231.8.8.8;mcast_port=45578):PING(num_initial_members=4):MERGE2:FD:VERIFY_SUSPECT:pbcast.NAKACK:pbcast.STABLE:FRAG2:pbcast.GMS(shun=true):pbcast.FLUSH
As often as not, depending on the timing between node launches, I get 2 views, i.e.
{D} and {A B C E}.
A merge occurs later, but by then it is a bit late for my application, and I don't
think I should have to handle a merge except in case of a real power/network failure.
I noticed that D was timing out (3000ms) during the discovery process despite having
received the 4 GET_MBRS_RSP from the other nodes. D would then decide there was no
coordinator out there and become coordinator itself.
What seems to happen is that D sends two GET_MBRS_REQ and A replies to both. At the time
of the first reply, A is not yet coordinator; by the time D receives the second response,
A has become coordinator, but D ignores the response and doesn't add it to its list of
Responses.
I have written a workaround in the addResponse method of Discovery.Responses. It seems
to work for my case, but I am afraid it could break something else I am not aware of.
public void addResponse(PingRsp rsp) {
    if(rsp == null)
        return;
    promise.getLock().lock();
    try {
        // Workaround 29/03/2010
        int index = ping_rsps.indexOf(rsp);
        // equivalent to "does not contain"
        if (index == -1) {
            ping_rsps.add(rsp);
            promise.getCond().signalAll();
        } else if (rsp.isCoord()) {
            PingRsp pr = ping_rsps.get(index);
            // Check if the already existing element is not server
            if (!pr.isCoord()) {
                ping_rsps.set(index, rsp);
                promise.getCond().signalAll();
            }
        }
        /*if(!ping_rsps.contains(rsp)) {
            ping_rsps.add(rsp);
            promise.getCond().signalAll();
        }*/ // Old JGroups code
    }
    finally {
        promise.getLock().unlock();
    }
}
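For anyone who wants to try the replace-if-coordinator logic outside of JGroups, here is a minimal self-contained sketch of the same idea. It is an illustration under assumptions, not the library's code: `ResponsesSketch` and `Rsp` are hypothetical stand-ins for Discovery.Responses and PingRsp, and `Rsp` equality is assumed to be based on the sender address only, so that a later response from the same member matches the earlier one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ResponsesSketch {
    /** Hypothetical stand-in for PingRsp; equality is by sender address only (an assumption). */
    static final class Rsp {
        final String address;
        final boolean coord;
        Rsp(String address, boolean coord) { this.address = address; this.coord = coord; }
        boolean isCoord() { return coord; }
        @Override public boolean equals(Object o) {
            return o instanceof Rsp && ((Rsp) o).address.equals(address);
        }
        @Override public int hashCode() { return address.hashCode(); }
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private final List<Rsp> rsps = new ArrayList<>();

    /** Adds a new response, or upgrades a stored non-coordinator response to a coordinator one. */
    public void addResponse(Rsp rsp) {
        if (rsp == null) return;
        lock.lock();
        try {
            int index = rsps.indexOf(rsp);
            if (index == -1) {
                rsps.add(rsp);                 // first response from this member
                changed.signalAll();
            } else if (rsp.isCoord() && !rsps.get(index).isCoord()) {
                rsps.set(index, rsp);          // same member, now coordinator: replace stale entry
                changed.signalAll();
            }
            // Otherwise the response carries nothing new and is dropped.
        } finally {
            lock.unlock();
        }
    }

    public boolean hasCoordinator() {
        lock.lock();
        try {
            for (Rsp rsp : rsps) {
                if (rsp.isCoord()) return true;
            }
            return false;
        } finally {
            lock.unlock();
        }
    }

    public int size() {
        lock.lock();
        try {
            return rsps.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ResponsesSketch responses = new ResponsesSketch();
        responses.addResponse(new Rsp("A", false)); // first reply: A is not yet coordinator
        responses.addResponse(new Rsp("A", true));  // second reply: A has become coordinator
        System.out.println(responses.size() + " " + responses.hasCoordinator());
    }
}
```

Note that the sketch only ever upgrades an entry from non-coordinator to coordinator; a later non-coordinator response from the same member does not downgrade it, which matches the workaround above.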
Regards
Renaud