[jboss-jira] [JBoss JIRA] Commented: (JGRP-1182) GET_MBRS_RSP are not all processed, Discovery step ends prematurely.

Fri Apr 23 09:41:11 EDT 2010

    [ https://jira.jboss.org/jira/browse/JGRP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12527306#action_12527306 ] 

Bela Ban commented on JGRP-1182:
--------------------------------

I don't think this can be fixed: it's in the nature of concurrent startup without an existing coordinator that the first responses are all non-coord responses.

A workaround is to leave break_on_coord_rsp set to true (the default anyway) and to set num_initial_members to a value greater than the max initial membership, so in your example above:

break_on_coord_rsp="true" num_initial_members="6" timeout="3000"

This way, concurrent startup without a pre-existing coordinator will wait for 6 members. D will get responses from D, A, B, C and E (5 members, fewer than 6), so it'll continue waiting. When it receives the 2nd GET_MBRS_RSP from A (this time as coord), break_on_coord_rsp will terminate the discovery phase.
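
To make the termination rule concrete, here is a minimal sketch in plain Java. It is not JGroups code: the class DiscoverySketch, the types Rsp and Result, and the method discover() are hypothetical names made up for illustration. It assumes that discovery counts distinct responders and stops either when that count reaches num_initial_members or, with break_on_coord_rsp, when a response from a coordinator arrives. The timeout is not modelled. The arrival order is the one from the example in the report.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DiscoverySketch {

    /** One GET_MBRS_RSP as seen by the discovering node (simplified, hypothetical type). */
    record Rsp(String sender, boolean isCoord) {}

    /** Members collected when discovery ended, and the coordinator seen (null if none). */
    record Result(Set<String> members, String coord) {}

    /** Processes responses in arrival order until one of the two stop conditions holds. */
    static Result discover(List<Rsp> arrivals, int numInitialMembers, boolean breakOnCoordRsp) {
        Set<String> members = new LinkedHashSet<>();
        String coord = null;
        for (Rsp rsp : arrivals) {
            members.add(rsp.sender());                       // duplicates are counted once
            if (rsp.isCoord()) coord = rsp.sender();
            if (breakOnCoordRsp && rsp.isCoord()) break;     // coordinator found: stop
            if (members.size() >= numInitialMembers) break;  // enough members: stop
        }
        return new Result(members, coord);
    }

    public static void main(String[] args) {
        // Arrival order at D, taken from the example in the report.
        List<Rsp> arrivals = List.of(
            new Rsp("D", false), new Rsp("A", false), new Rsp("B", false), new Rsp("C", false),
            new Rsp("B", false), new Rsp("E", false),
            new Rsp("A", true)); // A answers again, this time as coordinator

        System.out.println(discover(arrivals, 5, true).coord()); // prints "null"
        System.out.println(discover(arrivals, 6, true).coord()); // prints "A"
    }
}
```

With num_initial_members=5 the loop stops at E's response and never sees A's coordinator response. With 6 it keeps going until that response arrives. The sketch only models the view of a single node.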

Actually, this may not work, as A runs the same logic, and A and D could become coordinators at exactly the same time...

> GET_MBRS_RSP are not all processed, Discovery step ends prematurely.
> --------------------------------------------------------------------
>
>                 Key: JGRP-1182
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1182
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.9, 2.6.14, 2.10
>         Environment: Linux Red Hat Enterprise 5.0 kernel 2.6.18-8.el5 java 1.6.0_18 
>            Reporter: Renaud Devarieux
>            Assignee: Bela Ban
>             Fix For: 2.10
>
>
> I launch successively (nearly simultaneously) 5 nodes A B C D E on 5 hosts using the same protocol stack and one channel to communicate between themselves.
> UDP(mcast_addr=231.8.8.8;mcast_port=45578):PING(num_initial_members=5;timeout=800):MERGE2:FD:VERIFY_SUSPECT:pbcast.NAKACK:pbcast.STABLE:FRAG2:pbcast.GMS:pbcast.FLUSH 
> Discovery sends up to n GET_MBRS_REQ to discover the members. Each GET_MBRS_REQ triggers a round of GET_MBRS_RSP, which increases the initial_member count up to its limit in the Promise blocking the discovery. One GET_MBRS_RSP round may not be sufficient to discover all the members; the second RSP round then completes the count of the Promise. Depending on the order of RSP reception, however, the Promise condition may be signalled before all the RSP are processed, and those unprocessed RSP may belong to a coordinator elected between the two REQ sent. => trouble.
> Example:
> A B C D E are launched
> ...
> D sends GET_MBRS_REQ
> D receives 4 GET_MBRS_RSP from D A B C
> A becomes coordinator
> D sends GET_MBRS_REQ 400ms after the first
> D receives GET_MBRS_RSP from B
> D receives GET_MBRS_RSP from E and reaches the discovery initial_members count. Discovery ends in 428ms
> D receives GET_MBRS_RSP from A. A is coordinator, but it's too late: the response won't be counted in the set of responses
> D becomes coordinator.
> We have two coordinators.
> It may also happen if E is quicker and is part of the first RSP round.
> I am not sure yet how to solve this problem. Obviously D should have been warned that A was becoming coordinator, or at least that A was trying to.
> Perhaps if all the GET_MBRS traffic were multicast, each new member could listen in on it and work out from the different REQ and RSP messages who is doing what.
> I could well see discovery split into two phases: one phase where a new member would "silently" listen to the network, then a second where it actively tries to discover the other members with several GET_MBRS_REQ.
