[jboss-jira] [JBoss JIRA] Resolved: (JGRP-1168) Gossip Router's multiple socket connections /w same TCPGossip causes invalid node list

Vladimir Blagojevic (JIRA) jira-events at lists.jboss.org
Mon Jun 14 14:24:46 EDT 2010


     [ https://jira.jboss.org/browse/JGRP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Blagojevic resolved JGRP-1168.
---------------------------------------

    Resolution: Done


Resolved as part of the Gossip Router (GR) work for the 2.10 release.

> Gossip Router's multiple socket connections /w same TCPGossip causes invalid node list
> --------------------------------------------------------------------------------------
>
>                 Key: JGRP-1168
>                 URL: https://jira.jboss.org/browse/JGRP-1168
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9
>            Reporter: vivek v
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.10
>
>         Attachments: GR_Patch-1168.txt, GR_trace.txt, tcpgossip_trace.txt
>
>
> While testing the fix for JGRP-1164, we noticed that there are still some cases where the Gossip Router (GR) may publish a wrong node list, causing node isolation (a join cannot happen if the coordinator is missing from the list). Here is a scenario in which the GR may publish a wrong node list:
> 1) Node A (coordinator) is connected with the Gossip Router
> 2) Node A times out while asking the GR for members
> 3) RouterStub.getMembers(..) throws an exception (Read Timed Out), which causes the state to change to DISCONNECTED
> 4) connectionStateChanged(...) calls TCPGossip.connectionStatusChange(..), which calls RouterStub.destroy(...)
> 5) RouterStub.destroy(..) sends the "Close" message to the Gossip Router and then closes the socket connection
> 6) TCPGossip starts the reconnector to make a new socket connection to the Gossip Router
> The problem is at step 5, as seen in the attached GR log (we added some custom tracing to the Gossip Router code to find the problem). The CLOSE message reaches the GR after the reconnect has happened: in the attached trace, the handler-14 thread (a ConnectionHandler on the GR) is the one that is supposed to be closed, but the handler-15 thread starts before handler-14 is stopped. As a result, the entry for Node A is removed when the CLOSE for handler-14 is received, even though the handler-15 socket connection is still open, and the Gossip Router therefore publishes a wrong node list (missing Node A).
> Note: We used WANem between the Gossip Router and Node A to create random disconnects every 2-3 minutes. Each disconnect lasted 30-45 seconds. There was also 10% packet loss.
> A Few Proposed Solutions
> ------------------------
> 1) The Gossip Router shouldn't accept a new connection if a connection from that IP address already exists; alternatively, it should remove the old connection and then create the new one. This guarantees a one-to-one relationship between a node and the Gossip Router.
> 2) Instead of the IP address, the Gossip Router can use some sort of id for each connection handler in the node list map. This way entries are deleted based on the id (such as a UUID) rather than on the IP address.
> 3) TCPGossip should wait for the acknowledgement of the CLOSE message (just like the handshake for CONNECT). Only once the CLOSE has either failed or succeeded should we start the reconnector. This can be done in conjunction with solution 1.
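
The idea behind proposed solution 2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual GossipRouter code or the fix that went into 2.10: the class RouterRegistrySketch and its methods register, close and isMember are hypothetical names, and node addresses are simplified to plain strings. Each connection handler gets its own UUID, and a CLOSE only removes the node's entry if the closing handler still owns it, so a late CLOSE from an old handler (handler-14 in the trace) cannot remove the entry of the newer connection (handler-15).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch, not the JGroups GossipRouter implementation.
class RouterRegistrySketch {
    // node address -> id of the connection handler that currently owns the entry
    private final Map<String, UUID> owners = new HashMap<>();

    // A new connection from the same address replaces the previous owner.
    // Returns the id assigned to the new connection handler.
    public synchronized UUID register(String address) {
        UUID handlerId = UUID.randomUUID();
        owners.put(address, handlerId);
        return handlerId;
    }

    // Removes the entry only if handlerId is still the owner.
    // A stale CLOSE from an older handler leaves the entry untouched.
    public synchronized boolean close(String address, UUID handlerId) {
        return owners.remove(address, handlerId);
    }

    public synchronized boolean isMember(String address) {
        return owners.containsKey(address);
    }

    public static void main(String[] args) {
        RouterRegistrySketch registry = new RouterRegistrySketch();
        UUID handler14 = registry.register("nodeA"); // original connection
        UUID handler15 = registry.register("nodeA"); // reconnect arrives first
        boolean removed = registry.close("nodeA", handler14); // late CLOSE
        System.out.println("stale CLOSE removed entry: " + removed);
        System.out.println("nodeA still listed: " + registry.isMember("nodeA"));
        registry.close("nodeA", handler15); // CLOSE from the current owner
        System.out.println("nodeA listed after real CLOSE: " + registry.isMember("nodeA"));
    }
}
```

Keying removal on the handler id rather than on the address alone means the order in which the CLOSE and the reconnect arrive no longer matters for the published node list.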


        

