[jboss-jira] [JBoss JIRA] Commented: (JGRP-39) A TCP stack does not correctly detect failure (pulled cable) for certain TCPPING configurations

Thu Aug 24 11:15:12 EDT 2006

    [ http://jira.jboss.com/jira/browse/JGRP-39?page=comments#action_12341769 ] 

Brian Stansberry commented on JGRP-39:
--------------------------------------

This issue says it affects 2.2.9 and was fixed in 2.2.8, which is an illogical combination.  The 2.2.9 is likely a mistake, as that release didn't exist when this issue was created.  I just tried to change it to 2.2.7, but that wasn't available on the menu.

> A TCP stack does not correctly detect failure (pulled cable) for certain TCPPING configurations
> -----------------------------------------------------------------------------------------------
>
>                 Key: JGRP-39
>                 URL: http://jira.jboss.com/jira/browse/JGRP-39
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.2.9
>            Reporter: Ovidiu Feodorov
>         Assigned To: Ovidiu Feodorov
>             Fix For: 2.2.8
>
>
> Physical hosts "A" (192.168.1.1, coordinator) and "B" (192.168.1.2) run JGroups processes configured with TCP/TCPPING stacks.
> "A" stack configuration:
> TCP(bind_addr=192.168.1.1;start_port=11800;loopback=true):
> TCPPING(initial_hosts=192.168.1.2[11800];port_range=3;timeout=3500;num_initial_members=3;up_thread=true;down_thread=true):
> MERGE2(min_interval=5000;max_interval=10000):
> FD(shun=true;timeout=1500;max_tries=3;up_thread=true;down_thread=true):
> VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
> pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
> pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
> pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=false;down_thread=true;up_thread=true)
> "B" stack configuration:
> TCP(bind_addr=192.168.1.2;start_port=11800;loopback=true):
> TCPPING(initial_hosts=192.168.1.1[11800];port_range=3;timeout=3500;num_initial_members=3;up_thread=true;down_thread=true):
> MERGE2(min_interval=5000;max_interval=10000):
> FD(shun=true;timeout=1500;max_tries=3;up_thread=true;down_thread=true):
> VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
> pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
> pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
> pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=false;down_thread=true;up_thread=true)
> If I pull the cable under B, the B stack immediately and correctly identifies A as suspect and installs a new view containing only itself.
> However, A does not recognize B as suspect and nondeterministically spews out various info and warning messages. The view (A, B) stays incorrectly "valid" for a long time; sometimes it gets replaced by (A), sometimes not.
> I tracked the cause of the problem down to A's TCPPING configuration and the TCP queue. If A's TCPPING is configured with port_range=1, the problem goes away and the new view is immediately installed in the A stack. It seems that if the TCP queue holds messages other than the SUSPECT message generated by FD, they interfere and the SUSPECT message gets stuck in the queue, with nondeterministic results.
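
To make the port_range observation above concrete, here is a minimal sketch of how an initial_hosts entry might expand into probe targets. It is not JGroups code: the class ProbeTargets and the method expand are hypothetical, and the sketch assumes that port_range=N means N consecutive ports starting at the listed port, which may not match every JGroups release.

```java
import java.util.ArrayList;
import java.util.List;

public class ProbeTargets {
    // Expands an initial_hosts value such as "192.168.1.2[11800]" into host:port targets.
    // ASSUMPTION (not taken from the JGroups source): port_range=N covers the
    // N consecutive ports beginning at the port given in brackets.
    static List<String> expand(String initialHosts, int portRange) {
        List<String> targets = new ArrayList<>();
        for (String entry : initialHosts.split(",")) {
            entry = entry.trim();
            int open = entry.indexOf('[');
            int close = entry.indexOf(']');
            String host = entry.substring(0, open);
            int startPort = Integer.parseInt(entry.substring(open + 1, close));
            for (int port = startPort; port < startPort + portRange; port++) {
                targets.add(host + ":" + port);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // A's configuration as reported: three candidate ports on B.
        System.out.println(expand("192.168.1.2[11800]", 3));
        // The reported workaround: a single candidate port on B.
        System.out.println(expand("192.168.1.2[11800]", 1));
    }
}
```

Under that assumption, port_range=3 leaves A with two extra targets on B where nothing is listening, while port_range=1 leaves only the one port B is actually bound to. Whether traffic to those extra targets is what fills the TCP queue is the reporter's hypothesis, not something this sketch demonstrates.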
