[jboss-jira] [JBoss JIRA] (JGRP-1902) Simplify failure detection and merge timeout configuration

Tue Feb 17 06:27:49 EST 2015

    [ https://issues.jboss.org/browse/JGRP-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040795#comment-13040795 ] 

Bela Ban commented on JGRP-1902:
--------------------------------

Re a value for max detection time: this is {{timeout}} in {{FD_ALL}} or {{FD_ALL2}}. However, there are no guarantees with respect to max failure detection times, as messages might get lost: failure detection protocols cannot use reliable messages as they're below {{NAKACK2}} and {{UNICAST3}} in the stack.

{{FD}} is different: there, the max detection time is {{timeout}} * {{max_tries}}, multiplied by the number of adjacent members that failed (if any). However, I don't want to change {{FD}}, as {{FD_ALL}} or {{FD_ALL2}} should be used instead anyway.
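
To make the arithmetic above concrete, here is a rough sketch in Java. The class and method names are hypothetical and are not part of JGroups, and the sample values (3000 ms, 5 tries) are made up for illustration. It also assumes that "number of adjacent members that failed" counts the failed member itself, so the factor is never less than 1. The results are nominal figures only: since heartbeats can be lost, none of them is a guaranteed bound.

```java
// Hypothetical helper, not a JGroups API: it only restates the arithmetic from the comment.
final class DetectionTimeSketch {

    // FD_ALL / FD_ALL2: the nominal max detection time is the configured timeout.
    static long fdAllNominalMillis(long timeoutMillis) {
        if (timeoutMillis < 0) {
            throw new IllegalArgumentException("timeout must be >= 0");
        }
        return timeoutMillis;
    }

    // FD: timeout * max_tries, multiplied by the number of adjacent failed members.
    // Assumption: a single failed member gives a factor of 1.
    static long fdNominalMillis(long timeoutMillis, int maxTries, int adjacentFailedMembers) {
        if (timeoutMillis < 0 || maxTries < 1) {
            throw new IllegalArgumentException("timeout must be >= 0 and max_tries >= 1");
        }
        long factor = Math.max(1, adjacentFailedMembers);
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(Math.multiplyExact(timeoutMillis, (long) maxTries), factor);
    }

    public static void main(String[] args) {
        // Illustrative values only, not recommended settings.
        System.out.println("FD_ALL, timeout=3000: " + fdAllNominalMillis(3000) + " ms");
        System.out.println("FD, 1 failed member:  " + fdNominalMillis(3000, 5, 1) + " ms");
        System.out.println("FD, 2 adjacent fail:  " + fdNominalMillis(3000, 5, 2) + " ms");
    }
}
```

With these sample values, {{FD}} comes out at 15000 ms for one failed member and 30000 ms for two adjacent ones, whereas {{FD_ALL}} stays at the configured 3000 ms.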

[1] http://www.jgroups.org/manual/index.html

> Simplify failure detection and merge timeout configuration
> ----------------------------------------------------------
>
>                 Key: JGRP-1902
>                 URL: https://issues.jboss.org/browse/JGRP-1902
>             Project: JGroups
>          Issue Type: Enhancement
>    Affects Versions: 3.6
>            Reporter: Dan Berindei
>            Assignee: Bela Ban
>            Priority: Minor
>             Fix For: 3.6.2, 4.0
>
>
> FD/FD_ALL/FD_ALL2/FD_SOCK javadoc doesn't give any guidance as to how long it would take to detect a leaving member. MERGE2/MERGE3 javadoc also doesn't say how long it would take to detect that the network has healed.
> For an example of how misleading the current settings can be, I have seen MERGE3 take more than 20s to merge two partitions with min_interval=1000 and max_interval=5000. FD also detects a leaver after {{timeout * max_tries}} in the best case, and twice that if 2 consecutive nodes (in the members list) leave at the same time.
> The maximum time it takes to detect a leaver is of particular interest to Infinispan users, because Infinispan is supposed to protect against nodes leaving. But if the users don't configure a high enough RPC timeout in Infinispan, we don't get to detect the node leaving.
> Ideally, the user should be able to specify a maximum detection time, and the protocol should adjust the existing settings to meet that (most of the time).
