[infinispan-issues] [JBoss JIRA] (ISPN-6402) Default GMS.join_timeout is too long

Thu Mar 31 06:10:00 EDT 2016

     [ https://issues.jboss.org/browse/ISPN-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tristan Tarrant updated ISPN-6402:
----------------------------------
    Git Pull Request: https://github.com/infinispan/infinispan/pull/4188, https://github.com/infinispan/infinispan/pull/4199  (was: https://github.com/infinispan/infinispan/pull/4188, https://github.com/infinispan/infinispan/pull/4199, https://github.com/infinispan/infinispan/pull/4175)


> Default GMS.join_timeout is too long
> ------------------------------------
>
>                 Key: ISPN-6402
>                 URL: https://issues.jboss.org/browse/ISPN-6402
>             Project: Infinispan
>          Issue Type: Task
>          Components: Core, Server, Test Suite - Server
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>            Priority: Minor
>             Fix For: 9.0.0.Final, 9.0.0.Alpha1, 8.2.1.Final
>
>
> {{GMS.join_timeout}} is used by JGroups for two purposes:
> # Wait for {{FIND_INITIAL_MBRS}} responses. If other nodes are running but don't answer within {{join_timeout}} ms, the node will start a new partition by itself. If no other nodes are running when the request is sent, but another node starts and sends its own discovery request within {{join_timeout}} ms, the initial cluster view will contain both nodes; this isn't really useful in Infinispan, which has {{gcb.transport().initialClusterSize()}} instead.
> # Once a coordinator is located, the node sends a join request and waits up to {{join_timeout}} ms for a response. After a timeout, the node re-sends the join request (up to a maximum of {{max_join_attempts}}, which defaults to 10).
> The default {{GMS.join_timeout}} in Infinispan is 15000 ms, vs. 2000 ms in JGroups (actually 3000 ms in {{GMS}} itself, but 2000 ms in the example configurations).
> The higher timeout only helps when an existing node is running but unreachable (e.g. because of a long GC pause) at the exact moment another node is joining. I'd argue that applications that can tolerate multi-second pauses would be better served by {{gcb.transport().initialClusterSize(2)}} and/or an external discovery mechanism (e.g. {{FILE_PING}}, or something based on the WildFly domain controller). For most applications, the current default just means a 15s delay every time the cluster is (re)started.
> In particular, because our integration tests use the default configuration, it means a delay of 15s for every test that starts a cluster.
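
To put rough numbers on the waits described above, here is a minimal back-of-the-envelope sketch in Java. It is not JGroups or Infinispan code. The class and method names are hypothetical, the timeout values are the ones quoted in the issue, and the figure of 200 cluster starts is an invented illustration rather than a measurement of the real test suite. The join-response bound assumes {{max_join_attempts}} counts total attempts and ignores any discovery rounds between attempts.

```java
import java.time.Duration;

/** Hypothetical helper: simple arithmetic on the waits described in ISPN-6402. */
public final class JoinTimeoutEstimate {

    // Values quoted in the issue text.
    public static final long INFINISPAN_JOIN_TIMEOUT_MS = 15_000;
    public static final long JGROUPS_EXAMPLE_JOIN_TIMEOUT_MS = 2_000;
    public static final int DEFAULT_MAX_JOIN_ATTEMPTS = 10;

    private JoinTimeoutEstimate() {
    }

    /**
     * A node that gets no discovery responses waits one full join_timeout
     * before starting a partition by itself.
     */
    public static Duration firstNodeStartDelay(long joinTimeoutMs) {
        return Duration.ofMillis(joinTimeoutMs);
    }

    /**
     * Simplified upper bound on time spent waiting for join responses when
     * every attempt times out (assumes maxJoinAttempts total attempts).
     */
    public static Duration maxJoinResponseWait(long joinTimeoutMs, int maxJoinAttempts) {
        return Duration.ofMillis(joinTimeoutMs).multipliedBy(maxJoinAttempts);
    }

    /**
     * Time saved across a suite in which each cluster start pays the
     * first-node delay once.
     */
    public static Duration suiteSavings(int clusterStarts, long oldTimeoutMs, long newTimeoutMs) {
        return Duration.ofMillis(oldTimeoutMs - newTimeoutMs).multipliedBy(clusterStarts);
    }

    public static void main(String[] args) {
        int hypotheticalClusterStarts = 200; // illustrative assumption only

        System.out.println("First-node delay (Infinispan default): "
                + firstNodeStartDelay(INFINISPAN_JOIN_TIMEOUT_MS));
        System.out.println("First-node delay (JGroups examples): "
                + firstNodeStartDelay(JGROUPS_EXAMPLE_JOIN_TIMEOUT_MS));
        System.out.println("Worst-case join-response wait: "
                + maxJoinResponseWait(INFINISPAN_JOIN_TIMEOUT_MS, DEFAULT_MAX_JOIN_ATTEMPTS));
        System.out.println("Suite savings for " + hypotheticalClusterStarts + " cluster starts: "
                + suiteSavings(hypotheticalClusterStarts,
                        INFINISPAN_JOIN_TIMEOUT_MS, JGROUPS_EXAMPLE_JOIN_TIMEOUT_MS));
    }
}
```

Under these assumptions the per-start cost scales linearly with {{join_timeout}}, which is why lowering the default mostly benefits workloads that start clusters often, such as the integration tests.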

