]
Sebastian Łaskawiec closed ISPN-6402.
-------------------------------------
Default GMS.join_timeout is too long
------------------------------------
Key: ISPN-6402
URL: https://issues.jboss.org/browse/ISPN-6402
Project: Infinispan
Issue Type: Task
Components: Core, Server, Test Suite - Server
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Minor
Fix For: 8.2.1.Final, 9.0.0.Alpha1, 9.0.0.Final
{{GMS.join_timeout}} is used by JGroups in the following cases:
# Wait for {{FIND_INITIAL_MBRS}} responses. If other nodes are running, but they
don't answer within {{join_timeout}} ms, the node will start a new partition by
itself.
# If no other nodes are running when the request is sent, but another node starts and
sends its own discovery request within {{join_timeout}}, the initial cluster view will
contain both nodes, but this isn't really useful in Infinispan (we have
{{gcb.transport().initialClusterSize()}} instead).
# Once a coordinator is located, the node sends a join request and waits for a response
for {{join_timeout}} ms. After a timeout, the node re-sends the join request (up to a
maximum of {{max_join_attempts}}, which defaults to 10).
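
To make the retry behaviour in the last item concrete, here is a toy model of the join loop. It is not JGroups code: the class and method names are made up for illustration, and it assumes that {{max_join_attempts}} counts the total number of join requests sent, including the first one.

```java
import java.util.function.IntPredicate;

/** Toy model of the join retry loop described above; not JGroups code. */
final class JoinRetryModel {

    /** Upper bound on the time spent waiting for join responses, in ms. */
    static long worstCaseJoinWaitMs(long joinTimeoutMs, int maxJoinAttempts) {
        return joinTimeoutMs * maxJoinAttempts;
    }

    /**
     * Simulates sending join requests until one is answered.
     * coordinatorAnswers.test(n) says whether attempt n (1-based) gets a response.
     * Returns the time spent in timed-out attempts, or -1 if every attempt timed out.
     */
    static long waitedBeforeJoinMs(long joinTimeoutMs, int maxJoinAttempts,
                                   IntPredicate coordinatorAnswers) {
        long waited = 0;
        for (int attempt = 1; attempt <= maxJoinAttempts; attempt++) {
            if (coordinatorAnswers.test(attempt)) {
                return waited; // answered; the response latency itself is ignored
            }
            waited += joinTimeoutMs; // timed out, the join request is re-sent
        }
        return -1; // gave up after maxJoinAttempts
    }

    public static void main(String[] args) {
        System.out.println(worstCaseJoinWaitMs(15000, 10)); // 150000
        System.out.println(worstCaseJoinWaitMs(2000, 10));  // 20000
        // Coordinator unreachable (e.g. long GC) during the first two attempts:
        System.out.println(waitedBeforeJoinMs(2000, 10, attempt -> attempt >= 3)); // 4000
    }
}
```

Under these assumptions, a shorter {{join_timeout}} still leaves room for a temporarily unresponsive coordinator, because the join request is retried rather than abandoned after the first timeout.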
The default {{GMS.join_timeout}} in Infinispan is 15000 ms, vs. 2000 ms in JGroups (actually
3000 ms in {{GMS}} itself, but 2000 ms in the example configurations).
The higher timeout only helps when an existing node is running but unreachable
(e.g. because of a long GC pause) at the exact time another node is joining. I'd argue that
applications that can tolerate multi-second pauses would be better served by
{{gcb.transport().initialClusterSize(2)}} and/or an external discovery mechanism (e.g.
{{FILE_PING}}, or something based on the WildFly domain controller). For most
applications, the current default just adds a 15s delay every time the cluster is
(re)started.
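
For reference, the {{initialClusterSize}} alternative mentioned above could look roughly like the following configuration sketch. It needs Infinispan on the classpath, and the class name and the one-minute timeout are arbitrary choices for illustration, not recommended values.

```java
import java.util.concurrent.TimeUnit;

import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

// Illustrative sketch only: class name and timeout value are assumptions.
public class InitialClusterSizeExample {
    public static void main(String[] args) {
        GlobalConfigurationBuilder gcb = GlobalConfigurationBuilder.defaultClusteredBuilder();
        gcb.transport()
           .initialClusterSize(2)                       // wait until the view has 2 members
           .initialClusterTimeout(1, TimeUnit.MINUTES); // give up waiting after 1 minute

        // The cache manager starts the transport, which waits for the initial view.
        EmbeddedCacheManager manager = new DefaultCacheManager(gcb.build());
        try {
            System.out.println("Members: " + manager.getMembers());
        } finally {
            manager.stop();
        }
    }
}
```

With this in place, an application that must not start as a singleton partition waits explicitly for its peers instead of relying on a long {{join_timeout}}.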
In particular, because our integration tests use the default configuration, it means a
delay of 15s for every test that starts a cluster.
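
As a rough illustration of the cumulative cost, the snippet below multiplies the per-start delay by the number of cluster starts. The count of 200 cluster starts is a hypothetical figure, not a measurement from our test suite, and the model assumes each cluster start pays the full {{join_timeout}} exactly once.

```java
/** Back-of-the-envelope model; the number of cluster starts is hypothetical. */
final class StartupDelayModel {

    /** Total time spent waiting for discovery across all cluster starts, in seconds. */
    static long totalDelaySeconds(int clusterStarts, long joinTimeoutMs) {
        return clusterStarts * joinTimeoutMs / 1000;
    }

    public static void main(String[] args) {
        int clusterStarts = 200; // hypothetical number of tests that start a cluster
        System.out.println(totalDelaySeconds(clusterStarts, 15000)); // 3000 s with the current default
        System.out.println(totalDelaySeconds(clusterStarts, 2000));  // 400 s with the JGroups example value
    }
}
```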