Dan Berindei commented on ISPN-2986:
------------------------------------
The number of OOB threads in your JGroups configuration is too small.
While a node is joining, the other nodes forward all commands to the joiner, but the
joiner can't process them until it receives the cache topology from the coordinator.
With too few OOB threads, the forwarded commands can occupy every thread in the pool,
leaving no thread free to process the topology update from the coordinator - a deadlock.
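For reference, the OOB pool is sized on the transport protocol in the JGroups
configuration. The values below are only illustrative (attribute names as in
JGroups 3.x; all other transport attributes omitted):
{code:xml}
<!-- on the transport protocol (TCP or UDP) in the JGroups XML -->
<TCP oob_thread_pool.enabled="true"
     oob_thread_pool.min_threads="2"
     oob_thread_pool.max_threads="200"
     oob_thread_pool.keep_alive_time="60000"
     oob_thread_pool.queue_enabled="false"
     oob_thread_pool.rejection_policy="discard"/>
{code}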
With the ISPN-2808 fix, commands are moved to a "remote commands" thread pool,
but that pool still "overflows" into the OOB thread pool, so the OOB pool must
still have an available thread for each incoming command to avoid the deadlock.
We are currently working on ISPN-2849, which will avoid the deadlock by not scheduling
commands to the remote commands thread pool until the proper topology has been installed
(thus freeing the threads for installing the topology).
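As a rough sketch of that idea (illustrative only, not Infinispan's actual code):
commands that require a topology the node hasn't installed yet are parked in a queue
instead of occupying a thread, and are submitted to the executor only once that
topology arrives:
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;

class TopologyGate {
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<Runnable>();
    private volatile int installedTopologyId = -1;
    private final Executor remoteExecutor;

    TopologyGate(Executor remoteExecutor) {
        this.remoteExecutor = remoteExecutor;
    }

    // Called for each incoming remote command: run it only if the topology it
    // was sent with is already installed; otherwise park it in a queue so it
    // does not occupy an OOB / remote-executor thread.
    void submit(int requiredTopologyId, Runnable command) {
        if (requiredTopologyId <= installedTopologyId) {
            remoteExecutor.execute(command);
        } else {
            pending.add(command);
            // re-check: the topology may have been installed concurrently
            if (requiredTopologyId <= installedTopologyId) {
                drainPending();
            }
        }
    }

    // Called when the cache topology from the coordinator is installed.
    void topologyInstalled(int topologyId) {
        installedTopologyId = topologyId;
        drainPending();
    }

    // Simplification: real code would track the topology id each queued
    // command needs; here all queued commands wait for the next install.
    private void drainPending() {
        Runnable command;
        while ((command = pending.poll()) != null) {
            remoteExecutor.execute(command);
        }
    }
}
{code}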
Intermittent failure to start new nodes during heavy write load
---------------------------------------------------------------
Key: ISPN-2986
URL: https://issues.jboss.org/browse/ISPN-2986
Project: Infinispan
Issue Type: Bug
Components: Distributed Cache, Server
Affects Versions: 5.2.5.Final
Environment: 4 servers running Linux 2.6.32-220.13.1.el6.x86_64 with 2x quad-core
2.4 GHz CPUs
Gigabit Ethernet, same switch.
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
Reporter: Marc Bridner
Assignee: Tristan Tarrant
Attachments: logs.zip, test-infinispan.xml, test-jgroups.xml,
test.infinispan.zip
When a new node is started under heavy write load from a Hot Rod client with 64+
threads, the new node will sometimes fail to start, eventually reporting state transfer
timeouts and terminating. During the ~10 minutes it takes to time out, the Hot Rod
client is completely blocked.
Setup is as follows:
3 servers, 1 client
* dl380x2385, 10.64.106.21, client
* dl380x2384, 10.64.106.20, first node
* dl380x2383, 10.64.106.19, second node
* dl380x2382, 10.64.106.18, third node
2 caches, initial state transfer off, transactions on, config is attached.
Small app that triggers the problem is also attached.
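In outline, each cache looks roughly like this in the 5.2 schema (the cache name
below is a placeholder; the attached test-infinispan.xml is authoritative):
{code:xml}
<namedCache name="cache1"> <!-- placeholder name -->
   <clustering mode="distribution">
      <!-- "initial state transfer off" -->
      <stateTransfer fetchInMemoryState="false"/>
   </clustering>
   <!-- "transactions on" -->
   <transaction transactionMode="TRANSACTIONAL"/>
</namedCache>
{code}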
Steps to reproduce:
1. Start first node
2. Start client, wait for counter to reach 50000 (in client)
3. Start second node. 10% chance it'll fail.
4. Wait for counter to reach 100000 in client.
5. Start third node, 50% chance it'll fail.
If it doesn't fail, terminate everything and start over.
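The attached app boils down to something like this (a sketch only: the class name
and cache name are made up, and it assumes the default Hot Rod port 11222):
{code:java}
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

public class WriteLoad {
    public static void main(String[] args) {
        Properties props = new Properties();
        // first node; 11222 is the default Hot Rod port
        props.put("infinispan.client.hotrod.server_list", "10.64.106.20:11222");
        RemoteCacheManager rcm = new RemoteCacheManager(props);
        final RemoteCache<String, byte[]> cache = rcm.getCache("cache1");
        final AtomicLong counter = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(64); // 64+ writer threads
        for (int t = 0; t < 64; t++) {
            pool.execute(new Runnable() {
                public void run() {
                    byte[] value = new byte[1024];
                    while (true) {
                        long i = counter.incrementAndGet();
                        cache.put("key-" + i, value);
                        if (i % 50000 == 0) {
                            System.out.println("counter=" + i); // watch for 50000 / 100000
                        }
                    }
                }
            });
        }
    }
}
{code}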
I realize this may be hard to reproduce, so let me know if more logs or tests are
needed.
I've been unable to reproduce it on a single physical machine, and it only occurs
with more than 64 client threads. Changing the ratio of writes between the two caches
also seems to make it stop occurring. I was unable to reproduce it with TRACE logging
enabled (too slow), but if you can name specific packages you want traced, that might
work. Turning transactions off makes it worse: a 90% chance of failure on the second
node. Oddly enough, disabling the concurrent GC lowers the failure rate to 10% on the
third node. I'm guessing a race condition somewhere; it may be similar to ISPN-2982.