[
https://issues.jboss.org/browse/ISPN-5420?page=com.atlassian.jira.plugin....
]
Dan Berindei updated ISPN-5420:
-------------------------------
Status: Pull Request Sent (was: Open)
Git Pull Request:
https://github.com/infinispan/infinispan/pull/3434
The fix for ISPN-5106 already partially fixed the waitForView() problem, but it wasn't
enough. When the coordinator installed two views in quick succession, the thread updating
the cache members list for the first view would block waiting for the
{{CacheTopologyControlCommand(POLICY_GET_STATUS)}} response from the other members. Then,
because the other members received the newer view before sending their join requests, all
the remote-executor and OOB threads would block in {{waitForView()}}, and there would be no
way to receive the {{POLICY_GET_STATUS}} responses (since processing a response also needs
an OOB thread). The solution was to update the cache members asynchronously.
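A minimal sketch of that asynchronous update (class, field and method names here are illustrative approximations, not the actual Infinispan code): the new view id is published immediately, and the potentially blocking {{POLICY_GET_STATUS}} confirmation runs on a dedicated executor instead of the remote-executor/OOB thread that delivered the view:
{noformat}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: publish the new view id right away and run the
// potentially blocking members update / POLICY_GET_STATUS confirmation on a
// dedicated executor instead of the thread that delivered the view.
class ViewHandlerSketch {
    private final ExecutorService asyncViewExecutor = Executors.newSingleThreadExecutor();
    private volatile int viewId = -1;

    void handleClusterView(List<String> newMembers, int newViewId) {
        viewId = newViewId;   // joiners blocked in waitForView() can now proceed
        asyncViewExecutor.submit(() -> updateCacheMembers(newMembers));
    }

    private void updateCacheMembers(List<String> members) {
        // placeholder for the cache members update, including the blocking
        // POLICY_GET_STATUS RPC; it no longer ties up an OOB thread
    }
}
{noformat}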
Testing with a limited number of OOB/remote-executor threads also exposed some other
deadlocks, and the pull request tries to plug as many of them as possible. However,
because the caches will not start in the same order on every node, there is always the
possibility of two nodes sending state transfer requests to each other (for different
caches) and not being able to process the responses (because the OOB threads are all
blocked waiting for those very responses). So a deadlock is still possible if
{{remote-executor.max-threads + OOB.max_threads < number of caches}}.
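The remaining deadlock boils down to a generic thread-pool exhaustion pattern. The sketch below is not Infinispan code; it only simulates a single pool in which every thread blocks waiting for a response that can only be processed by another thread from the same, already exhausted pool:
{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Generic illustration (not Infinispan code): every pool thread blocks
// waiting for a "state transfer response", but delivering a response also
// needs a thread from the same pool, so once the pool is exhausted nothing
// can make progress.
public class PoolExhaustionDemo {
    public static void main(String[] args) throws Exception {
        int poolThreads = 2;   // stands in for the remote-executor + OOB threads
        int caches = 3;        // more pending requests than threads -> hang
        ExecutorService pool = Executors.newFixedThreadPool(poolThreads);

        List<CompletableFuture<Void>> responses = new ArrayList<>();
        for (int i = 0; i < caches; i++) {
            CompletableFuture<Void> response = new CompletableFuture<>();
            responses.add(response);
            // "send a state transfer request" and block until the response arrives
            pool.submit(() -> response.get());
        }
        // the responses arrive later, but processing each one also needs a pool thread
        for (CompletableFuture<Void> response : responses) {
            pool.submit(() -> response.complete(null));
        }

        pool.shutdown();
        System.out.println("all tasks finished: " + pool.awaitTermination(5, TimeUnit.SECONDS));
        pool.shutdownNow();   // interrupt the stuck threads so the JVM can exit
    }
}
{noformat}
With more pending requests than pool threads the demo hangs until the timeout. The real situation involves two nodes and two separate pools rather than one, so the arithmetic differs slightly, but the pattern is the same.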
Thread pools are depleted by ClusterTopologyManagerImpl.waitForView(), causing a deadlock
------------------------------------------------------------------------------------------
Key: ISPN-5420
URL:
https://issues.jboss.org/browse/ISPN-5420
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 6.0.2.Final, 7.1.1.Final
Reporter: Dan Berindei
Assignee: Dan Berindei
Priority: Critical
Fix For: 8.0.0.Final
The join process was designed on the assumption that a node would start its caches
sequentially, so {{ClusterTopologyManager.waitForView()}} would block at most once for
each joining node. However, WildFly actually starts {{2 * Runtime.availableProcessors()}}
caches in parallel, and this can be a problem when the machine has many cores and the
cluster has multiple nodes.
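For illustration only (this is not the actual WildFly startup code), the pattern looks roughly like this: caches are started on a pool sized {{2 * Runtime.availableProcessors()}}, so on a large machine dozens of join requests can reach the coordinator at once:
{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough sketch of the startup pattern (not actual WildFly code): caches are
// started on a pool sized 2 * availableProcessors(), so on a 32-core machine
// up to 64 caches join concurrently.
public class ParallelCacheStartSketch {
    public static void main(String[] args) {
        int parallelism = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService startupPool = Executors.newFixedThreadPool(parallelism);
        for (int i = 0; i < parallelism; i++) {
            String cacheName = "cache-" + i;   // illustrative names only
            // each start ends with a join request that may block in waitForView()
            startupPool.submit(() -> startCache(cacheName));
        }
        startupPool.shutdown();
    }

    private static void startCache(String name) {
        // placeholder for cacheManager.getCache(name)
    }
}
{noformat}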
{{ClusterTopologyManager.handleClusterView()}} only updates the {{viewId}} after it has
updated the topology of each cache AND after it has confirmed the availability of all
the nodes with a {{POLICY_GET_STATUS}} RPC. This RPC can block, and it's very easy for
the remote-executor thread pool on the coordinator to become overloaded with threads like
this:
{noformat}
"remote-thread-172" daemon prio=10 tid=0x00007f0cc48c0000 nid=0x28ca4 in
Object.wait() [0x00007f0c5f25b000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.infinispan.topology.ClusterTopologyManagerImpl.waitForView(ClusterTopologyManagerImpl.java:357)
- locked <0x00000000ff3bd900> (a java.lang.Object)
at
org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
at
org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:162)
at
org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:144)
at
org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:276)
{noformat}
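For reference, the blocked frame at the top of the dump corresponds roughly to the following wait-for-view pattern (a simplified approximation, not the actual Infinispan source; field names are illustrative):
{noformat}
// Simplified approximation of the waitForView() pattern seen in the dump:
// the caller waits on a shared monitor until handleClusterView() bumps the
// view id and notifies it.
class ViewWaiterSketch {
    private final Object viewUpdateLock = new Object();
    private int viewId = -1;

    void waitForView(int joinerViewId) throws InterruptedException {
        synchronized (viewUpdateLock) {
            while (viewId < joinerViewId) {
                // the TIMED_WAITING frame above: the thread does no work, yet it
                // occupies a remote-executor/OOB thread for the whole wait
                viewUpdateLock.wait(1000);
            }
        }
    }

    void handleClusterView(int newViewId) {
        synchronized (viewUpdateLock) {
            viewId = newViewId;
            viewUpdateLock.notifyAll();
        }
    }
}
{noformat}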