[infinispan-issues] [JBoss JIRA] (ISPN-4873) Statetransfer thread pool deadlock

Thu Oct 23 15:57:36 EDT 2014

    [ https://issues.jboss.org/browse/ISPN-4873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014714#comment-13014714 ] 

Dan Berindei commented on ISPN-4873:
------------------------------------

1. The node received a lot of {{StateRequestCommand(START_STATE_TRANSFER, topologyId)}} commands before it received the {{CacheTopologyControlCommand(REBALANCE_START, topologyId)}}. {{StateRequestCommand.perform()}} blocks waiting for the new topology id, but there is no OOB thread free to process the rebalance command (because each state request command blocks a remote thread, or an OOB thread when the remote pool gets full).

I think we can fix this by making {{StateRequestCommand}} implement {{TopologyAffectedCommand}}; that way, state request commands will not block any thread while waiting for the new topology.
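
To illustrate the idea of waiting for a topology without holding a thread, here is a minimal, generic sketch. It is not Infinispan's implementation: {{TopologyGateSketch}}, {{submit}}, {{topologyInstalled}} and {{parkedCount}} are hypothetical names, and the only assumption is that a command can be parked as a {{Runnable}} until the topology id it needs has been installed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Executor;

/** Hypothetical sketch: parks commands until their topology id is installed. */
public class TopologyGateSketch {
    private final Executor executor;
    // Commands waiting for a topology, keyed by the topology id they require.
    private final TreeMap<Integer, List<Runnable>> parked = new TreeMap<>();
    private int currentTopologyId;

    public TopologyGateSketch(Executor executor, int initialTopologyId) {
        this.executor = executor;
        this.currentTopologyId = initialTopologyId;
    }

    /** Runs the command now if its topology is already installed, otherwise parks it. */
    public void submit(int requiredTopologyId, Runnable command) {
        synchronized (this) {
            if (requiredTopologyId > currentTopologyId) {
                parked.computeIfAbsent(requiredTopologyId, id -> new ArrayList<>()).add(command);
                return; // the calling thread is free again; nothing blocks
            }
        }
        executor.execute(command);
    }

    /** Called when a new topology arrives; releases every command it unblocks. */
    public void topologyInstalled(int newTopologyId) {
        List<Runnable> ready = new ArrayList<>();
        synchronized (this) {
            currentTopologyId = Math.max(currentTopologyId, newTopologyId);
            // headMap(..., true) is a live view of entries with key <= currentTopologyId.
            NavigableMap<Integer, List<Runnable>> unblocked = parked.headMap(currentTopologyId, true);
            for (List<Runnable> commands : unblocked.values()) {
                ready.addAll(commands);
            }
            unblocked.clear(); // also removes the entries from the backing map
        }
        // Hand the commands to the executor outside the lock.
        for (Runnable command : ready) {
            executor.execute(command);
        }
    }

    /** Number of commands still waiting for a topology. */
    public synchronized int parkedCount() {
        int count = 0;
        for (List<Runnable> commands : parked.values()) {
            count += commands.size();
        }
        return count;
    }
}
```

In this sketch a state request that arrives before its rebalance command only occupies a list entry, so the thread that delivered it can go on to process the rebalance command that eventually calls {{topologyInstalled}}.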

I also think we could make the remote executor pool reject new tasks when all the threads are busy, instead of running them in the caller thread. {{BlockingTaskAwareExecutorServiceImpl}} has some extra code to handle rejection exceptions and retry the task. However, that depends on ISPN-4675.
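
The difference between running an overflow task in the caller thread and rejecting it can be shown with the plain JDK {{ThreadPoolExecutor}}. This is a sketch under assumptions, not the Infinispan configuration: the pool size of 1, the {{SynchronousQueue}} and the class name {{RejectionPolicySketch}} are chosen only to make saturation easy to reproduce, and the retry logic mentioned above is not modelled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: what happens to a task submitted to a saturated pool. */
public class RejectionPolicySketch {

    /**
     * Saturates a one-thread pool, submits one more task and reports the outcome:
     * "rejected" if the pool threw, otherwise the name of the thread that ran the task.
     */
    static String submitWhenSaturated(RejectedExecutionHandler handler) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>(), handler);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Occupy the only worker thread until 'release' is counted down.
        pool.execute(() -> {
            started.countDown();
            awaitQuietly(release);
        });
        awaitQuietly(started);

        AtomicReference<String> ranOn = new AtomicReference<>();
        try {
            pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
            // With CallerRunsPolicy the task has already run, synchronously, on this thread.
            return ranOn.get();
        } catch (RejectedExecutionException e) {
            // With AbortPolicy the submitter stays free and can retry or defer the task.
            return "rejected";
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the submitting thread's name: the overflow task ran in the caller.
        System.out.println(submitWhenSaturated(new ThreadPoolExecutor.CallerRunsPolicy()));
        // Prints "rejected": the caller got an exception instead of doing the work itself.
        System.out.println(submitWhenSaturated(new ThreadPoolExecutor.AbortPolicy()));
    }
}
```

With caller-runs, a full pool silently turns the submitting thread (here standing in for an OOB thread) into a worker; with rejection, the submitter keeps control and the task has to be retried later.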

2. Here, the coordinator started processing state request commands, but each state request blocks a thread from the transport executor pool while it iterates over the data container _and_ while it waits for a response from the state requestor(s). If all the transport executor threads are busy, the OOB thread that received the request will block, because the queue size is just 1. If all the OOB threads are blocked, it's impossible to receive a response from the state requestors.

At least in theory, the thread sending state could be released while waiting for the responses from the state requestor. However, it would make the code harder to reason about, so I wouldn't push it for 6.4. Instead, I would suggest increasing the queue size for the transport executor to avoid blocking an OOB thread for each state transfer. I won't guarantee that it won't break something else, but it's worth trying.
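
The effect of the queue size can be sketched with a plain JDK {{ThreadPoolExecutor}}. The assumptions here are mine: the pool uses a caller-runs overflow policy (as described above for the remote pool), and the thread counts, queue sizes and the name {{QueueSizeSketch}} are illustrative, not the real transport executor settings.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: how queue size decides how much work falls back on the submitter. */
public class QueueSizeSketch {

    /** Submits long-running tasks and counts how many ended up on the submitting thread. */
    static int tasksPushedBackToCaller(int poolThreads, int queueSize, int submissions) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(poolThreads, poolThreads, 0L,
                TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
        Thread caller = Thread.currentThread();
        AtomicInteger onCaller = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < submissions; i++) {
                pool.execute(() -> {
                    if (Thread.currentThread() == caller) {
                        // In the scenario above the caller would now be stuck sending state;
                        // the sketch only counts the event and returns.
                        onCaller.incrementAndGet();
                        return;
                    }
                    try {
                        release.await(); // pool threads stay busy, like a slow state transfer
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return onCaller.get();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 2 threads + 1 queue slot absorb 3 of 10 tasks; the rest land on the caller.
        System.out.println(tasksPushedBackToCaller(2, 1, 10));
        // A queue large enough for the burst keeps the caller free.
        System.out.println(tasksPushedBackToCaller(2, 100, 10));
    }
}
```

A larger queue only defers the work, so queued state requests wait longer before they start; that trade-off is part of why the change is worth trying rather than guaranteed safe.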

> Statetransfer thread pool deadlock
> ----------------------------------
>
>                 Key: ISPN-4873
>                 URL: https://issues.jboss.org/browse/ISPN-4873
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 6.0.2.Final
>            Reporter: Takayoshi Kimura
>            Assignee: Dan Berindei
>
> During massive state transfer with 300 nodes and 3000 caches, the OOB and/or infinispan thread pool gets deadlocked, similar to ISPN-2808.
> The thread pool is configured with max 1400 threads now and increasing them is not a realistic workaround as the user is planning to add more caches.
