On 07/30/2014 01:59 PM, Dan Berindei wrote:



On Wed, Jul 30, 2014 at 12:22 PM, Radim Vansa <rvansa@redhat.com> wrote:

Investigation:
--------------
When I looked at UNICAST3, I saw a lot of missing messages on the
receive side and unacked messages on the send side. This caused me to
look into the (mainly OOB) thread pools and - voila - they were maxed out!

I learned from Pedro that the Infinispan internal thread pool (with a
default of 32 threads) can be configured, so I increased it to 300 and
increased the OOB pools as well.

This mitigated the problem somewhat, but when I increased the requester
threads to 100, I had the same problem again. Apparently, the Infinispan
internal thread pool uses a rejection policy of "run" (caller runs) and
thus executes the command on the submitting JGroups (OOB) thread when it
is exhausted.
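
For illustration, here is a minimal JDK-only sketch (not Infinispan code)
of how a "caller runs" rejection policy pushes work back onto the
submitting thread once the pool and its queue are full - in our case the
submitter is a JGroups OOB thread:

    import java.util.concurrent.*;

    public class CallerRunsDemo {
        public static void main(String[] args) throws Exception {
            // Tiny pool and queue so saturation happens immediately.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    2, 2, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(1),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 5; i++) {
                final int task = i;
                pool.execute(() -> {
                    // Once the pool is saturated, this prints the submitter's
                    // thread name ("main" here; an OOB thread in JGroups).
                    System.out.println("task " + task + " on "
                            + Thread.currentThread().getName());
                    try { Thread.sleep(100); } catch (InterruptedException e) { }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }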

We can't use another rejection policy in the remote executor because the message won't be re-delivered by JGroups, and we can't use a queue either.

Can't we just send a "node is busy" response and cancel the operation? (at least in the cases where this is possible - we can't do that safely for a CommitCommand, but usually it should be doable, right?) And what's the problem with queues, besides the risk that they grow until we run out of memory?

No commit commands here, the cache is not transactional :)

Sure, but any change to the OOB -> remote thread pool hand-off would likely affect both non-tx and tx caches.


If the remote thread pool gets full on a backup node, there is no way to safely cancel the operation - other backup owners may have already applied the write. And even with numOwners=2, there are multiple backup owners during state transfer.

I was thinking about delaying the write until the backup responds, but you're right: with 2 or more backups the situation is not that easy.


We do throw an OutdatedTopologyException on the backups and retry the operation when the topology changes, we could do something similar when the remote executor thread pool is full. But 1) we have trouble preserving consistency when we retry, so we'd rather do it only when we really have to, and 2) repeated retries can be costly, as the primary needs to re-acquire the lock.
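
Conceptually, the originator-side handling might look like the sketch
below. This is only an illustration of the idea: OutdatedTopologyException
is a real Infinispan class, but the busy-pool exception and all the helper
methods here are hypothetical.

    // Sketch of retry-on-full-pool, not actual Infinispan code.
    public class RetrySketch {
        static class OutdatedTopologyException extends RuntimeException {}
        static class RemoteExecutorBusyException extends RuntimeException {} // hypothetical

        static final int MAX_BUSY_RETRIES = 3;

        Object invokeWithRetry(Object command) {
            int busyRetriesLeft = MAX_BUSY_RETRIES;
            while (true) {
                try {
                    return invokeRemotely(command);      // hypothetical helper
                } catch (OutdatedTopologyException e) {
                    waitForNewTopology();                // retry on topology change
                } catch (RemoteExecutorBusyException e) {
                    // Hypothetical: the target rejected us because its pool was full.
                    if (busyRetriesLeft-- == 0) throw e;
                    // Each retry forces the primary to re-acquire the key lock,
                    // which is why repeated retries are costly.
                }
            }
        }

        Object invokeRemotely(Object command) { return null; } // stub
        void waitForNewTopology() {}                           // stub
    }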

The problem with queues is that commands are executed in the order they appear in the queue. If a node has a remote executor thread pool of 100 threads and receives a prepare(tx1, put(k, v1)) command, then 1000 prepare(tx_i, put(k, v_i)) commands, and finally a commit(tx1) command, the commit(tx1) command will block until all but 99 of the prepare(tx_i, put(k, v_i)) commands have timed out.
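
Here is a toy JDK-only illustration of that head-of-line blocking - the
prepare/commit labels are just stand-ins, not Infinispan commands. A
2-thread pool stands in for the 100-thread remote executor, and each
sleeping task stands in for a prepare that blocks on the contended key
until it times out:

    import java.util.concurrent.*;

    public class HeadOfLineDemo {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            CountDownLatch commitTx1 = new CountDownLatch(1);

            // prepare(tx1): holds its locks until commit(tx1) arrives.
            pool.execute(() -> await(commitTx1));
            // The other prepares, each stuck on the same contended key.
            for (int i = 2; i <= 20; i++) {
                pool.execute(() -> sleep(100)); // stand-in for "block, then time out"
            }
            // commit(tx1) is queued *behind* all the prepares, so tx1 cannot
            // finish until the earlier prepares have drained from the queue.
            pool.execute(commitTx1::countDown);

            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            System.out.println("commit(tx1) finally ran");
        }

        static void await(CountDownLatch latch) {
            try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }

        static void sleep(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }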

Makes sense


I have some thoughts on improving that independently of Pedro's work on locking [1], and I've just written them up as ISPN-4585 [2].

[1] https://issues.jboss.org/browse/ISPN-2849
[2] https://issues.jboss.org/browse/ISPN-4585


ISPN-2849 sounds a lot like the state machine-based interceptor stack - I'm looking forward to that! (although that's music of the far future - ISPN 9? 10?)

Thanks for those answers, Dan. I should have realized most of that myself, but I don't have the capacity to keep all the wisdom about NBST algorithms online in my brain :) I hope that some day I can find a student looking for a diploma thesis who is willing to model at least the basic Infinispan algorithms and formally verify that they are (in)correct ;-)

Radim




-- 
Radim Vansa <rvansa@redhat.com>
JBoss DataGrid QA

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev


