[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Wed Jul 30 07:59:33 EDT 2014

On Wed, Jul 30, 2014 at 12:22 PM, Radim Vansa <rvansa at redhat.com> wrote:

>
>> Investigation:
>> ------------
>> When I looked at UNICAST3, I saw a lot of missing messages on the
>> receive side and unacked messages on the send side. This caused me to
>> look into the (mainly OOB) thread pools and - voila - maxed out !
>>
>> I learned from Pedro that the Infinispan internal thread pool (with a
>> default of 32 threads) can be configured, so I increased it to 300 and
>> increased the OOB pools as well.
>>
>> This mitigated the problem somewhat, but when I increased the requester
>> threads to 100, I had the same problem again. Apparently, the Infinispan
>> internal thread pool uses a rejection policy of "run" and thus uses the
>> JGroups (OOB) thread when exhausted.
>>
>
>  We can't use another rejection policy in the remote executor because the
> message won't be re-delivered by JGroups, and we can't use a queue either.
>
>
> Can't we just send response "Node is busy" and cancel the operation? (at
> least in cases where this is possible - we can't do that safely for
> CommitCommand, but usually it could be doable, right?) And what's the
> problem with queues, besides that these can grow out of memory?
>

No commit commands here, the cache is not transactional :)

If the remote thread pool gets full on a backup node, there is no way to
safely cancel the operation - other backup owners may have already applied
the write. And even with numOwners=2, there are multiple backup owners
during state transfer.
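
For anyone not familiar with the "run" rejection policy mentioned in the quoted text, here is a minimal sketch using only plain java.util.concurrent. It is not the actual Infinispan remote executor configuration; the class and method names are made up for illustration. It assumes a pool with no queue (a SynchronousQueue) and ThreadPoolExecutor.CallerRunsPolicy: once the single worker is busy, the next task runs on the submitting thread, which in our case would be the JGroups (OOB) thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Infinispan or JGroups.
class CallerRunsDemo {

    // Returns true if a task submitted to a saturated pool ran on the submitting thread.
    static boolean overflowRunsOnSubmitter() {
        // One worker thread, no queueing, "run" (caller-runs) rejection policy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread[] ranOn = new Thread[1];
        try {
            // First task occupies the only worker until we release it.
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            // The pool is now exhausted, so this task is "rejected" and
            // executed synchronously by the thread calling execute().
            pool.execute(() -> ranOn[0] = Thread.currentThread());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return ranOn[0] == Thread.currentThread();
    }

    public static void main(String[] args) {
        System.out.println("overflow ran on submitter: " + overflowRunsOnSubmitter());
    }
}
```

The point is only that the submitting thread is tied up for as long as the overflow task runs, which is how an exhausted internal pool ends up consuming the OOB threads as well.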

We do throw an OutdatedTopologyException on the backups and retry the
operation when the topology changes, so we could do something similar when
the remote executor thread pool is full. But 1) we have trouble preserving
consistency when we retry, so we'd rather do it only when we really have
to, and 2) repeated retries can be costly, as the primary needs to
re-acquire the lock.

The problem with queues is that commands are executed in the order they are
in the queue. If a node has a remote executor thread pool of 100 threads
and receives a prepare(tx1, put(k, v1)) command, then 1000 prepare(tx_i,
put(k, v_i)) commands, and finally a commit(tx1) command, the commit(tx1)
command will block until all but 99 of the prepare(tx_i, put(k, v_i))
commands have timed out (each of them is waiting for the lock on k, which
tx1 holds until its commit runs).
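
To make the arithmetic above concrete, here is a small simulation of that scenario. It is only a model under stated assumptions, not Infinispan code: the class and method names are hypothetical, prepare(tx1) is assumed to have already run and to hold the lock on k, every later prepare blocks a pool thread until it times out, and queued commands are picked up strictly in FIFO order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a fixed thread pool fed by a FIFO queue.
class FifoQueueBlocking {

    // Counts how many blocked prepare commands must time out before
    // the commit at the tail of the queue is handed to a thread.
    static int timeoutsBeforeCommitRuns(int poolSize, int blockedPrepares) {
        if (poolSize < 1 || blockedPrepares < 0) {
            throw new IllegalArgumentException("poolSize >= 1 and blockedPrepares >= 0 required");
        }
        Deque<String> queue = new ArrayDeque<>();
        for (int i = 0; i < blockedPrepares; i++) {
            queue.add("prepare"); // will block waiting for the lock on k
        }
        queue.add("commit");      // commit(tx1), the only command that can release the lock

        int busy = 0;     // threads stuck in a blocked prepare
        int timedOut = 0; // prepares that gave up waiting for the lock
        while (true) {
            // Free threads take commands from the head of the queue.
            while (busy < poolSize && !queue.isEmpty()) {
                String cmd = queue.poll();
                if (cmd.equals("commit")) {
                    return timedOut;
                }
                busy++;
            }
            // Every thread is blocked; progress is only possible via a timeout.
            busy--;
            timedOut++;
        }
    }

    public static void main(String[] args) {
        System.out.println(timeoutsBeforeCommitRuns(100, 1000));
    }
}
```

With 100 threads and 1000 queued prepares the model reports 901 timeouts, i.e. all but 99 of the prepares, because the other 99 threads are still blocked when the commit finally gets its thread.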

I have some thoughts on improving that independently of Pedro's work on
locking [1], and I've just written that up as ISPN-4585 [2].

[1] https://issues.jboss.org/browse/ISPN-2849
[2] https://issues.jboss.org/browse/ISPN-4585

>
>
> Radim
>
> --
> Radim Vansa <rvansa at redhat.com>
> JBoss DataGrid QA
>