[infinispan-dev] Let me understand DIST

Dan Berindei dan.berindei at gmail.com
Mon Mar 12 10:03:12 EDT 2012


On Sat, Mar 10, 2012 at 7:07 PM, Bela Ban <bban at redhat.com> wrote:
> Can you confirm that my understanding of how DIST works is correct ?
>
>
> #1 Non transactional modifications
> - E.g. a PUT
> - The PUT is sent to the primary owner (e.g. B) and all backup owners
> (e.g. C) and is applied immediately, with only local lock acquisition
> (lock-put-unlock)

This is correct. The key lock is acquired in parallel on all the
owners, so two concurrent PUTs can result in different owners ending
up with different values.
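
Something like this (a made-up Java sketch, not Infinispan's actual
code; a single lock per owner stands in for the real per-key locks):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-in for a cache node: each owner locks only its
    // own copy of the key, writes, and unlocks, with no coordination
    // between owners about write order.
    class Owner {
        private final Map<String, String> store = new ConcurrentHashMap<>();

        synchronized void applyPut(String key, String value) {
            store.put(key, value); // local lock-put-unlock only
        }

        synchronized String get(String key) {
            return store.get(key);
        }
    }

    class ConcurrentPutRace {
        public static void main(String[] args) throws InterruptedException {
            Owner primary = new Owner();
            Owner backup = new Owner();
            // Two concurrent PUTs, each sent to both owners in parallel:
            Thread t1 = new Thread(() -> {
                primary.applyPut("k", "v1");
                backup.applyPut("k", "v1");
            });
            Thread t2 = new Thread(() -> {
                backup.applyPut("k", "v2");
                primary.applyPut("k", "v2");
            });
            t1.start(); t2.start();
            t1.join(); t2.join();
            // Depending on the interleaving, the owners can disagree,
            // e.g. primary=v2 backup=v1:
            System.out.println("primary=" + primary.get("k")
                    + " backup=" + backup.get("k"));
        }
    }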


> (or is the PUT only sent to B, which in turn then updates C?)
>
>
> #2 Transactional modifications
> - The modifications involve a bunch of keys
> - When the TX commits:
>   - A PREPARE message with the relevant keys is sent to all primary
> owners P (to the backup owners as well?)

The PREPARE is sent to the backup owners as well.
The complete list of modifications is sent to all the owners, primary or backup.


>   - All primary owners try to apply the modifications, acquiring locks

On PREPARE the primary owners acquire locks but do nothing else. The
backup owners just register the transaction in their transaction
table.
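
In (made-up) code, each owner's PREPARE handling would look roughly
like this - the names (PrepareCommand, GlobalTx, ...) are invented for
illustration, and the unlock/rollback paths are omitted:

    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class PrepareHandler {
        record GlobalTx(long id) {}
        record PrepareCommand(GlobalTx tx, List<Object> keys,
                              List<Object> modifications) {}

        private final Set<Object> primaryKeys; // keys this node is primary for
        private final Map<Object, GlobalTx> locks = new ConcurrentHashMap<>();
        private final Map<GlobalTx, List<Object>> txTable = new ConcurrentHashMap<>();

        PrepareHandler(Set<Object> primaryKeys) {
            this.primaryKeys = primaryKeys;
        }

        boolean onPrepare(PrepareCommand cmd) {
            // Primary owner: acquire the key locks, nothing else yet
            // (a real implementation would allow re-acquisition by the same tx)
            for (Object key : cmd.keys()) {
                if (primaryKeys.contains(key)
                        && locks.putIfAbsent(key, cmd.tx()) != null) {
                    return false; // lock contention -> this PREPARE fails
                }
            }
            // Every owner, primary or backup, registers the tx and its
            // modifications in the transaction table
            txTable.put(cmd.tx(), cmd.modifications());
            return true;
        }
    }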


>   - If all primary owners can successfully apply all of the
> modifications, the TX commits, else it rolls back

The TM makes the decision to commit or not, as other non-Infinispan
resources could be involved in the transaction. But all the PREPAREs
on the primary owners must succeed in acquiring their locks, or the TM
will roll back the transaction.
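
The decision rule itself is just standard 2PC - a minimal sketch, with
ResourceLike standing in for the enlisted XA resources (not a real API):

    import java.util.List;

    class TwoPhaseCommit {
        interface ResourceLike {
            boolean prepare(); // true = vote to commit
            void commit();
            void rollback();
        }

        static void complete(List<ResourceLike> resources) {
            // If any PREPARE fails (e.g. a primary owner can't acquire
            // its locks), the TM rolls the whole transaction back.
            boolean allPrepared =
                    resources.stream().allMatch(ResourceLike::prepare);
            if (allPrepared) {
                resources.forEach(ResourceLike::commit);
            } else {
                resources.forEach(ResourceLike::rollback);
            }
        }
    }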


>   - After a successful TX, the primary owners update the backup owners:
> here, I'm probably wrong, and this is done *inside* of the TX scope, right ?
>

On COMMIT, all the owners apply the modifications and return successfully.
If an owner doesn't have the list of modifications (because it became
an owner for one or more keys after the PREPARE), the originator will
resend the full modifications list to that owner.

After the COMMIT has succeeded on all the owners, the originator sends
a TXCOMPLETION command only to the primary owners, which then release
the key locks.
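
From the originator's side the commit sequence would look roughly like
this (again a made-up sketch; OwnerRef and the send* methods are
invented names, and error handling is omitted):

    import java.util.List;

    class CommitFlow {
        interface OwnerRef {
            boolean sendCommit(long txId); // false if the owner lacks the mods
            void sendModifications(long txId, List<Object> mods);
            void sendTxCompletion(long txId); // releases the key locks
            boolean isPrimary();
        }

        static void commit(long txId, List<Object> mods, List<OwnerRef> owners) {
            for (OwnerRef owner : owners) {
                if (!owner.sendCommit(txId)) {
                    // A new owner that missed the PREPARE: resend the full
                    // modifications list and retry the COMMIT
                    owner.sendModifications(txId, mods);
                    owner.sendCommit(txId);
                }
            }
            // Only after the COMMIT succeeded everywhere, tell the
            // primary owners to release the key locks
            for (OwnerRef owner : owners) {
                if (owner.isPrimary()) {
                    owner.sendTxCompletion(txId);
                }
            }
        }
    }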

I think setting syncCommitPhase=false doesn't change any of this; it
just means the user thread returns before receiving the COMMIT
response.

>
> So is my understanding of #1 and #2 correct ?
>

Close enough :)


> If so, then I can assume that a transactional modification touching a
> number of keys will almost always touch *all* nodes ? Example:
> - We have 10 nodes
> - numOwners = 2
> - If we have a good consistent hash, I can assume that I have to modify
> 5 different keys (10 / 2) on average in a TX to touch *all* nodes in the
> cluster with the PREPARE/COMMIT phase, correct ?
>

You have to modify *at least* 5 different keys to touch all the nodes.
Some keys will share the same primary owner, and one key's primary
owner can also be another key's backup owner - so the chances of
touching all the nodes with only 5 keys are very low.

It turns out the probability of a tx touching all the nodes depends on
whether we have virtual nodes enabled or not - with lots of virtual
nodes all the {(i, j) where 0 <= i != j < numNodes} owner combinations
are valid, whereas without virtual nodes only numNodes combinations are
allowed - {(i, (i+1) % numNodes) where 0 <= i < numNodes}.

I've run some simulations with IPython assuming virtual nodes. These
are the numbers of keys per tx required to touch the given number of
nodes with the given probability, for a few representative cluster
sizes:

    cluster size             4    8   16   32
    all nodes, 50%           4   10   25   61
    numNodes/2 nodes, 50%    1    2    5   11
    numNodes/2 nodes, 90%    1    3    7   13
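
For reference, here's a rough Java re-creation of that simulation. It
models each key's (primary, backup) pair as a uniform random ordered
pair of distinct nodes - only an approximation of the virtual-nodes
hash, so the numbers it prints may not match the ones above exactly:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    class TouchAllNodesSim {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            int trials = 20_000;
            for (int numNodes : new int[] {4, 8, 16, 32}) {
                for (int keys = 1; ; keys++) {
                    int hits = 0;
                    for (int t = 0; t < trials; t++) {
                        Set<Integer> touched = new HashSet<>();
                        for (int k = 0; k < keys; k++) {
                            int primary = rnd.nextInt(numNodes);
                            int backup = rnd.nextInt(numNodes - 1);
                            if (backup >= primary) backup++; // distinct pair
                            touched.add(primary);
                            touched.add(backup);
                        }
                        if (touched.size() == numNodes) hits++;
                    }
                    if (hits * 2 >= trials) { // all nodes hit >= 50% of txs
                        System.out.println(numNodes + " nodes: "
                                + keys + " keys/tx");
                        break;
                    }
                }
            }
        }
    }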


> If my last statement is correct, is it safe to assume that with DIST and
> transactional modifications, I will have a lot of TX contention /
> collisions ?
>

I'm not sure what you mean by a lot of TX contention - lock contention
should only depend on the dataset size, unless we use lock striping, in
which case it depends on the configured concurrency level.

Network traffic would probably increase quadratically with the number
of keys/tx: each key's modifications are sent to every owner node, and
the number of distinct owner nodes itself grows with the number of
keys (until it saturates at the cluster size). We could fix that by
either transmitting only the relevant modifications to each owner or
by switching to multicast once the number of targets hits a threshold.
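
A quick back-of-the-envelope model of that growth (assuming each key's
owners are a uniform random set of nodes, so a given node is untouched
by one key with probability 1 - numOwners/numNodes):

    class TrafficEstimate {
        public static void main(String[] args) {
            int numNodes = 32, numOwners = 2;
            for (int k = 1; k <= 64; k *= 2) {
                // expected number of distinct owner nodes touched by k keys
                double owners = numNodes
                        * (1 - Math.pow(1.0 - (double) numOwners / numNodes, k));
                // all k modifications go to each of those owners, so the
                // payload grows ~quadratically until owners saturate
                System.out.printf("keys=%2d  ~owners=%5.1f  ~payload=%7.1f%n",
                        k, owners, k * owners);
            }
        }
    }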


> If this is correct, this would IMO lay even more importance onto the
> work done by the Cloud-TM team, replacing 2PC with total order. Also, if
> we touch almost all nodes, would it make sense to use SEQUENCER for
> *all* updates ? Would this obviate the need for TOM (total order for
> partial replication) ?
>

I don't know enough about Cloud-TM to comment here, but it appears
SEQUENCER requires the coordinator to receive forward requests from
all the nodes and send nothing back on the unicast channel. Based on
our tests with state transfer, I think this would require us to use
either UNICAST or UNICAST2+RSVP in order to get good throughput.

> Well, probably not, because we only want to send keys to nodes that
> actually need to store them...
>

Sending the whole tx as a multicast would certainly be more efficient
than what we do now with lots of targets.
With unicasts we could send only the minimum required data to each
target, but that computation would be complex and error-prone.


> Thoughts ?
>

Not sure about replacing the current DIST_SYNC approach, but it seems
to me like it could be a very good fit for DIST_ASYNC.


Cheers
Dan


