[infinispan-dev] Let me understand DIST

Bela Ban bban at redhat.com
Thu Mar 15 04:31:05 EDT 2012



On 3/12/12 3:03 PM, Dan Berindei wrote:
> On Sat, Mar 10, 2012 at 7:07 PM, Bela Ban<bban at redhat.com>  wrote:
>> Can you confirm that my understanding of how DIST works is correct ?
>>
>>
>> #1 Non transactional modifications
>> - E.g. a PUT
>> - The PUT is sent to the primary owner (e.g. B) and all backup owners
>> (e.g. C) and is applied immediately, with only local lock acquisition
>> (lock-put-unlock)
>
> This is correct, the key lock is acquired in parallel on all the
> owners, so two concurrent PUTs could result in different owners having
> different values.


So the section that issues the (say) 2 unicast PUTs isn't protected by a 
lock ? I'm not suggesting that waiting for the results of the 2 futures 
should be synchronized, only the *sending* of the 2 unicasts. Then, if 
thread 1 acquired that lock first and thread 2 second, thread 1's 
modifications would be delivered *before* thread 2's modifications.
Of course, this would order only messages from the same sender.
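Something along these lines is what I have in mind - just a rough Java 
sketch; sendPutAsync() is made up and stands in for whatever the real 
RPC dispatcher call looks like:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class OrderedPutSender {

    // Serializes only the *sending* of the per-owner unicasts, not the
    // waiting for their results.
    private final Object sendLock = new Object();

    public void put(List<String> owners, String key, byte[] value) throws Exception {
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        synchronized (sendLock) {
            // Thread 1's unicasts are all handed to the transport before
            // thread 2's, so every owner sees the same sender order.
            for (String owner : owners) {
                futures.add(sendPutAsync(owner, key, value));
            }
        }
        // Waiting for the acks happens outside the lock.
        for (CompletableFuture<Void> f : futures) {
            f.get();
        }
    }

    // Made-up transport call, standing in for the real RPC dispatcher.
    private CompletableFuture<Void> sendPutAsync(String owner, String key, byte[] value) {
        return CompletableFuture.completedFuture(null);
    }
}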


>> (or is the PUT only sent to B, which in turn then updates C?)
>>
>>
>> #2 Transactional modifications
>> - The modifications involve a bunch of keys
>> - When the TX commits:
>>    - A PREPARE message with the relevant keys is sent to all primary
>> owners P (to the backup owners as well?)
>
> The PREPARE is sent to the backup owners as well.
> The complete list of modifications is sent to all the owners, primary or backup.


OK, we discussed this in IRC, thanks for the clarification.

If we touch a lot of keys, then sending *all* of the keys to all owners 
may be sub-optimal; as an optimization, we may want to send each node 
only the keys it needs to store. This would make the PREPAREs 
potentially much smaller.
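Roughly something like the following (a sketch only; the owner lookup is 
passed in explicitly rather than taken from the real consistent hash):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrepareSplitter {

    // Build a per-owner PREPARE payload containing only the keys that
    // owner actually stores, instead of the full modification list.
    public static Map<String, List<String>> splitByOwner(Collection<String> modifiedKeys,
                                                         Map<String, List<String>> ownersByKey) {
        Map<String, List<String>> keysByOwner = new HashMap<>();
        for (String key : modifiedKeys) {
            for (String owner : ownersByKey.get(key)) {  // [primary, backup, ...]
                keysByOwner.computeIfAbsent(owner, o -> new ArrayList<>()).add(key);
            }
        }
        return keysByOwner;
    }

    public static void main(String[] args) {
        Map<String, List<String>> owners = Map.of(
                "K1", List.of("A", "B"),
                "K2", List.of("B", "C"));
        // B ends up with K1 and K2, A only with K1, C only with K2.
        System.out.println(splitByOwner(owners.keySet(), owners));
    }
}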



>> If so, then I can assume that a transactional modification touching a
>> number of keys will almost always touch *all* nodes ? Example:
>> - We have 10 nodes
>> - numOwners = 2
>> - If we have a good consistent hash, I can assume that I have to modify
>> 5 different keys (10 / 2) on average in a TX to touch *all* nodes in the
>> cluster with the PREPARE/COMMIT phase, correct ?
>>
>
> You have to modify *at least* 5 different keys to touch all the nodes.
> Some keys will share the primary owner, but also one key's primary
> owner can be another's backup owner - so the chances to touch all
> nodes with only 5 keys are very low.
>
> It turns out the probability of a tx touching all the nodes depends on
> whether we have virtual nodes enabled or not - with lots of virtual
> nodes all the {(i, j) where 0 <= i != j < numNodes} combinations are
> valid, whereas without virtual nodes only numNodes combinations are
> allowed - {(i, (i+1) % clusterSize) where 0 <= i < numNodes}.
>
> I've run some simulations with IPython assuming virtual nodes, and
> these are the number of keys per tx required to touch all the nodes
> 50% of the time for a few representative cluster sizes:
> 4: 4, 8: 10, 16: 25, 32: 61


So we have to touch roughly clusterSize * 2 keys to send modification 
requests to all nodes.
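For reference, a quick-and-dirty Java version of such a simulation might 
look like this (assuming lots of virtual nodes, i.e. each key's 2 owners 
are a uniformly random pair of distinct nodes - not Dan's actual IPython 
code):

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class TouchAllNodesSim {

    public static void main(String[] args) {
        int numNodes = 16;
        int trials = 10_000;
        Random rnd = new Random();
        // Find the smallest number of keys per tx for which >= 50% of the
        // simulated txs touch every node in the cluster.
        for (int keysPerTx = 1; keysPerTx <= 100; keysPerTx++) {
            int hits = 0;
            for (int t = 0; t < trials; t++) {
                Set<Integer> touched = new HashSet<>();
                for (int k = 0; k < keysPerTx; k++) {
                    int primary = rnd.nextInt(numNodes);
                    int backup;
                    do {
                        backup = rnd.nextInt(numNodes);
                    } while (backup == primary);
                    touched.add(primary);
                    touched.add(backup);
                }
                if (touched.size() == numNodes) {
                    hits++;
                }
            }
            if (hits >= trials / 2) {
                System.out.println(numNodes + " nodes: " + keysPerTx + " keys/tx");
                break;
            }
        }
    }
}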


> These are the numbers of keys required to touch numNodes/2 nodes 50%
> of the time:
> 4: 1, 8: 2, 16: 5, 32: 11


OK


> And these are the numbers of keys required to touch numNodes/2 nodes
> 90% of the time:
> 4: 1, 8: 3, 16: 7, 32: 13
>
>
>> If my last statement is correct, is it safe to assume that with DIST and
>> transactional modifications, I will have a lot of TX contention /
>> collisions ?
>>
>
> Not sure what you mean by a lot of TX contention - lock contention
> should only depend on the dataset size, unless we use lock striping,
> in which case it depends on the configured concurrency level.


I meant TX rollbacks due to overlapping locks at different nodes, the 
stuff Pedro wrote about in his paper on total order.
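As a toy illustration of what I mean (plain JDK locks, nothing 
Infinispan-specific): two txs lock the same two keys in opposite order, 
as happens when their primary owners differ, so with lock timeouts they 
collide and roll back instead of committing.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderCollision {

    static final ReentrantLock k1 = new ReentrantLock();
    static final ReentrantLock k2 = new ReentrantLock();

    // A "tx" grabs its first key, then tries the second with a timeout;
    // if the timeout hits, it gives up, i.e. it would be rolled back.
    static void tx(String name, ReentrantLock first, ReentrantLock second) {
        first.lock();
        try {
            Thread.sleep(100); // widen the race window
            if (second.tryLock(500, TimeUnit.MILLISECONDS)) {
                try {
                    System.out.println(name + " commits");
                } finally {
                    second.unlock();
                }
            } else {
                System.out.println(name + " rolls back (lock timeout)");
            }
        } catch (InterruptedException ignored) {
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        new Thread(() -> tx("tx1", k1, k2)).start(); // locks K1 then K2
        new Thread(() -> tx("tx2", k2, k1)).start(); // locks K2 then K1
    }
}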


> Network traffic would probably increase quadratically with the number
> of keys/tx because we do send all the modifications to all the owners,
> but we could fix that by either transmitting only the relevant bits to
> each owner


IMO that's an optimization that should be done regardless, see my 
comment above.


> or by switching to multicast after the number of targets hits a threshold.


The problem with switching between unicasts and multicasts is that there 
is no ordering between unicast and multicast messages. So if you send an 
anycast M1 (2 unicasts) to B,C, followed by a multicast M2, either M1 or 
M2 could be delivered first. So B could deliver M1 followed by M2, while 
C would deliver M2 followed by M1.

If the above scenario is not a problem because you acquire locks anyway, 
though, we could do it... it might be a nice optimization!


>> If this is correct, this would IMO lay even more importance onto the
>> work done by the Cloud-TM team, replacing 2PC with total order. Also, if
>> we touch almost all nodes, would it make sense to use SEQUENCER for
>> *all* updates ? Would this obviate the need for TOM (total order for
>> partial replication) ?
>>
>
> I don't know enough about Cloud-TM to comment here, but it appears
> SEQUENCER requires the coordinator to receive forward requests from
> all the nodes and send nothing back on the unicast channel.


Yes, this is basically the same problem: with SEQUENCER, there is no 
ordering between unicasts and multicasts, either.



> Sending the whole tx as a multicast would certainly be more efficient
> than what we do now with lots of targets.


Just to confirm my understanding:
- We have {A,B,C,D,E}
- A TX modifies keys K1 (maps to A:B), K2 (maps to B:C), K5 (maps to E:A)

Do we send 3 anycasts A:B, B:C and E:A or do we send 1 anycast A:B:C:E ? 
I think it's the latter, but please confirm...
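In the latter case the target set would simply be the union of all 
owners - a trivial sketch, reusing the made-up node names from the 
example above:

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AnycastTargets {

    public static void main(String[] args) {
        Map<String, List<String>> ownersByKey = Map.of(
                "K1", List.of("A", "B"),
                "K2", List.of("B", "C"),
                "K5", List.of("E", "A"));
        // Union of all owners of the keys touched by the tx.
        Set<String> targets = new TreeSet<>();
        ownersByKey.values().forEach(targets::addAll);
        System.out.println(targets); // [A, B, C, E] -> one anycast to 4 nodes
    }
}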


-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

