On Thu, Mar 15, 2012 at 10:31 AM, Bela Ban <bban(a)redhat.com> wrote:
On 3/12/12 3:03 PM, Dan Berindei wrote:
> On Sat, Mar 10, 2012 at 7:07 PM, Bela Ban<bban(a)redhat.com> wrote:
>> Can you confirm that my understanding of how DIST works is correct ?
>>
>>
>> #1 Non transactional modifications
>> - E.g. a PUT
>> - The PUT is sent to the primary owner (e.g. B) and all backup owners
>> (e.g. C) and is applied immediately, with only local lock acquisition
>> (lock-put-unlock)
>
> This is correct, the key lock is acquired in parallel on all the
> owners, so two concurrent PUTs could result in different owners having
> different values.
So the section that issues the (say) 2 unicast PUTs isn't protected by a
lock ? I'm not suggesting to include the waiting for the results from
the 2 futures should be synchronized, but only the *sending* of the 2
unicasts. Then if thread 1 acquired that lock first and thread 2 second,
thread's 1's modifications would be delivered *before* thread 2's
modifications.
Of course, this would order only messages from the same sender.
We do acquire the key lock on the originator as well, I just missed it
when I read the code.
>> (or is the PUT only sent to B, which in turn then updates C?)
>>
>>
>> #2 Transactional modifications
>> - The modifications involve a bunch of keys
>> - When the TX commits:
>> - A PREPARE message with the relevant keys is sent to all primary
>> owners P (to the backup owners as well?)
>
> The PREPARE is sent to the backup owners as well.
> The complete list of modifications is sent to all the owners, primary or backup.
OK, we discussed this in IRC, thanks for the clarification.
If we touch a lot of keys, then sending *all* of the keys to all owners
may be sub-optimal; as an optimization, we may want to send only the
keys to the nodes which need to store them. This would make the PREPARES
potentially much smaller.
Agree, but it's a non-trivial optimization. For instance if there is a
view change between the prepare and the commit, the recipient of the
commit may not have all the modifications in the list.
With recovery enabled the originator would also have to get the list
of modifications from all the targets and combine it into a single
PREPARE command, much more complex than what happens now.
>> If so, then I can assume that a transactional modification touching a
>> number of keys will almost always touch *all* nodes ? Example:
>> - We have 10 nodes
>> - numOwners = 2
>> - If we have a good consistent hash, I can assume that I have to modifiy
>> 5 different keys (10 / 2) on average in a TX to touch *all* nodes in the
>> cluster with the PREPARE/COMMIT phase, correct ?
>>
>
> You have to modify *at least* 5 different keys to touch all the nodes.
> Some keys will share the primary owner, but also one key's primary
> owner can be another's backup owner - so the chances to touch all
> nodes with only 5 keys are very low.
>
> It turns out the probability of a tx touching all the nodes depends on
> whether we have virtual nodes enabled or not - with lots of virtual
> nodes all the {(i, j) where 0<= i != j< numNodes} combinations are
> valid, whereas without virtual nodes only numNodes combinations are
> allowed - {(i, (i+1)%clusterSize) where 0<= i< numNodes}.
>
> I've run some simulations with IPython assuming virtual nodes, and
> these are the number of keys per tx required to touch all the nodes
> 50% of the time for a few representative cluster sizes:
> 4: 4, 8: 10, 16: 25, 32: 61
So we have to touch roughly clusterSize * 2 keys to send modification
requests to all nodes.
Actually the number of required keys is growing a little faster. The
chances of 2 * clusterSize keys touching all the nodes are
2: 1.0,
4: 0.985,
8: 0.91,
16: 0.796,
32: 0.585,
64: 0.327,
128: 0.092
> These are the numbers of keys required to touch numNodes/2 nodes 50%
> of the time:
> 4: 1, 8: 2, 16: 5, 32: 11
OK
> And these are the numbers of keys required to touch numNodes/2 nodes
> 90% of the time:
> 4: 1, 8: 3, 16: 7, 32: 13
>
>
>> If my last statement is correct, is it safe to assume that with DIST and
>> transactional modifications, I will have a lot of TX contention /
>> collisions ?
>>
>
> Not sure what you mean by lot of TX contention - lock contention
> should only depend on the dataset size, unless we use lock striping,
> in which case it depends on the configured concurrency level.
I meant TX rollbacks due to overlapping locks at different nodes, the
stuff Pedro wrote about in his paper on total order.
Hmm, I thought because we sort the keys before locking it shouldn't be
possible to have deadlocks between prepare commands. I was assuming
that the Tx aborts in Pedro's tests were due to write skew check
failures, but I just read his message again and he mentions write skew
check is disabled.
I must be missing something...
> Network traffic would probably increase quadratically with the number
> of keys/tx because we do send all the modifications to all the owners,
> but we could fix that by either transmitting only the relevant bits to
> each owner
IMO that's an optimization that should be done regardless, see my
comment above.
It does complicate other parts of the code, so I don't think it's such
a clear win.
> or by switching to multicast after the number of targets hits a threshold.
The problem of switching between unicasts and multicasts is that there
is no ordering between unicast and multicast messages. So if you send an
anycast M1 (2 unicasts) to B,C, followed by a multicast M2, M1 or M2
could be delivered first. So A could deliver M1 first, folllowed by M2,
while B would deliver M2 followed by M1.
If the above scenario is not a problem because you acquire locks anyway,
we could do it though... might be a nice optimization !
>> If this is correct, this would IMO lay even more importance onto the
>> work done by the Cloud-TM team, replacing 2PC with total order. Also, if
>> we touch almost all nodes, would it make sense to use SEQUENCER for
>> *all* updates ? Would this obviliate the need for TOM (total order for
>> partial replication) ?
>>
>
> I don't enough about Cloud-TM to comment here, but it appears
> SEQUENCER requires the coordinator to receive forward requests from
> all the nodes and send nothing back on the unicast channel.
Yes, this is basically the same problem: with SEQUENCER, there is no
ordering between unicasts and multicasts, either.
> Sending the whole tx as a multicast would certainly be more efficient
> than what we do now with lots of targets.
Just as confirmation of my understanding:
- We have {A,B,C,D,E}
- A TX modifies keys K1 (maps to A:B), K2 (maps to B:C), K5 (maps to E:A)
Do we send 3 anycasts A:B, B:C and E:A or do we send 1 anycast A:B:C:E ?
I think it's the latter, but please confirm...
Nope, we haven't switched to anycasts yet.
We do serialize the command beforehand, so we don't have triple
serialization overhead, but we do need an extra FutureCollator object.
My assumption is that the multicast itself is more efficient, because
from the originator to the switch it's just one packet. I wasn't
thinking of the extra stuff Infinispan and JGroups have to keep for
the multiple unicast requests.
Cheers
Dan