On Fri, Mar 16, 2012 at 9:27 AM, Bela Ban <bban(a)redhat.com> wrote:
On 3/15/12 2:36 PM, Dan Berindei wrote:
>> If we touch a lot of keys, then sending *all* of the keys to all owners
>> may be sub-optimal; as an optimization, we may want to send only the
>> keys to the nodes which need to store them. This would make the PREPARES
>> potentially much smaller.
>>
>
> Agree, but it's a non-trivial optimization. For instance if there is a
> view change between the prepare and the commit, the recipient of the
> commit may not have all the modifications in the list.
Can't we treat this as 2 state transfers ?
I was talking about the originator of the transaction sending the
prepare command before a view change and the commit command after the
view change (to the new owners).
I'm not sure how 2 state transfers would solve this problem. State
transfer does not copy the actual values (modifications list) of a
pending transaction to the new owners, only the list of possibly
locked keys. So the new owner either has the entire mods list (it
received the prepare command), or it doesn't have anything (because it
didn't). With partial mods lists sent to all the owner, the originator
will basically have to resend the (partial) mods lists to everyone
involved inside the commit command.
If state transfer did copy the pending transactions list complete with
mods lists, we'd again need a process on the receiver to merge the
mods lists received from multiple old owners into a single mods list
for a particular transaction. No need for 2 state transfers however.
The first sends the key set to all affected nodes. The key set may
not
be accurate, in that some keys are missing, and others are not needed.
But it is the *recipient* of the state transfer which discards unneeded
keys (which it shouldn't own).
What is this key set? Where do we get it from?
The missing keys will then be sent in the second state transfer.
How do we guarantee that the second state transfer will contain *all*
the missing keys? What makes it different from the first state
transfer?
This schem would batch state transfers in a queue: if we have more
than
1 ST, we combine them. E.g. if we have view V5,V6,V7 in the queue, then
we initiate a state transfer that transfers keys based on the diff
between V5 and V7. If later a V8 is inserted into the queue, we batch
the diffs between V7 and V8.
We already do queue state transfers if all the view changes are joins.
But for leaves and merges we interrupt the current state transfer and
restart it "from the top".
Say we have V5[A,B], V6[A, B, C], V7[A, B, C, D] queued. We do a
single state transfer, V5->V7. But when we receive V8[B, C, D], the
blocking design interrupts the state transfer, rolls back to V5, and
then starts a new state transfer V5->V8.
The non-blocking design I proposed does basically the same thing,
except it removes the rollback to V5. So we start a state transfer
V5->V7, we receive the V8 with a leaver, and we replace the state
transfer in progress with another state transfer V5->V8.
> With recovery enabled the originator would also have to get the list
> of modifications from all the targets and combine it into a single
> PREPARE command, much more complex than what happens now.
Is this really necessary ? Wouldn't the above schem work as well, and
still be correct ?
The recovery information is stored in a separate recover cache, which
is not clustered for performance (and we recommend using a cache store
only with passivation, for the same reason). So upon startup the
originator will have to get partial tx mods lists from the key owners
and piece them together to get the original tx back. Not an impossible
task, mind you, but added complexity nevertheless.
>> I meant TX rollbacks due to overlapping locks at different
nodes, the
>> stuff Pedro wrote about in his paper on total order.
>>
>
> Hmm, I thought because we sort the keys before locking it shouldn't be
> possible to have deadlocks between prepare commands. I was assuming
> that the Tx aborts in Pedro's tests were due to write skew check
> failures, but I just read his message again and he mentions write skew
> check is disabled.
> I must be missing something...
Even if we sort the keys, we can still have deadlocks, caused by
PREPAREs arriving at different times at different nodes, e.g. the
classic deadlock of TX1 acquiring locks on A, TX2 acquiring locks on B,
then TX1 acquiring locks on B and TX2 acquiring locks on A.
Yeah, Pedro's example was very clear. I missed it because I didn't
realize we do the sorting on the recipient of the prepare command, so
it only works between the keys owned by that node.
Cheers
Dan