[infinispan-dev] Let me understand DIST

Thu Mar 22 10:33:14 EDT 2012

On Sun, Mar 18, 2012 at 12:22 PM, Bela Ban <bban at redhat.com> wrote:
> Hmm. Given the complexity of these issues, I'm starting to wonder whether
> it really makes sense to try to implement state transfer the way we are
> planning this. The track we're on leads to loads of complexity and tons
> of edge cases.
>
> We're trying to provide the C, the A and the P (of CAP), with perhaps a
> slightly degraded A. This may not be feasible... especially on large
> clusters where nodes come and go with a higher frequency than in a small
> cluster.
>

Slight correction here: we are only trying to provide C and A. We
don't really have anything to handle P yet...

> *Maybe* to a certain degree we can simplify this with total order, but
> haven't spent much thought on this.
>
> I think Manik's eventual consistency work will probably be the best
> solution...
>

I don't know yet how either total order or eventual consistency change
the state transfer process, so I wouldn't want to "jump ship" until we
know that they would really make things simpler.

> Comments inline.
>
>
> On 3/16/12 9:43 AM, Dan Berindei wrote:
>> On Fri, Mar 16, 2012 at 9:27 AM, Bela Ban <bban at redhat.com> wrote:
>>>
>>>
>>> On 3/15/12 2:36 PM, Dan Berindei wrote:
>>>
>>>>> If we touch a lot of keys, then sending *all* of the keys to all owners
>>>>> may be sub-optimal; as an optimization, we may want to send only the
>>>>> keys to the nodes which need to store them. This would make the PREPARES
>>>>> potentially much smaller.
>>>>>
>>>>
>>>> Agree, but it's a non-trivial optimization. For instance if there is a
>>>> view change between the prepare and the commit, the recipient of the
>>>> commit may not have all the modifications in the list.
>>>
>>>
>>> Can't we treat this as 2 state transfers ?
>>>
>>
>> I was talking about the originator of the transaction sending the
>> prepare command before a view change and the commit command after the
>> view change (to the new owners).
>>
>> I'm not sure how 2 state transfers would solve this problem. State
>> transfer does not copy the actual values (modifications list) of a
>> pending transaction to the new owners, only the list of possibly
>> locked keys. So the new owner either has the entire mods list (it
>> received the prepare command), or it doesn't have anything (because it
>> didn't). With partial mods lists sent to all the owners, the originator
>> will basically have to resend the (partial) mods lists to everyone
>> involved inside the commit command.
>
>
> How about we take a note of keys that are blocked (due to other TXs) and
> exclude them from ST, but don't block on the pending TXs ?
> When the keys are available, we resend them.
> This is different from having to wait for pending TXs to complete inside
> of ST, but ST of locked keys becomes asynchronous...
> Not sure this makes sense... :-)
>

Let me see if I understand correctly:
* The state transfer doesn't send anything for the keys that are
currently locked by a transaction. It does send the information that
there is a tx locking those keys.
* If the originator wants to commit a tx, it will send the commit to
the same owners that it sent the prepare to, NOT the new owners.
* The old owners will forward the tx commit to the new owners, along
with the list of modifications made by that transaction.
* The same on rollback, the old owners will forward the rollback to
the new owners, along with the original values of the keys.

Except for the 1st bullet point (not sending locked entries), this sounds
very much like my proposal to forward async commits. I don't really
like this point, because it means on rollback we have to send the
original values instead, and we don't want to keep the original values
on the old owners indefinitely.
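
To make the 1st bullet point concrete, here is a minimal sketch of what
the old owner would put in a state transfer chunk under that scheme.
Everything in it (StateChunk, build_state_chunk, the dict-based store and
lock table) is hypothetical and only illustrates the idea; it is not
Infinispan code.

```python
from dataclasses import dataclass, field


@dataclass
class StateChunk:
    """Hypothetical payload sent from an old owner to a new owner."""
    entries: dict = field(default_factory=dict)    # key -> committed value
    locked_by: dict = field(default_factory=dict)  # key -> id of the locking tx


def build_state_chunk(store, locks, keys_to_move):
    """Send values for unlocked keys, and only the lock info for locked ones."""
    chunk = StateChunk()
    for key in keys_to_move:
        if key in locks:
            # Locked by a pending tx: skip the value, report the lock owner.
            chunk.locked_by[key] = locks[key]
        elif key in store:
            chunk.entries[key] = store[key]
    return chunk


chunk = build_state_chunk(
    store={"k1": "v1", "k2": "v2", "k3": "v3"},
    locks={"k2": "tx-7"},
    keys_to_move=["k1", "k2"],
)
```

The value of k2 never leaves the old owner here, which is exactly why a
later rollback would have to ship the original value separately.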

The challenge I saw is in assembling those async commits into a single
1PC prepare command and executing it on the new owner. But perhaps it
would be feasible to execute a "partial commit" on the new owner and
remove the committed keys from the transaction's set of locked keys.
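
A rough sketch of that "partial commit" idea, assuming the new owner only
knows the set of keys a transaction has locked and receives the
modifications in pieces from the old owners. PendingTx and partial_commit
are made-up names for illustration.

```python
class PendingTx:
    """Hypothetical record a new owner keeps for a tx it never saw a prepare for."""

    def __init__(self, tx_id, locked_keys):
        self.tx_id = tx_id
        self.locked_keys = set(locked_keys)

    def partial_commit(self, store, mods):
        """Apply forwarded mods for keys this tx still locks.

        Returns True once every locked key has been committed.
        """
        for key, value in mods.items():
            if key in self.locked_keys:
                store[key] = value
                self.locked_keys.discard(key)  # key is committed, release it
        return not self.locked_keys


store = {}
tx = PendingTx("tx-7", ["a", "b"])
done_first = tx.partial_commit(store, {"a": 1})   # forwarded by one old owner
done_second = tx.partial_commit(store, {"b": 2})  # forwarded by another
```

The transaction can be forgotten on the new owner as soon as
partial_commit returns True, without ever rebuilding a single prepare.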

>
>>> The first sends the key set to all affected nodes. The key set may not
>>> be accurate, in that some keys are missing, and others are not needed.
>>> But it is the *recipient* of the state transfer which discards unneeded
>>> keys (which it shouldn't own).
>>>
>>
>> What is this key set? Where do we get it from?
>
>
> From computing the diff between 2 subsequent views.
>

Sorry, I still don't get this... did you mean the keys that have
changed ownership between the two cache views and this node is the
last owner for (i.e. what we send now)?
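
In other words, something along these lines. This is only a sketch of my
reading of "the diff": the owner lists are passed in as plain dicts
(however the consistent hash computes them), keys_to_push is a
hypothetical helper, and it ignores the case where the last old owner has
itself left the cluster.

```python
def keys_to_push(me, old_owners, new_owners):
    """Keys this node should push, and to which new owners.

    old_owners / new_owners map each key to its ordered owner list in the
    old and the new cache view.
    """
    pushes = {}
    for key, old in old_owners.items():
        new = new_owners.get(key, [])
        gained = [node for node in new if node not in old]
        # Only the last old owner pushes, so each new owner gets one copy.
        if gained and old and old[-1] == me:
            pushes[key] = gained
    return pushes


old = {"k1": ["A", "B"], "k2": ["B", "A"], "k3": ["A", "B"]}
new = {"k1": ["A", "C"], "k2": ["B", "C"], "k3": ["A", "B"]}
pushes_from_a = keys_to_push("A", old, new)
pushes_from_b = keys_to_push("B", old, new)
```

Here k3 does not change owners, so nobody sends it; k1 and k2 each gain C
as an owner and are pushed by B and A respectively.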

>
>
>>> This scheme would batch state transfers in a queue: if we have more than
>>> 1 ST, we combine them. E.g. if we have views V5, V6, V7 in the queue, then
>>> we initiate a state transfer that transfers keys based on the diff
>>> between V5 and V7. If later a V8 is inserted into the queue, we batch
>>> the diffs between V7 and V8.
>>>
>>
>> We already do queue state transfers if all the view changes are joins.
>> But for leaves and merges we interrupt the current state transfer and
>> restart it "from the top".
>>
>> Say we have V5[A,B], V6[A, B, C], V7[A, B, C, D] queued. We do a
>> single state transfer, V5->V7. But when we receive V8[B, C, D], the
>> blocking design interrupts the state transfer, rolls back to V5, and
>> then starts a new state transfer V5->V8.
>>
>> The non-blocking design I proposed does basically the same thing,
>> except it removes the rollback to V5. So we start a state transfer
>> V5->V7, we receive the V8 with a leaver, and we replace the state
>> transfer in progress with another state transfer V5->V8.
>
>
> OK
>
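
For reference, the non-blocking behaviour from the V5..V8 example can be
sketched as below. StateTransferPlanner and its methods are hypothetical
names, and the sketch only tracks which pair of views the transfer should
cover; it treats any missing member as a leaver and does not model merges.

```python
class StateTransferPlanner:
    """Hypothetical tracker for which views a state transfer spans."""

    def __init__(self, installed_view):
        self.installed = list(installed_view)  # last view fully transferred
        self.target = None                     # view the running transfer aims at

    def on_view(self, members):
        """Return (from_view, to_view, has_leaver) for the transfer to run now."""
        members = list(members)
        previous = self.target if self.target is not None else self.installed
        has_leaver = any(node not in members for node in previous)
        # No rollback: keep the installed view as the source and simply
        # replace the target of the transfer in progress.
        self.target = members
        return self.installed, self.target, has_leaver

    def on_transfer_done(self):
        if self.target is not None:
            self.installed, self.target = self.target, None


planner = StateTransferPlanner(["A", "B"])           # V5
plan_v6 = planner.on_view(["A", "B", "C"])           # V6, a join
plan_v7 = planner.on_view(["A", "B", "C", "D"])      # V7, a join
plan_v8 = planner.on_view(["B", "C", "D"])           # V8, A has left
```

The source view stays V5 throughout, so the last plan is V5->V8 with the
leaver flag set, matching the description above.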

Cheers
Dan