[infinispan-dev] Non-blocking state transfer (ISPN-1424)

Fri Mar 9 14:31:14 EST 2012

On Fri, Mar 9, 2012 at 6:03 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
> I agree with Bela: this looks scary, can't imagine how tricky it would
> be to implement it correctly. Could you split the problem?
>

I thought I just did that with the different sections :)

All the state transfer components are bound together by the approach
we choose for updating the CH, and I think most of the
complications stem from the way we have to maintain ownership
information - not just for the current cache view, but for the
previous cache views as well.

> Also, what happened with the super-simple ideas I'd shared in London?
> Is it the same? I'm assuming it's very different..
> I might have overlooked some important aspect but I'd like to know why
> that approach was discarded. Was it looking too simple ? :P
>

I thought it was the same thing, but in London we only had a very
high-level discussion. We didn't discuss how a joiner gets the chain
of CHs so that it knows which are the old owners, and we only briefly
touched on how to discard old cache views. Please let me know where
you think I've diverged from our discussions :)

> I'm sorry I'm skimming through it, will have more time next week but
> as Galder said as well I'll need to draw this to understand it better.
>
> Some first-impact notes:
> # ownership information
>  Deciding when to start a state transfer, is a different problem.
> move it to another page or drop it?
>  Deciding when to have a node join - same as above?
>

We do need to decide whether we want to allow CH updates without
transferring the actual data or not - it has an impact on how big the
"transaction log" on the new owners can get.

We also need to decide whether we allow state transfer to finish
successfully even though one node has already left the cache view.
I've chosen to "interrupt" the state transfer on a leave in order to
avoid weird situations where DM.getLocation(key) returns an empty set,
but that complicates things a bit so we need to make it explicit here
(e.g. after an error we can't just retry the same cache view, we have
to account for leavers as well, so we can't say that a merge can only
be the last view in the chain).

> # Cache entries
> ##  "We need a snapshot of the iterator" < Can we avoid it? We just
> start refusing to serve write commands by checking any incoming
> command. It's an additional interceptor which checks the incoming
> command is "appropriate" to be handled by us or needs to return some
> kind of rejection code.

We have to iterate over the data container in order to push data, and
that iterator behaves as a snapshot. So there's nothing to avoid here.
I did mention that we start refusing write commands - the iteration of
the data container is the reason why we need the cache view id check.
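
To make the cache view id check a bit more concrete, here is a rough
sketch. All the names (CacheViewGuard, WriteCheck, installView,
checkWrite) are made up for illustration and are not the Infinispan
interceptor API. It also assumes that a write tagged with an older
view id than the locally installed one is the case we reject, which
is my reading of the check rather than something the design fixes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical outcome of the check; not an Infinispan type. */
enum WriteCheck { ACCEPT, REJECT_STALE_VIEW }

/** Hypothetical guard that an interceptor could consult before applying a write. */
final class CacheViewGuard {
    private final AtomicInteger installedViewId;

    CacheViewGuard(int initialViewId) {
        this.installedViewId = new AtomicInteger(initialViewId);
    }

    /** Called when a new cache view is installed locally; the id never goes backwards. */
    void installView(int newViewId) {
        installedViewId.accumulateAndGet(newViewId, Math::max);
    }

    /** Assumption: a write tagged with an older view id is refused so the sender can retry. */
    WriteCheck checkWrite(int commandViewId) {
        return commandViewId < installedViewId.get()
                ? WriteCheck.REJECT_STALE_VIEW
                : WriteCheck.ACCEPT;
    }
}
```

The point is only that the check is a cheap comparison on the command
path; what the originator does with the rejection is a separate
question.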

> ## Need for tombstones < I think we can avoid that actually. We'll
> need it for MVCC replace/delete operations, but for state transfer
> it's not needed if we decide that a Write operation has to send the
> value to the new owners only and send an "authoritative invalidation"
> to all previous owners.
>

I'm not sure how this "authoritative invalidation" would work... any
change to the data container may or may not be reflected in the iteration
done by the state transfer task, and actually by the time we send the
invalidation command the entry could in fact be in a JGroups message
on its way to the new owner.

> # Lock Information
> This is trivial if you stop thinking of them as being special. A lock
> is a marker, and a marker is a value stored in the grid. These values
> are transferred as any other value, with one single differentiator:
> since there always is only one, they are manipulated via CAS
> operations and are guaranteed to be consistent without the need of
> being locked when changed.
>

Actually no, we only keep the lock on the primary owner so we can't
rely on the lock information being there in case of a leave. So we
need to transfer transaction information instead - and the transaction
information only gives a "possibly locked" status for a key.
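
As a minimal sketch of what "possibly locked" means: the new owner
only knows which keys each transferred transaction touches, so the
best it can say is that some in-flight transaction might hold the
lock. TransferredTxTable and its methods are hypothetical names, not
existing Infinispan classes, and transaction ids and keys are plain
strings just to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical table of transaction information received during state transfer. */
final class TransferredTxTable {
    private final Map<String, Set<String>> keysByTx = new HashMap<>();

    /** Records the keys a transferred transaction is known to affect. */
    void addTransaction(String txId, Set<String> affectedKeys) {
        keysByTx.put(txId, new HashSet<>(affectedKeys));
    }

    /** Forgets a transaction once it has committed or rolled back. */
    void removeTransaction(String txId) {
        keysByTx.remove(txId);
    }

    /** True if any known transaction touches the key; this is a hint, not a lock. */
    boolean isPossiblyLocked(String key) {
        for (Set<String> keys : keysByTx.values()) {
            if (keys.contains(key)) {
                return true;
            }
        }
        return false;
    }
}
```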

Even if we do treat the transaction information as the normal state,
it doesn't mean that we don't have to handle it - in fact I wrote this
idea and marked it [LATER] because I thought it would be too
complicated to duplicate the data infrastructure for transaction
information.

> #L1
> let's keep it simple initially and just flush them out as decided.

Agree, it's simpler to flush everything out - but that doesn't mean we
don't have to change anything, we still need to add the old owners as
requestors if L1OnRehash is enabled.

> ## the cleanup you mention: is that not a current bug, orthogonal to
> this design page? (trying to identify more thing to move out)
>

I could move out the fact that it's a current bug, but it's still
something we need to do during state transfer and we have to decide
what locks/latches we need to hold while doing it in order to ensure
consistency.

> #Handling merges
> Could we simplify this by saying that the views are not actually a
> linked list but a tree?

As I wrote the document I wasn't quite sure that the tree approach
would work, but I'm more and more convinced that it would be fine.
I agree that it sounds a little more complicated because I wrote it at
first considering a list of cache views and I only started considering
the generic tree approach when I wrote the merge section. But I
couldn't change it as I wasn't 100% sold on the tree approach (another
idea I had was to ignore any view change after a merge, so the cache
view tree could only have more than 1 branch at the root level), so I
tried to leave the option open hoping for better suggestions from you
guys.

> In this document we're not attempting to solve consistent merging of
> split brain, right? So we only need to know how to move the state to
> the rightful new owner. For conflicts, let's assume there is an
> "ConflictResolver object" which we'll describe/implement somewhere
> else.
>

I don't want to solve consistent merging here, but I do want to make
sure that it is possible. For instance if after a failed merge the
information about the old partitions is lost, no "ConflictResolver
object" could make that state consistent and the state would simply be lost.

The long discussion about which view is newer is there because without
merges we want to ensure consistency and apply received entries in
their logical order (which happens to be the ascending order of their
cache view id) - therefore we need to buffer data received for
"intermediate" nodes in the cache view chain. Merges complicate this
because the cache view tree has multiple leaves and we have no way of
ordering them (unlike JGroups we identify a cache view only by its id,
so there's no way to compare cache views and decide one is a descendant
of another) - hence the need for the intermediate flag.
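
For the no-merge case, the buffering could look roughly like the
sketch below: state is kept per cache view id and released oldest
view first. ViewOrderedBuffer, receive and drainUpTo are hypothetical
names, entries are plain strings for brevity, and the sketch
deliberately ignores merges, where the view ids alone are not enough
to order the branches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical buffer holding pushed state per cache view id (linear chain only). */
final class ViewOrderedBuffer {
    private final TreeMap<Integer, List<String>> pending = new TreeMap<>();

    /** Buffers an entry pushed for the given cache view id. */
    void receive(int viewId, String entry) {
        pending.computeIfAbsent(viewId, id -> new ArrayList<>()).add(entry);
    }

    /** Removes and returns everything up to and including viewId, oldest view first. */
    List<String> drainUpTo(int viewId) {
        List<String> ordered = new ArrayList<>();
        Map<Integer, List<String>> ready = pending.headMap(viewId, true);
        for (List<String> batch : ready.values()) {
            ordered.addAll(batch);
        }
        ready.clear(); // headMap is a live view, so this also removes from pending
        return ordered;
    }
}
```

So entries received for views 5, 3 and 4 (in that arrival order) would
be applied as 3, 4 once view 4 is reached, and 5 stays buffered until
its turn.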

> #State transfer disabled
> We should think about the cases in which this option makes sense to be
> enabled. In those cases, would people still be interested in L1
> consistency and transactions? If not, this is not a problem to solve.
>

The main use case I have in mind is when the user doesn't care about
missing data (we're not the authoritative source) but he does care
about staleness. Most users I talked with (not many, I grant you that)
are willing to accept stale data but on very short timescales. L1,
with its 10 minutes default lifespan, doesn't qualify - and making the
L1 lifespan very short will make it useless.

> after getting to the end, it's not a bad document at all but I still
> think it looks too scary :D
>

Agree with both ;-)

Cheers
Dan

> Cheers,
> Sanne
>
> On 9 March 2012 14:19, Bela Ban <bban at redhat.com> wrote:
>> Wow !
>>
>> Does this need to be so complex ? I've spent an hour trying to understand
>> it, and am still overwhelmed... :-)
>>
>> My understanding (based on my changes in 4.2) is that state transfer
>> moves/deletes keys based on the diff between 2 subsequent views:
>> - Each node checks all of the affected keys
>> - If a key should be stored in additional nodes, the key is pushed there
>> - If a key shouldn't be stored locally anymore, it is removed
>>
>> IMO, there's no need to handle a merge differently from a regular view,
>> and we might end up with inconsistent state, but that's unavoidable
>> until we have eventual consistency. Fine...
>>
>> Also, why do we need to transfer ownership information ? Can't ownership
>> be calculated purely on local information ?
>>
>> I'm afraid that the complexity will increase the state space (hard to
>> test all possible state transitions), lead to unnecessary messages being
>> sent and most importantly, might lead to blocks.
>>
>> The section on locking outright scares me :-) Perhaps reducing the level
>> of details here - as Galder suggested - might help to understand the
>> basic design.
>>
>> Sorry for being a bit negative, but I think state transfer is one of the
>> most critical and important pieces of code in DIST mode, and this needs
>> to work against large (say a couple of hundreds) clusters and nodes
>> joining, leaving or crashing all the time...
>>
>> I'm going to re-read the design again, maybe what I said above is just
>> BS ... :-)
>>
>>
>> On 3/8/12 11:55 AM, Dan Berindei wrote:
>>> Hi guys
>>>
>>> It's been a long time coming, but I finally published the non-blocking
>>> state transfer draft on the wiki:
>>> https://community.jboss.org/wiki/Non-blockingStateTransfer
>>>
>>> Unlike my previous state transfer design document, I think I've
>>> fleshed out most of the implications. Still, there are some things I
>>> don't have a clear solution for yet. As you would expect it's mostly
>>> around merging and delayed state transfer.
>>>
>>> I'm looking forward to hearing your comments/advice!
>>>
>>> Cheers
>>> Dan
>>>
>>> PS: Let's discuss this over the mailing list only.
>>>
>> --
>> Bela Ban, JGroups lead (http://www.jgroups.org)
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev