On Mar 9, 2012 4:19 PM, "Bela Ban" <bban@redhat.com> wrote:
>
> Wow !
>
> Does this need to be so complex ? I've spent an hour trying to understand
> it, and am still overwhelmed... :-)
>
Sorry about that Bela, it is quite complex indeed.
> My understanding (based on my changes in 4.2) is that state transfer
> moves/deletes keys based on the diff between 2 subsequent views:
> - Each node checks all of the affected keys
> - If a key should be stored in additional nodes, the key is pushed there
> - If a key shouldn't be stored locally anymore, it is removed
>
That's fine if we block all writes during state transfer, but once we start allowing writes during state transfer we need to either log all changes and send them to the new owners at the end (the approach in 4.2 without your changes) or redirect all commands to the new owners.
In addition to that, we have to either block all commands on the new owners until they receive the entire state or forward get commands to the old owners as well. The same two options apply to lock commands.
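To make the "forward get commands to the old owners" option concrete, here is a rough sketch. It is not the actual Infinispan code: the class, the method names and the per-node in-memory maps are all made up for illustration, and the remote call is simulated by a plain map lookup.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a read on a new owner that falls back to the old owners
// while state transfer is still in progress.
final class ForwardingGetSketch {
    // One in-memory map per node stands in for that node's data container.
    private final Map<String, Map<String, String>> stores = new HashMap<>();

    void put(String node, String key, String value) {
        stores.computeIfAbsent(node, n -> new HashMap<>()).put(key, value);
    }

    // Try the local store first. While state transfer is running, a local miss
    // is not authoritative, so the old owners are asked as well.
    String get(String localNode, String key, List<String> oldOwners, boolean transferInProgress) {
        String value = lookup(localNode, key);
        if (value != null || !transferInProgress) {
            return value;
        }
        for (String owner : oldOwners) {
            value = lookup(owner, key); // stands in for a remote get
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    private String lookup(String node, String key) {
        Map<String, String> store = stores.get(node);
        return store == null ? null : store.get(key);
    }
}
```

The blocking alternative would replace the loop with a wait until the new owner has received the entire state.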
> IMO, there's no need to handle a merge differently from a regular view,
> and we might end up with inconsistent state, but that's unavoidable
> until we have eventual consistency. Fine...
>
I'm not trying to make merges more complicated on purpose :)
I think we need to try our best to prevent data loss, even if there is a chance of inconsistency. We still see clusters in the test suite form via merges from time to time, so we can't just say that after a merge all bets are off.
The problem is that I chose to forward get commands to the old owners AND to remove the cache view rollback (which was blocking in our Lisbon design). This means that we must keep a chain of cache views for which we haven't finished state transfer. With merges that chain turns into a tree, and the coordinator has to broadcast it to all the nodes.
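As a rough illustration of the chain (without merges, so no tree), something like the sketch below is what I have in mind. The names are hypothetical and it only tracks member lists; a real version would compute the owners of a key in each pending view instead of returning every member.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: cache views whose state transfer has not finished yet.
final class PendingViewChainSketch {
    static final class View {
        final int id;
        final List<String> members;

        View(int id, List<String> members) {
            this.id = id;
            this.members = members;
        }
    }

    // Oldest view first, newest view last.
    private final Deque<View> pending = new ArrayDeque<>();

    void install(View view) {
        pending.addLast(view);
    }

    // Once state transfer for viewId has completed, older views can be dropped.
    void transferCompleted(int viewId) {
        while (pending.size() > 1 && pending.peekFirst().id < viewId) {
            pending.removeFirst();
        }
    }

    int size() {
        return pending.size();
    }

    // Nodes that may still hold state: members of every retained view, newest first.
    Set<String> membersToConsult() {
        Set<String> result = new LinkedHashSet<>();
        Iterator<View> it = pending.descendingIterator();
        while (it.hasNext()) {
            result.addAll(it.next().members);
        }
        return result;
    }
}
```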
> Also, why do we need to transfer ownership information ? Can't ownership
> be calculated purely on local information ?
>
The current ownership information can be calculated based solely on the members list. But the ownership in the previous cache view(s) cannot be computed by joiners from their local information alone, so it has to be broadcast by the coordinator.
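A toy example of why the joiner needs the previous view. The placement rule below (hash modulo the member count, then the next members in list order) is an assumption for illustration only, not Infinispan's consistent hash, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: ownership as a pure function of the members list.
final class OwnershipSketch {
    // Assumed placement rule: start at hash(key) mod n, take numOwners consecutive members.
    static List<String> owners(Object key, List<String> members, int numOwners) {
        int n = members.size();
        int start = Math.floorMod(key.hashCode(), n);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(numOwners, n); i++) {
            result.add(members.get((start + i) % n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> oldView = Arrays.asList("A", "B", "C");
        List<String> newView = Arrays.asList("A", "B", "C", "D");
        Integer key = 7;
        // The joiner D can compute the second line from the view it just joined,
        // but the first line needs the old members list, which D never saw.
        System.out.println("old owners: " + owners(key, oldView, 2));
        System.out.println("new owners: " + owners(key, newView, 2));
    }
}
```

With these inputs the key moves from B and C to D and A, so D would have to forward gets to B or C until it has received the state, and it can only know that if somebody sends it the old view.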
> I'm afraid that the complexity will increase the state space (hard to
> test all possible state transitions), lead to unnecessary messages being
> sent and most importantly, might lead to blocks.
>
I agree the increased complexity is a concern, but I'm not willing to give up on non-blocking state transfer just yet...
One particularly nasty problem with the existing, blocking state transfer is that before iterating the data container we need to wait for all the pending commands to finish. So if we have high contention and a 60-second lock acquisition timeout, state transfer is almost guaranteed to take more than 60 seconds.
> The section on locking outright scares me :-) Perhaps reducing the level
> of details here - as Galder suggested - might help to understand the
> basic design.
>
I got burned pretty hard with my asymmetric clusters design, because the implementation turned out to be a lot more complex than the design. So this time I tried to investigate all the interactions between the different choices we're making.
> Sorry for being a bit negative, but I think state transfer is one of the
> most critical and important pieces of code in DIST mode, and this needs
> to work against large (say a couple of hundred nodes) clusters and nodes
> joining, leaving or crashing all the time...
>
I'd argue that the blocking state transfer we have doesn't satisfy this requirement...
> I'm going to re-read the design again, maybe what I said above is just
> BS ... :-)
>
Please do re-read it. I'll try to simplify it a bit by Monday based on your feedback.
>
> On 3/8/12 11:55 AM, Dan Berindei wrote:
> > Hi guys
> >
> > It's been a long time coming, but I finally published the non-blocking
> > state transfer draft on the wiki:
> > https://community.jboss.org/wiki/Non-blockingStateTransfer
> >
> > Unlike my previous state transfer design document, I think I've
> > fleshed out most of the implications. Still, there are some things I
> > don't have a clear solution for yet. As you would expect it's mostly
> > around merging and delayed state transfer.
> >
> > I'm looking forward to hearing your comments/advice!
> >
> > Cheers
> > Dan
> >
> > PS: Let's discuss this over the mailing list only.
> >
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)