On Mar 9, 2012 4:19 PM, "Bela Ban" <bban(a)redhat.com> wrote:
Wow !
Does this need to be so complex ? I've spent an hour trying to understand
it, and am still overwhelmed... :-)
Sorry about that Bela, it is quite complex indeed.
My understanding (based on my changes in 4.2) is that state transfer
moves/deletes keys based on the diff between 2 subsequent views:
- Each node checks all of the affected keys
- If a key should be stored in additional nodes, the key is pushed there
- If a key shouldn't be stored locally anymore, it is removed
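The per-key rule above can be sketched roughly as follows. This is only an illustration, not the Infinispan implementation: the class name, the `owners` helper and its "hash modulo member count" placement are assumptions made up for the example, and the real consistent hash is different.

```java
import java.util.*;

// Hypothetical sketch of the "diff between two views" rule; not Infinispan code.
final class ViewDiffSketch {

    // ASSUMPTION: toy ownership function. Picks numOwners consecutive members,
    // starting at hash(key) mod memberCount. The real consistent hash differs.
    static List<String> owners(String key, List<String> members, int numOwners) {
        List<String> result = new ArrayList<>();
        int start = Math.floorMod(key.hashCode(), members.size());
        for (int i = 0; i < Math.min(numOwners, members.size()); i++) {
            result.add(members.get((start + i) % members.size()));
        }
        return result;
    }

    // Nodes that own the key in the new view but did not own it in the old view:
    // the key has to be pushed to these nodes.
    static Set<String> pushTargets(String key, List<String> oldView,
                                   List<String> newView, int numOwners) {
        Set<String> targets = new LinkedHashSet<>(owners(key, newView, numOwners));
        targets.removeAll(owners(key, oldView, numOwners));
        return targets;
    }

    // True if this node owned the key in the old view but no longer owns it
    // in the new view, so the local copy can be removed.
    static boolean shouldRemoveLocally(String self, String key, List<String> oldView,
                                       List<String> newView, int numOwners) {
        return owners(key, oldView, numOwners).contains(self)
            && !owners(key, newView, numOwners).contains(self);
    }
}
```

Each node would run both checks for every affected key it holds. The sketch says nothing about writes arriving while the keys are being moved, which is the harder part discussed next.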
That's fine if we block all writes during state transfer, but once we start
allowing writes during state transfer we need to log all changes and send
them to the new owners at the end (the approach in 4.2 without your
changes) or redirect all commands to the new owners.
In addition to that, we have to either block all commands on the new owners
until they receive the entire state or forward get commands to the old
owners as well. The same two options apply to lock commands.
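As a rough illustration of the "forward get commands to the old owners" option, the set of nodes a get is sent to could be computed like this. The class and method names are hypothetical and the `transferInProgress` flag is an assumption; the sketch ignores how the replies are reconciled.

```java
import java.util.*;

// Hypothetical sketch of read routing during state transfer; not Infinispan code.
final class ForwardingGetSketch {

    // While state transfer is still running, the new owners may not have the
    // value yet, so the old owners are asked as well. Once the transfer has
    // finished, only the new owners are asked.
    static List<String> readTargets(List<String> newOwners, List<String> oldOwners,
                                    boolean transferInProgress) {
        // LinkedHashSet keeps insertion order and drops nodes that own the key
        // in both views.
        Set<String> targets = new LinkedHashSet<>(newOwners);
        if (transferInProgress) {
            targets.addAll(oldOwners);
        }
        return new ArrayList<>(targets);
    }
}
```

The blocking alternative would instead keep the target list equal to the new owners and make them wait until they have received the entire state.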
IMO, there's no need to handle a merge differently from a regular view,
and we might end up with inconsistent state, but that's unavoidable
until we have eventual consistency. Fine...
I'm not trying to make merges more complicated on purpose :)
I think we need to try our best to prevent data loss, even if there is a
chance of inconsistency. We still see clusters in the test suite form via
merges from time to time, so we can't just say after a merge all bets are
off.
The problem is that I chose to forward get commands to the old owners AND
to remove the cache view rollback (which was blocking in our Lisbon
design). This means that we must keep a chain of cache views for which we
haven't finished state transfer, and with merges that chain turns into a
tree, and it has to be broadcast by the coordinator to all the nodes.
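To show why the chain becomes a tree, here is a minimal sketch of a pending cache view that remembers its predecessors. A regular view change has one parent, a merge has one parent per partition. The `View` class, its fields and the traversal are assumptions for illustration only and are not taken from the design document.

```java
import java.util.*;

// Hypothetical sketch of the pending cache view chain/tree; not Infinispan code.
final class PendingViewTreeSketch {

    static final class View {
        final int id;
        final List<String> members;
        // One parent for a regular view change, several parents for a merge.
        final List<View> parents;

        View(int id, List<String> members, View... parents) {
            this.id = id;
            this.members = members;
            this.parents = Arrays.asList(parents);
        }
    }

    // Walks from the latest view back through every view whose state transfer
    // has not finished yet (breadth-first, newest first). These are the views
    // whose ownership information a node still needs to know about.
    static List<View> pendingViews(View latest) {
        List<View> result = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<View> queue = new ArrayDeque<>();
        queue.add(latest);
        while (!queue.isEmpty()) {
            View view = queue.poll();
            if (seen.add(view.id)) {
                result.add(view);
                queue.addAll(view.parents);
            }
        }
        return result;
    }
}
```

A joiner cannot rebuild this structure from its own information, which is why the coordinator would have to send it to all the nodes.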
Also, why do we need to transfer ownership information ? Can't ownership
be calculated purely on local information ?
The current ownership information can be calculated based solely on the
members list. But the ownership in the previous cache view(s) cannot be
computed by joiners based only on their own information, so it has to be
broadcast by the coordinator.
I'm afraid that the complexity will increase the state space (hard to
test all possible state transitions), lead to unnecessary messages being
sent and most importantly, might lead to blocks.
I agree the increased complexity is a concern, but I'm not willing to give
up on non-blocking state transfer just yet...
One particularly nasty problem with the existing, blocking, state transfer
is that before iterating the data container we need to wait for all the
pending commands to finish. So if we have high contention and a 60-second
lock acquisition timeout, state transfer is almost guaranteed to take > 60
seconds.
The section on locking outright scares me :-) Perhaps reducing the level
of detail here - as Galder suggested - might help to understand the
basic design.
I got burned pretty hard with my asymmetric clusters design, because the
implementation turned out a lot more complex than the design, so I tried to
investigate all the interactions between the different choices we're making
this time.
Sorry for being a bit negative, but I think state transfer is one of the
most critical and important pieces of code in DIST mode, and this needs
to work against large (say a couple of hundred) clusters and nodes
joining, leaving or crashing all the time...
I'd argue that the blocking state transfer we have doesn't satisfy this
requirement...
I'm going to re-read the design again, maybe what I said above is just
BS ... :-)
Please do re-read it, I'll try to simplify it a bit by Monday based on your
feedback.
On 3/8/12 11:55 AM, Dan Berindei wrote:
> Hi guys
>
> It's been a long time coming, but I finally published the non-blocking
> state transfer draft on the wiki:
>
> https://community.jboss.org/wiki/Non-blockingStateTransfer
>
> Unlike my previous state transfer design document, I think I've
> fleshed out most of the implications. Still, there are some things I
> don't have a clear solution for yet. As you would expect it's mostly
> around merging and delayed state transfer.
>
> I'm looking forward to hearing your comments/advice!
>
> Cheers
> Dan
>
> PS: Let's discuss this over the mailing list only.
>
--
Bela Ban, JGroups lead (http://www.jgroups.org)
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev