[infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei dan.berindei at gmail.com
Thu Mar 22 10:00:39 EDT 2012


On Thu, Mar 22, 2012 at 11:26 AM, Bela Ban <bban at redhat.com> wrote:
> Thanks Dan,
>
> here are some comments / questions:
>
> "2. Each cache member receives the ownership information from the
> coordinator and starts rejecting all commands started in the old cache view"
>
> How do you know a command was started in the old cache view; does this
> mean you're shipping a cache view ID with every request?
>

Yes, the plan is to ship the cache view id with every command. I
mentioned in the "Command Execution During State Transfer" section
that we set the view id in the command; I should make it clear that we
also ship it to the remote nodes.
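
To make this more concrete, here's a rough sketch of the check on the
receiving side (all class and method names are made up for
illustration, this is not the actual Infinispan API):

public final class ViewIdCheck {

    /** Thrown back to the originator, which then retries in the new view. */
    public static class OutdatedViewException extends RuntimeException {
        public OutdatedViewException(String msg) { super(msg); }
    }

    private volatile int installedViewId;

    /** Called when this node installs a new cache view. */
    public void viewInstalled(int viewId) {
        this.installedViewId = viewId;
    }

    /** Called on the receiving node before executing a shipped command. */
    public void checkBeforeExecute(int commandViewId) {
        if (commandViewId < installedViewId) {
            // The command was started in an old cache view: reject it so
            // that the originator can retry against the new owners.
            throw new OutdatedViewException("command view " + commandViewId
                    + " < installed view " + installedViewId);
        }
    }
}

The originator would catch the exception and retry the command against
the owners in the new cache view.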

>
>
> "2.1. Commands with the new cache view id will be blocked until we have
> installed the new CH and we have received lock information from the
> previous owners"
>
> Doesn't this make this design *blocking* again? Or do you queue
> requests with the new view-ID, return immediately and apply them when
> the new view-id is installed? If the latter is the case, what do you
> return? An OK (how do you know the request will apply OK)?
>

You're right, it will be blocking. However, the in-flight transaction
data should be much smaller than the entire state, so I'm hoping that
this blocking phase won't be as painful as it is now.
Furthermore, commands waiting for a lock or for a sync RPC won't
block state transfer anymore, so deadlocks between writes and state
transfer (which currently end with the write command timing out)
should be impossible.

I am not planning for any queuing at this point - we could only queue
safely if the writes were applied in the same order on all the nodes;
otherwise we would get inconsistencies.
For async commands we will have an implicit queue in the JGroups
thread pool, but we won't have anything for sync commands.
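
As a minimal sketch of the blocking part, assuming the new CH and the
lock information arrive as two separate events (again, all names are
invented):

import java.util.concurrent.CountDownLatch;

public final class NewViewGate {

    // Two events must happen before commands tagged with the new view id
    // are allowed to run: the new CH is installed, and the lock
    // information has been received from the previous owners.
    private final CountDownLatch newViewReady = new CountDownLatch(2);

    public void consistentHashInstalled() { newViewReady.countDown(); }

    public void lockInfoReceived() { newViewReady.countDown(); }

    /** Called for commands carrying the new cache view id. */
    public void awaitNewView() throws InterruptedException {
        // Only commands from the new view wait here; commands from the
        // old view were already rejected, so state transfer itself never
        // blocks on a user command at this point.
        newViewReady.await();
    }
}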

>
>
> "A merge can be coalesced with multiple state transfers, one running in
> each partition. So in the general case a coalesced state transfer
> contain a tree of cache view changes."
>
> Hmm, this can make a state transfer message quite large. Are we trimming
> the modification list? E.g. if we have 10 PUTs on K, 1 removal of K, and
> another 4 PUTs, do we just send the *last* PUT, or do we send a
> modification list of 15?
>

Yep, my idea was to keep, for each cache view id, only a set of the
keys that have been modified. A second put to the same key wouldn't
modify the set.
When we transfer the entries to the new owners, we iterate the data
container and only send each key once.
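
Something along these lines, with invented names - the point being
that a second write to the same key in the same view is a no-op for
the tracking:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class ModifiedKeysTracker<K> {

    private final Map<Integer, Set<K>> modifiedByViewId = new ConcurrentHashMap<>();

    /** Called on every write; duplicate keys collapse into the set. */
    public void recordWrite(int viewId, K key) {
        modifiedByViewId
                .computeIfAbsent(viewId, v -> ConcurrentHashMap.newKeySet())
                .add(key);
    }

    /** Used when pushing state; each key appears at most once. */
    public Set<K> keysModifiedIn(int viewId) {
        return modifiedByViewId.getOrDefault(viewId, Set.of());
    }
}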

Before writing the key to this set (and the entry to the data
container), the primary owner will synchronously invalidate the key
not only on the L1 requestors, but also on all the owners in the
previous cache views that are no longer owners in the latest view.
This will ensure that each new owner receives only one copy of each
key and won't have to buffer and sort the state received from all the
previous owners.
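
The write path on the primary owner could then look roughly like this
(again a sketch with made-up names, and a plain Map standing in for
the data container - not the real code):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class PrimaryOwnerWritePath<K, V> {

    interface SyncRpc { void invalidateAndWait(Set<String> targets, Object key); }

    private final SyncRpc rpc;
    private final Set<K> modifiedInCurrentView = new HashSet<>();
    private final Map<K, V> dataContainer;

    PrimaryOwnerWritePath(SyncRpc rpc, Map<K, V> dataContainer) {
        this.rpc = rpc;
        this.dataContainer = dataContainer;
    }

    public synchronized void write(K key, V value,
                                   Set<String> l1Requestors,
                                   Set<String> staleOwners) {
        Set<String> targets = new HashSet<>(l1Requestors);
        targets.addAll(staleOwners);
        // Block until everyone has dropped the key: this guarantees the
        // new owner only ever receives one copy of it.
        rpc.invalidateAndWait(targets, key);
        modifiedInCurrentView.add(key); // duplicates collapse into the set
        dataContainer.put(key, value);
    }
}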

>
>
> "Get commands can write to the data container as well when L1 is
> enabled, so we can't block just write commands."
>
> Another downside of the L1 cache being part of the regular cache. IMO it
> would be much better to separate the 2, as I wrote in previous emails
> yesterday.
>

I haven't read those yet, but I'm not sure moving the L1 cache to
another container would eliminate the problem completely.
It's probably not clear from the document, but if we empty the L1 on
cache view changes, then we have to ensure that we don't write
anything from an old owner to L1. Otherwise we're left with a value
in L1 that none of the new owners knows about, so it will never be
invalidated on a put to that key.
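
Roughly, the guard I have in mind would be (made-up names again):

public final class L1WriteGuard {

    private volatile int installedViewId;

    public void viewInstalled(int viewId) {
        this.installedViewId = viewId;
    }

    /** Called before caching a remotely fetched value in L1. */
    public boolean mayWriteToL1(int responseViewId) {
        // A value fetched from an old owner may already be stale, and
        // none of the new owners would invalidate it on a later put, so
        // we have to drop it.
        return responseViewId >= installedViewId;
    }
}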

>
> "The new owners will start receiving commands before they have received
> all the state
>     * In order to handle those commands, the new owners will have to
> get the values from the old owners
>     * We know that the owners in the later views have a newer version
> of the entry (if they have it at all). So we need to go back on the
> cache views tree and ask all the nodes on one level at the same time -
> if we don't get any certain answer we go to the next level and repeat."
>
>
> How does a member C know that it *will* receive any state at all? E.g.
> if we had key K on A and B, and now B crashed, then A would push a copy
> of K to C.
>
> So when C receives a request R from another member E, but hasn't
> received a state transfer from A yet, how does it know whether to apply
> or queue R? Does C wait until it gets an END-OF-ST message from the
> coordinator?
>

Yep, each node will signal to the coordinator that it has finished
pushing data, and when the coordinator gets the confirmation from all
the nodes it will broadcast the end-of-ST message to everyone.
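
On the coordinator side this could look something like the sketch
below (hypothetical types - the real confirmations and the END-OF-ST
message would of course go through our RPC layer):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class StateTransferCoordinator {

    interface Broadcaster { void broadcastEndOfStateTransfer(int viewId); }

    private final Set<String> pendingPushers = ConcurrentHashMap.newKeySet();
    private final Broadcaster broadcaster;
    private final int viewId;

    StateTransferCoordinator(Set<String> members, Broadcaster broadcaster,
                             int viewId) {
        this.pendingPushers.addAll(members);
        this.broadcaster = broadcaster;
        this.viewId = viewId;
    }

    /** Invoked when a member reports that it has finished pushing state. */
    public void pushCompleted(String member) {
        pendingPushers.remove(member);
        if (pendingPushers.isEmpty()) {
            // Every old owner has pushed its data, so tell everyone the
            // state transfer is done.
            broadcaster.broadcastEndOfStateTransfer(viewId);
        }
    }
}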

>
> (skipping the rest due to exhaustion by complexity :-))
>
>
> Over the last couple of days, I've exchanged a couple of emails with the
> Cloud-TM guys, and I'm more and more convinced that their total order
> solution is the simpler approach to (1) transactional updates and (2)
> state transfer. They don't have a solution for (2) yet, but I believe
> this can be done as another totally ordered transaction, applied in the
> correct location within the update stream. Or, we could possibly use a
> flush: as we don't need to wait for pending TXs to complete and release
> their locks, this should be quite fast.
>

Sounds intriguing, but I'm not sure how making the entire state
transfer a single transaction would allow us to handle transactions
while that state is being transferred.
Blocking state transfer works with 2PC already. Well, at least as long
as we don't have any merges...

>
> So my 5 cents:
>
> #1 We should focus on the total order approach, and get rid of the 2PC
> and locking business for transactional updates
>

The biggest problem I remember total order having is TM transactions
that have other participants (as opposed to cache-only transactions).
I haven't followed the TO discussion on the mailing list very closely;
does that work now?

Regarding state transfer in particular, remember that non-blocking
state transfer with 2PC also sounded very easy, before we got into
the details.
The proof-of-concept you had running on the 4.2 branch was much
simpler than what we have now, and even what we have now doesn't cover
everything.

> #2 Really focus on the eventual consistency approach
>

Again, it's very tempting, but I fear that much of what makes eventual
consistency tempting is that we don't know enough about its
implementation details yet.
Manik has been investigating eventual consistency for a while, I
wonder what he thinks...


Cheers
Dan


