On Thu, Mar 15, 2012 at 10:09 AM, Bela Ban <bban(a)redhat.com> wrote:
Sorry for the delay, been busy ...
No worries Bela, I'm late with my rewrite of the NBST document as well...
Comments inline.
On 3/9/12 7:20 PM, Dan Berindei wrote:
> On Mar 9, 2012 4:19 PM, "Bela Ban"<bban(a)redhat.com> wrote:
>
>> My understanding (based on my changed in 4.2) is that state transfer
>> moves/deletes keys based on the diff between 2 subsequent views:
>> - Each node checks all of the affected keys
>> - If a key should be stored in additional nodes, the key is pushed there
>> - If a key shouldn't be stored locally anymore, it is removed
>>
>
>
> That's fine if we block all writes during state transfer, but once we start
> allowing writes during state transfer we need to log all changes and send
> them to the new owners at the end (the approach in 4.2 without your
> changes) or redirect all commands to the new owners.
OK
> In addition to that, we have to either block all commands on the new owners
> until they receive the entire state or to forward get commands to the old
> owners as well. The two options apply for lock commands as well.
Why do we have to lock at all if we use queueing of requests during a
state transfer ?
I'm not sure what you mean by queuing of requests...
The current approach:
* If the cache is sync, the old owners will throw a
StateTransferInProgressException when a state transfer is running and
the originator will then wait for state transfer to end before sending
the command again to the new owners. I'm
* If the cache is async, we block write commands on the target, and
unblock them if after the state transfer finished. I don't think we do
anything to make sure all the new owners get all the commands that
were targeted to the old owners, so we can lose consistency between
owners.
The new design will always reject commands with the old cache view id,
on any owner. If the cache view id is ok but we don't have the new CH
or the lock information yet, the command will block on the target.
* If the cache is sync, we can use the same approach as before and
tell the originator to retry the command on the new owners.
* If the cache is async, instead of doing nothing like before I was
thinking that the target itself could forward the commands (especially
write commands) to the new owners.
We didn't discuss this in London, I added it afterwards as I realized
that without forwarding we'll be losing data. I didn't realize at the
time, however, that the current design has the same problem.
I think I see a problem with my request forwarding plan: the requests
will have a different source, so JGroups will not ensure an ordering
between them and the originator's subsequent requests. This means
we'll have inconsistencies anyway, so perhaps it would be better if we
stuck to the current design's limitations and remove the requirement
for old targets to forward commands to new ones.
> I'm not trying to make merges more complicated on purpose :)
I didn't imply that; but I thought the London design was pretty simple,
and I'm trying to figure out why we have such a big (maybe perceived)
diff between London and what's in the document.
True, there are a many complications that were not apparent during our
London discussion.
>> Also, why do we need to transfer ownership information ? Can't ownership
>> be calculated purely on local information ?
>>
>
>
> The current ownership information can be calculated based solely on the
> members list. But the ownership in the previous cache view(s) can not be
> computed by joiners based only on their information, so it has to be
> broadcasted by the coordinator.
OK
> One particularly nasty problem with the existing, blocking, state transfer
> is that before iterating the data container we need to wait for all the
> pending commands to finish.
Can't we queue the state transfer requests *and* the regular requests ?
When ST is done, we apply the ST requests first, then the queued regular
requests.
That was basically what we did in the blocking design: the ST commands
could execute during ST, but regular commands would block until the
end of the ST. With async caches, that meant we would use JGroups' 1
queue per sender (so not a global queue, but close).
The problem was not with the regular commands that arrived after the
start of the ST, but with the commands that had already started
executing when ST started. This is the classic example:
1. A prepare command for Tx1 locks k1 on node A
2. A prepare command for Tx2 tries to acquire lock k1 on node A
3. State transfer starts up and blocks all write commands
4. The Tx1 commit command, which will unlock k1, arrives but can't run
until state transfer has ended
5. The Tx2 prepare command times out on the lock acquisition after 10
seconds (by default)
6. State transfer can can now proceed and push or receive data.
7. The Tx1 commit can now run and unlock k1. It's too late for Tx2, however.
The solution I had in mind for the old design was to add some kind of
deadlock detection to the LockManager and throw a
StateTransferInProgress when a deadlock with the state transfer is
detected.
With the new design I thought it would be simpler to not acquire a big
lock for the entire duration of the write command that would prevent
state transfer. Instead I would acquire different locks for much
shorter amounts of time, and at the beginning of each lock acquisition
we would just check that the command's view id is still the correct
one.
I guess I may have been wrong on the "simpler" assertion :)