[
https://issues.jboss.org/browse/ISPN-1239?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-1239:
------------------------------------
The workaround caused some failures in ConcurrentOverlappingLeaveTest:
1. We were clearing the data container before stopping the cache.
With the workaround, I was clearing all the caches before stopping them, so the time
interval between the clear and the stop was longer and a cache with an empty data
container could have time to participate in a rehash.
I removed the data container clearing stage as it's not really necessary: we only
need to clear the cache between test methods if we don't stop it completely.
2. The time interval between two nodes leaving got a lot smaller, so it was more
likely that the second leaver would start to push something for the first leaver's
rehash but never get to send it.
Let's say the initial cluster members are {A, B, C, D}, numOwners = 3, and D and C
leave in quick succession.
With initial owners(k) = {B, C, D}, when D leaves, B expects C to push the key to A.
But if C dies before pushing it, B doesn't push the key to A on the following rehash.
The solution is to remember the last CH for which rehashing completed successfully and
base every rehash on the last successful CH.
3. When a rehash was interrupted by another view coming in, we would allow waiting
transactions to do some work before starting the next rehash.
This could lead to a deadlock if the transaction needed to replicate synchronously to
another node that is waiting for us to finish the rehash.
Instead, the rehash task should leave the transactions blocked if it was interrupted by
another view, because it knows there is another rehash pending.
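To make the scenario in point 2 concrete, here is a minimal sketch of why basing the rehash on the last successfully completed CH helps. It is not Infinispan code: the fixed ring order, the owners() and pushes() helpers, and the rule that the last surviving old owner acts as pusher are all assumptions chosen so that the toy model reproduces the {A, B, C, D}, numOwners = 3 example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class LastSuccessfulChSketch {
    // Assumption: a fixed ring order stands in for a real consistent hash.
    static final List<String> RING = List.of("A", "B", "C", "D");
    static final int NUM_OWNERS = 3;
    // Assumption: key k hashes to B's position, so owners(k) = {B, C, D} initially.
    static final int KEY_START = 1;

    // Owners are the first NUM_OWNERS live members found walking the ring from the key.
    static List<String> owners(Set<String> members) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < RING.size() && result.size() < NUM_OWNERS; i++) {
            String node = RING.get((KEY_START + i) % RING.size());
            if (members.contains(node)) {
                result.add(node);
            }
        }
        return result;
    }

    // Computes "pusher->target" pairs for a rehash from baseView to newView.
    // Assumption: the last old owner still present in newView does the pushing.
    static List<String> pushes(Set<String> baseView, Set<String> newView) {
        List<String> oldOwners = owners(baseView);
        List<String> newOwners = owners(newView);
        String pusher = null;
        for (String owner : oldOwners) {
            if (newView.contains(owner)) {
                pusher = owner;
            }
        }
        List<String> result = new ArrayList<>();
        if (pusher == null) {
            return result; // every old owner is gone, nothing can be pushed
        }
        for (String owner : newOwners) {
            if (!oldOwners.contains(owner)) {
                result.add(pusher + "->" + owner);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> abcd = Set.of("A", "B", "C", "D");
        Set<String> abc = Set.of("A", "B", "C");
        Set<String> ab = Set.of("A", "B");
        // D leaves: C is expected to push the key to A.
        System.out.println(pushes(abcd, abc)); // [C->A]
        // C dies before pushing; rehash based on the interrupted CH pushes nothing.
        System.out.println(pushes(abc, ab));   // []
        // Rehash based on the last successful CH makes B push the key to A.
        System.out.println(pushes(abcd, ab));  // [B->A]
    }
}
```

In this toy model the second call shows the bug: the interrupted CH already lists A as an owner, so nobody pushes to it. The third call uses the last completed CH as the base and recovers the missing push.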
Graceful shutdown should be supported
-------------------------------------
Key: ISPN-1239
URL: https://issues.jboss.org/browse/ISPN-1239
Project: Infinispan
Issue Type: Feature Request
Components: Distributed Cache
Affects Versions: 5.0.0.FINAL
Reporter: Manik Surtani
Assignee: Dan Berindei
Priority: Critical
Labels: clean_shutdown, rehashing
Fix For: 5.1.0.BETA1, 5.1.0.FINAL
Currently, killing any node will result in a rehash. A mechanism for clean shutdown
should also be supported, so that a rehash is *not* triggered. Useful when the entire
cluster is being intentionally brought down.
We need to think about how to do this; perhaps a LEAVE message that prevents nodes
from triggering a rehash when a subsequent view change is detected. This could be done
programmatically via a {{clean}} parameter to {{stop()}}, but we should explore
alternatives here.
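As a rough illustration of the LEAVE message idea, the sketch below records which nodes announced a clean departure and skips the rehash when a view change only removes those nodes. Everything here is hypothetical: CleanLeaveTracker, onLeaveMessage() and shouldRehash() are invented names, not part of Infinispan or JGroups, and a real design would also have to handle message ordering and lost LEAVE messages.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not an existing Infinispan class.
class CleanLeaveTracker {
    // Nodes that sent a LEAVE message and have not yet dropped out of the view.
    private final Set<String> announcedLeavers = new HashSet<>();

    // Called when a node announces that it is shutting down cleanly.
    void onLeaveMessage(String node) {
        announcedLeavers.add(node);
    }

    // Decides whether a view change should trigger a rehash.
    boolean shouldRehash(Set<String> oldView, Set<String> newView) {
        Set<String> departed = new HashSet<>(oldView);
        departed.removeAll(newView);
        // A joiner always needs state, so it always triggers a rehash.
        boolean someoneJoined = !oldView.containsAll(newView);
        // A node that vanished without a LEAVE message is treated as a crash.
        boolean unannouncedLeave = !announcedLeavers.containsAll(departed);
        // Forget announcements for nodes that are now gone.
        announcedLeavers.removeAll(departed);
        return someoneJoined || unannouncedLeave;
    }
}
```

With this sketch, calling onLeaveMessage("D") before the view changes from {A, B, C, D} to {A, B, C} suppresses the rehash, while C later dropping out without an announcement still triggers one.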