On Tue, Sep 25, 2012 at 10:53 AM, Manik Surtani <manik(a)jboss.org> wrote:
On 24 Sep 2012, at 16:22, Dan Berindei <dan.berindei(a)gmail.com> wrote:
Hi guys
During the final push for NBST I found a bug with preloading (entries that
didn't belong on a joiner weren't removed after the initial state
transfer). I decided to fix it and
https://issues.jboss.org/browse/ISPN-1586 at the same time, since it was
a longstanding bug and I had a reasonable idea of what to do. However, I
missed some implications and I need to fix them - there is at least one
Query test failing because of my change (SharedCacheLoaderQueryIndexTest).
In 5.1, preloading worked like this:
1. Start the CacheLoaderManager, which preloads everything from the cache
store in memory.
2. Start the StateTransferManager, retrieving data from the other cache
members and overwriting already-preloaded values.
3. When the initial state transfer ends, entries not owned by the local
node are deleted.
The main issue with this, raised in ISPN-1586, is that entries that were
deleted on the other cache members are "revived" on the joiner when it
reads the data from the cache store. There is another performance issue,
because we load a lot of data that we then discard, but that's less
important.
With the ISPN-1586 fix, preloading should work like this:
1. Start the StateTransferManager, receive initial CH.
2. If the local node is not the first to start up, fetching state (either
in-memory or persistent) is enabled, and the cache store is non-shared,
clear the cache store.
3. Start the CacheLoaderManager, which preloads the cache store in memory
- but only if the local node is the first one to have started the cache OR
if fetching state is disabled.
4. Run the initial state transfer, retrieving data from the other cache
members (if any, and if fetching state is enabled).
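To make the conditions in steps 2 and 3 explicit, here is a minimal sketch
of the two decisions as plain boolean functions. The class and method names
are hypothetical and are not Infinispan APIs; the flags simply mirror the
wording of the steps above.

```java
// Hypothetical helper, not part of Infinispan: it only encodes the
// conditions described in steps 2 and 3 of the proposed startup order.
final class PreloadDecision {
    private PreloadDecision() {}

    // Step 2: clear the local store only on a joiner that will fetch state
    // and whose cache store is non-shared.
    static boolean shouldClearStore(boolean firstToStart,
                                    boolean fetchStateEnabled,
                                    boolean sharedStore) {
        return !firstToStart && fetchStateEnabled && !sharedStore;
    }

    // Step 3: preload only on the first node to start the cache, or when
    // fetching state is disabled.
    static boolean shouldPreload(boolean firstToStart,
                                 boolean fetchStateEnabled) {
        return firstToStart || !fetchStateEnabled;
    }
}
```

Note that with a non-shared store and fetching enabled, a joiner both clears
its store and skips the preload, which is what leads to the data loss
described next.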
This solves ISPN-1586, but it does mean that data from non-shared cache
stores will be lost on all the nodes except the first that starts up. So if
the last node to shut down is not the first node to start back up, the
cluster will lose data.
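The restart-order problem can be illustrated with a toy simulation. This is
only a sketch under simplifying assumptions: each non-shared store is
modelled as a set of keys, every node ends up with the full data set after
state transfer, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model, not Infinispan code: two nodes with non-shared stores.
final class RestartOrderDemo {
    private RestartOrderDemo() {}

    // Applies the proposed startup order: the first node preloads its own
    // store; every joiner clears its store and receives the cluster state.
    static Set<String> restart(List<Set<String>> storesInStartOrder) {
        Set<String> cluster = new HashSet<>(storesInStartOrder.get(0));
        for (int i = 1; i < storesInStartOrder.size(); i++) {
            Set<String> joinerStore = storesInStartOrder.get(i);
            joinerStore.clear();          // step 2: clear non-shared store
            joinerStore.addAll(cluster);  // step 4: initial state transfer
        }
        return cluster;
    }

    // Node A stopped early and only has "x"; node B stopped last and has
    // both "x" and "y". Returns the keys available after the restart.
    static Set<String> demo(boolean lastToStopStartsFirst) {
        Set<String> storeA = new HashSet<>(Arrays.asList("x"));
        Set<String> storeB = new HashSet<>(Arrays.asList("x", "y"));
        List<Set<String>> order = new ArrayList<>();
        if (lastToStopStartsFirst) {
            order.add(storeB);
            order.add(storeA);
        } else {
            order.add(storeA);
            order.add(storeB);
        }
        return restart(order);
    }
}
```

In this model, `demo(true)` keeps both keys, while `demo(false)` loses "y"
because B clears its store before joining.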
These are the alternatives I'm considering:
a) Finish the ISPN-1586 fix and clearly document that non-shared cache
stores don't guarantee persistence after cluster restart (unless the last
cache to stop is the first to start back up and shutdown was spaced out to
allow state transfer to move everything to the last node).
b) Revert my ISPN-1586 fix and allow "zombie" cache entries on the joiners
(leaving ISPN-1586 open).
Maybe another approach could be:
1. Start the STM, retrieve initial CH
2. If the local node… (as above) … is non-shared, *don't clear it*, but
mark the node so preloading is *deferred*.
3. Start the CLM … skip preload if we mark it as deferred, in step 2.
4. Run initial state transfer. This will write newer versions of entries
to the cache store if needed.
5. Now, if preloading has been deferred in step 2, start a preload, if
we're configured to do any preloading.
This should give us consistency.
Nope, this doesn't solve ISPN-1586: if the already-running members have
deleted a key, the deferred preload on the joiner can still load that key
from its cache store. In fact, the preload doesn't even matter here: just
the fact that the key is still in the cache store means that the node can
still return a non-null value for a deleted key.
This is why I added the clear step in my algorithm: to avoid resurrecting
removed keys without receiving any tombstones through state transfer.
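A small simulation of the resurrection problem, as a sketch only: the maps
stand in for the joiner's memory and its non-shared cache store, state
transfer is assumed to carry live entries and no tombstones, and all names
are hypothetical rather than Infinispan APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, not Infinispan code: shows why a stale store entry can
// resurface when the store is not cleared before state transfer.
final class ZombieKeyDemo {
    private ZombieKeyDemo() {}

    // A read checks memory first, then falls through to the cache store.
    static String read(Map<String, String> memory,
                       Map<String, String> store,
                       String key) {
        String value = memory.get(key);
        return value != null ? value : store.get(key);
    }

    // Key "k" was removed on the running members while the joiner was
    // down. Returns what the joiner reports for "k" after it joins.
    static String joinAndRead(boolean clearStoreBeforeTransfer) {
        Map<String, String> clusterState = new HashMap<>();
        clusterState.put("j", "v2");          // "k" is gone, "j" is live

        Map<String, String> joinerStore = new HashMap<>();
        joinerStore.put("k", "stale");        // left over from before
        joinerStore.put("j", "v1");

        if (clearStoreBeforeTransfer) {
            joinerStore.clear();
        }

        // State transfer delivers live entries only, no tombstones.
        Map<String, String> joinerMemory = new HashMap<>(clusterState);
        joinerStore.putAll(clusterState);

        return read(joinerMemory, joinerStore, "k");
    }
}
```

Without the clear, `joinAndRead(false)` returns the stale value even though
no preload ran; with the clear, `joinAndRead(true)` returns null.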
I think there may be a third option:
c) Make preload a JMX operation and allow the user to run a cluster-wide
preload once all the nodes in the cluster have started up. But this looks a
little complicated, and it would require either versioning or prohibiting
external cache writes until the cluster-wide preload is done to ensure
consistency.
I'm not sure how having this as a JMX option helps. Having versioning,
etc. solves the problem even with an automatic preload.
Agreed, just having this as an option in JMX doesn't fix anything. But
having it as a manual operation would allow us to assume (and document it
this way) that the admin only exposes the cluster to the clients after
preloading is done - so we'd have no concurrent changes to worry about.
What do you guys think? Sanne, I'm particularly interested in how you think
option a) would fit with the query module.
Cheers
Dan