[infinispan-dev] Shared vs Non-Shared CacheStores

Thu Jul 16 09:32:11 EDT 2015

I would like to propose a clear-cut separation between our shared and
non-shared CacheStores, in every respect:
 - Configuration options
 - Integration contracts (Split the CacheStore SPI)
 - Implementations
 - Terminology, to avoid any further confusion around valid
configurations and sensible architectures

We have loads of examples of users who get into trouble by configuring
one incorrectly; there are also plenty of efficiency improvements we
could take advantage of by clearly splitting the integration points
and the implementations into two categories.

Not least, it's a very common and dangerous pitfall to assume that
Infinispan is able to restore a consistent state after having stopped
a DIST cluster which passivated into non-shared CacheStore instances,
or even a REPL cluster whose nodes don't all shut down at the exact
same time (and "exact same time" is a strange concept, to say the
least). We need to clarify the different options, their tradeoffs and
their consequences, both to users and to ourselves, as a clearly
defined use case will avoid bugs and simplify implementations.

# The purpose of each
I think that people should use a non-shared (local?) CacheStore for
the sole purpose of expanding the storage capacity of each single
node, be it because you don't have enough memory at all, or because
you prefer some extra safety margin: maybe your estimates are complex,
or maybe we live in a real world where the hashing function might not
be perfect in practice. I hope we all agree that Infinispan should be
able to handle such situations with, at worst, a graceful performance
degradation, rather than complaining by sending OOMs to the admin and
putting the service on strike.

A shared CacheStore is useful for very different purposes; primarily
to implement a cache on top of some other service: for example your
(single, shared) RDBMS, a slow (or expensive) web service your
organization has to call frequently, etc. It's also useful as a
write-through cache in front of a similar service, maybe an internal
one which is not able to handle the load spikes which Infinispan can
handle better.
Finally, a great use case is to have a consistent backup of all your
data-grid content, possibly in some "reference" form such as JPA
mapped entities.

# Benefits of a Non-Shared
A non-shared CacheStore implementor should be able to take advantage
of *its purpose*. Among the big benefits I see:
 - Exclusive usage -> locking of a specific entry can be handled at
data container level, which can simplify quite some internal code.
 - Reliability -> since a clustered node needs to wipe its state at
reboot (after a crash), such a CacheStore is much simpler to code: it
can skip any form of disk sync or persistence guarantees.
 - Encoding format -> this can be controlled entirely by Infinispan,
with no need to keep factors like rolling-upgrade-compatible encodings
in mind. JBoss Marshalling would be good enough, and some
implementations might not need to serialize at all.

Our non-shared CacheStore implementation(s) could take advantage of
lower-level, more complex code optimisations and interfaces, as users
would rarely want to customize one of these. The use case of mapping
data to a shared service, on the other hand, needs a more user-friendly
SPI to keep it simple to plug in custom stores: custom data formats,
custom connectors, and some help in implementing concurrency correctly.

Proper transaction integration for the CacheStore has been on our
wishlist for some time too. I suspect that accepting that we have been
mixing up two different things under the same name so far would make
it simpler to implement further improvements such as transactions: the
way to do such a thing is very different in each of these use cases,
so it would help at least to implement it on a subset first, or maybe
only on that subset, if it turns out there's no need for such things
in the context of the local-only, dedicated "swapfile".
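
To make the split a bit more concrete, here is a rough Java sketch of
what two separate contracts could look like. All type and method names
(`LocalStore`, `SharedStore`, `MapSharedStore`) are made up for
illustration and are not the existing Infinispan SPI; the map-backed
class merely stands in for a real external service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these types exist in Infinispan.

/** Node-private overflow store; bytes are opaque and never outlive the node. */
interface LocalStore {
    void write(Object key, byte[] value); // encoding chosen by the grid itself
    byte[] load(Object key);              // returns null when the key is absent
    boolean delete(Object key);           // true if an entry was removed
    void clear();                         // invoked at startup to wipe old state
}

/** User-facing contract for a service shared by all nodes (RDBMS, web service). */
interface SharedStore<K, V> {
    V load(K key);                        // returns null when the key is absent
    void write(K key, V value);
    boolean delete(K key);                // true if an entry was removed
}

/** Stand-in for an external service, only here to show how small a plug-in can be. */
final class MapSharedStore<K, V> implements SharedStore<K, V> {
    private final Map<K, V> backend = new ConcurrentHashMap<>();

    @Override public V load(K key) { return backend.get(key); }
    @Override public void write(K key, V value) { backend.put(key, value); }
    @Override public boolean delete(K key) { return backend.remove(key) != null; }
}
```

The point of the sketch is only the asymmetry: the local contract can
stay byte-oriented and internal, while the shared one is typed and kept
small enough for users to implement themselves.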

# Mixed types should be killed
I'm aware that some of our current implementations _could_ work both as
shared or non-shared, for example the JDBC or JPACacheStore or the
Remote CacheStore, but in most cases it doesn't make much sense. Why
would you ever want to use the JPACacheStore if not to share data with
a _shared_ database?

We should take such options away, and by doing so focus on the use
cases which actually matter and simplify the implementations and
improve the configuration validations.
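
As an illustration of how validation could get simpler once every store
has exactly one kind, here is a small Java sketch. Everything in it is
hypothetical (`StoreKind`, `StoreType`, `StoreConfigValidator`, the
`SWAP_FILE` store and the `purgeOnStartup` parameter as used here); it
is not Infinispan's configuration API, and the single rule it checks is
just my reading of the "wipe state at reboot" point above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; not Infinispan's configuration API.
enum StoreKind { LOCAL, SHARED }

enum StoreType {
    SWAP_FILE(StoreKind.LOCAL),   // assumed default local store
    JPA(StoreKind.SHARED),
    CASSANDRA(StoreKind.SHARED);

    final StoreKind kind;         // fixed per type: there is no "shared" flag to misconfigure

    StoreType(StoreKind kind) { this.kind = kind; }
}

final class StoreConfigValidator {
    /** Returns the problems found; an empty list means the configuration is accepted. */
    static List<String> validate(StoreType type, boolean clustered, boolean purgeOnStartup) {
        List<String> problems = new ArrayList<>();
        // A clustered node cannot trust what it left on local disk before a restart.
        if (type.kind == StoreKind.LOCAL && clustered && !purgeOnStartup) {
            problems.add(type + " is a local store: in a cluster it must be purged on startup");
        }
        return problems;
    }
}
```

For example, `StoreConfigValidator.validate(StoreType.SWAP_FILE, true, false)`
would report one problem, while the same call with `StoreType.JPA`
would report none.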

If ever a compelling storage technology is identified which we'd like to
offer as an option for both shared and non-shared, I would still
recommend making two different implementations, as there certainly are
different requirements and assumptions when coding such a thing.

Not least, I would very much like to see a default local CacheStore:
picking one for local "emergency swapping" should be a no-brainer for
users; we could set one up by default and not bother newcomers with
complex choices.

If we simplify the requirements of such a thing, it should be easy to
write one on standard Java NIO2 APIs and get rid of the complexities of
maintaining the native integration with things like LevelDB, not least
the inefficiency of making such native calls from Java.
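
To give an idea of how little is needed under such relaxed
requirements, here is a minimal sketch of a "swapfile" on plain
`java.nio`. The class name `SwapFileStore` and its methods are
hypothetical, and the design (one append-only file, an in-memory index,
the file truncated on open, no `force()` calls, no space reclamation)
is only one possible reading of the requirements above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical local-only store: append-only file plus an in-memory index. */
final class SwapFileStore implements AutoCloseable {
    private final FileChannel channel;
    private final Map<Object, long[]> index = new HashMap<>(); // key -> {offset, length}
    private long end = 0; // next free position in the file

    SwapFileStore(Path file) throws IOException {
        // TRUNCATE_EXISTING wipes whatever a previous run left behind.
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    synchronized void write(Object key, byte[] value) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(value);
        long offset = end;
        long pos = offset;
        while (buf.hasRemaining()) {
            pos += channel.write(buf, pos); // positional write, may be partial
        }
        end = pos;
        // Overwrites just repoint the index; old bytes are never reclaimed here.
        index.put(key, new long[] {offset, value.length});
        // Deliberately no channel.force(): durability is not a goal.
    }

    synchronized byte[] load(Object key) throws IOException {
        long[] entry = index.get(key);
        if (entry == null) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.allocate((int) entry[1]);
        long pos = entry[0];
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos);
            if (n < 0) {
                throw new IOException("unexpected end of swap file");
            }
            pos += n;
        }
        return buf.array();
    }

    synchronized boolean delete(Object key) {
        return index.remove(key) != null;
    }

    @Override
    public synchronized void close() throws IOException {
        channel.close();
    }
}
```

A real implementation would at least need compaction and a smarter
concurrency story than one big lock, but neither of those requires
native code.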

Then as a second step, we should attack the other use case: backups;
from a *purpose driven perspective* I'd then see us revive the Cassandra
integration; obviously as a shared-only option.

Cheers,
Sanne