On 9 Aug 2013, at 12:06, Sanne Grinovero <sanne(a)infinispan.org> wrote:
On 9 August 2013 11:35, Manik Surtani <msurtani(a)redhat.com>
wrote:
> We should actually move all of this to infinispan-dev - cc'ing infinispan-dev on
my response.
>
> On 9 Aug 2013, at 11:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
>
>> Hi,
>>
>> I've been giving this some thought since last evening, and here are some second-day
thoughts:
>>
>> 1. Parallel processing is a great idea, and I think it's really something that
would make a difference compared to our competition.
>
> +1. We should consider the JDK 8 collections APIs as a reference, as I mentioned.
>
>>
>> 2. Using two interfaces (CacheLoader, CacheWriter) instead of one. I'm still not
totally sold on the idea.
>> Pros: cleaner design (interface segregation principle [1]), which would allow users
to implement only what they need.
>> Cons: the difference between a cache loader and a cache store (or writer) has been a
source of confusion among users, as most users [2] only use the combined
version.
>> I'll continue the discussion on the public list.
>> I'll continue the discussion on the public list
>
> Also, JSR 107 (and, following it, most other data grids) will use separate
CacheLoader/CacheWriter interfaces. I think people will get used to the separation of interfaces.
+1 to separate them, especially as people will get familiar with the
JSR terminology.
Not going to dig through the mailing list archives, but I'm quite sure we already
discussed this in the past, and AFAIR everyone agreed on it.
The only argument I found was that it's the way the spec does it :-)
I think CacheLoader + CacheWriter is a nice OOP design, but it's rather theoretical. In all
the external user scenarios I know of, the interaction with the store is read+write, so most
people think about a store along these lines. Having a distinction between loads and
stores seems unnatural and creates confusion. The few that only need a loader can
simply leave store() empty - as simple as that.
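To make the trade-off concrete, here's a rough sketch of what the split could look like. The names loosely follow JSR 107 style, but the exact signatures here are my guess, not an agreed contract:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Read side only: loader-only users implement just this.
interface CacheLoader<K, V> {
    V load(K key);
}

// Write side only.
interface CacheWriter<K, V> {
    void write(K key, V value);
    void delete(K key);
}

// Users who think of the store as read+write simply implement both
// interfaces in one class, so the separation costs them nothing.
class MapBackedStore<K, V> implements CacheLoader<K, V>, CacheWriter<K, V> {
    private final Map<K, V> backing = new ConcurrentHashMap<>();
    public V load(K key) { return backing.get(key); }
    public void write(K key, V value) { backing.put(key, value); }
    public void delete(K key) { backing.remove(key); }
}
```
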
>
>>
>> 3. allowing the cache loader to expose unserialised data directly
(ValueHolder.getBytes[]).
>
> I used the name ValueHolder but this is a really poor term - how about ContentsProxy?
It is a proxy for the contents of the entry, exposing methods:
>
> interface ContentsProxy {
>   ByteBuffer getValueBuffer();
>   ByteBuffer getInternalCacheValueBuffer();
>   InternalCacheValue getInternalCacheValue();
>
>   // Same as above, except this method only deserializes timestamps and metadata,
>   // not the actual value.
>   InternalCacheValue getSparseInternalCacheValue();
> }
+1 for the name "ContentsProxy".
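For illustration, here's how a consumer (e.g. a backup tool, use case b) could use such a proxy to copy raw serialized bytes without ever deserializing the value. The interface below is a minimal stand-in with only the one method this example needs; the real contract is the one sketched above:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Minimal stand-in for the proposed interface (subset, illustration only).
interface ContentsProxy {
    ByteBuffer getValueBuffer();
}

// A backup-style consumer: streams the raw serialized bytes straight to a
// channel, never paying the deserialization cost.
class RawBackup {
    static void copyEntry(ContentsProxy p, WritableByteChannel out) throws IOException {
        ByteBuffer raw = p.getValueBuffer();
        while (raw.hasRemaining()) {
            out.write(raw);
        }
    }
}
```
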
currently the CacheLoaderTask looks like:

interface CacheLoaderTask {
  boolean process(Object key, ContentsProxy p);
}

This assumes that the key is already deserialized.
Would be better to have:

interface CacheLoaderTask {
  boolean process(StoredEntry e);
}

with:

interface StoredEntry {
  Object getKey();
  Object getValue();
}

and then BinaryStoredEntry extends StoredEntry.
wdyt?
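Spelled out a bit more, the shape could be the following. The return types and the meaning of the boolean are assumptions on my part:

```java
import java.nio.ByteBuffer;

interface StoredEntry {
    Object getKey();
    Object getValue();
}

// Added only for stores that can expose raw bytes cheaply:
interface BinaryStoredEntry extends StoredEntry {
    ByteBuffer getKeyBytes();
    ByteBuffer getValueBytes();
}

interface CacheLoaderTask {
    // Return false to stop the iteration early (assumption).
    boolean process(StoredEntry entry);
}

// Trivial value-holder implementation, just for demonstration.
class SimpleStoredEntry implements StoredEntry {
    private final Object key, value;
    SimpleStoredEntry(Object key, Object value) { this.key = key; this.value = value; }
    public Object getKey() { return key; }
    public Object getValue() { return value; }
}
```
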
On exposing ByteBuffer... I'm wondering if we shouldn't have our own
interface, potentially shared with JGroups. For example, I'd like to
see an appendToEnd capability and an insertToHead, which don't require
eagerly resizing the underlying arrays.
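Something like the following chunk-list buffer, purely illustrative (the class name and methods are my invention, not an agreed contract): head and tail inserts are O(1), and no backing array is copied until the caller materializes the bytes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ChunkedBuffer {
    private final Deque<byte[]> chunks = new ArrayDeque<>();
    private int length;

    // Both operations just link a chunk in; no array resize, no copy.
    void appendToEnd(byte[] data) { chunks.addLast(data); length += data.length; }
    void insertToHead(byte[] data) { chunks.addFirst(data); length += data.length; }
    int length() { return length; }

    // Only here do we pay for one allocation and a single copy pass.
    byte[] toBytes() {
        byte[] out = new byte[length];
        int pos = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }
}
```
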
>
>> The use cases we had for this are:
>> a) streaming data during rolling upgrades. This works for scenarios where the
data format (user classes) hasn't changed and the data is written directly to a
persistent store in the destination cluster
>> b) backups. This can be a generic and efficient (no serialisation) way of
creating a backup tool.
>
> There are two more:
> c) Pre-populating a cache store from an external resource.
> d) Exposing the underlying byte buffers directly for placement into, say, a native
data container or directly onto the network stack for transmission (once JGroups has moved
to JDK 7).
>
>> I haven't thought a) through entirely, but it seems to me that it only applies to a
rather specific rolling upgrade scenario.
>> Re: b) there might be more efficient ways of backing up data: take a
database dump (JDBC cache store), copy the files (file cache store), etc. Also, I'm not
sure that the speed with which you take the dump is critical - i.e. even if you
serialise/deserialize the data it might just work.
>
> It's not just the performance hit we take on serialisation/de-serialisation, but
also the additional CPU load we place on the system, which should be busy performing
transactions!
>
>> Also, in order to solve a) and b) I don't think ValueHolder.getBytes[] is the
way to go. E.g. the bucket cache stores use an entire bucket as their read (and
serialisation) unit, so forcing them to return the bytes on a per-entry basis would mean:
>> - read the bucket as byte[]
>> - deserialize the bucket structure
>> - iterate over the entries in the bucket and serialise them again in order to satisfy
ValueHolder.getBytes[]
>
> That's just the way buckets are currently designed. If, for example, each bucket
has a header with a structure that looks like:
[key1][position1][key2][position2][end-of-keys marker][value1][value2], then just by
reading the header part of the bucket, we can grab chunks based on the position
information for the values without deserializing them. Of course, this is an
"efficient" implementation. A naive one could do what you said above and still
comply with the contract.
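The header idea can be sketched as follows. The layout here is hypothetical (keys and value lengths up front, raw value bytes at the end), but it shows how reading only the header lets us slice out one value's bytes without deserializing any values:

```java
import java.io.*;
import java.util.*;

class BucketCodec {
    // Hypothetical layout: [count][key,valueLength]* then the raw value bytes.
    static byte[] encode(LinkedHashMap<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(entries.size());
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().length);
        }
        for (byte[] v : entries.values()) out.write(v);
        return bos.toByteArray();
    }

    // Parse only the header, then return one value as a raw byte slice.
    static byte[] readValue(byte[] bucket, String wantedKey) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bucket));
        int count = in.readInt();
        String[] keys = new String[count];
        int[] lens = new int[count];
        int total = 0;
        for (int i = 0; i < count; i++) {
            keys[i] = in.readUTF();
            lens[i] = in.readInt();
            total += lens[i];
        }
        // Values sit at the tail of the bucket, so their start is easy to locate.
        int offset = bucket.length - total;
        for (int i = 0; i < count; i++) {
            if (keys[i].equals(wantedKey))
                return Arrays.copyOfRange(bucket, offset, offset + lens[i]);
            offset += lens[i];
        }
        return null; // key not in this bucket
    }
}
```
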
>
>> A better approach for this is to have toStream and fromStream methods similar to
what we currently have in CacheStore, so that the whole marshalling/unmarshalling business is
delegated to the CacheStore itself. Also, now that we're here: the
CacheStore.toStream/fromStream API was added with the intention of solving the same
problem some 4 years ago, and it is not used at all at this stage, though implemented by all
the existing stores.
>
> Yes, but I think we can do better than the toStream/fromStream API.
>
>> Bottom line for 3: I think this is a case in which we should stick to the
"if you're not sure, don't add it" rule. We can always add it later: a
new interface StreamableCacheLoader extending CacheLoader.
>
> Not always true: in this case, the API we choose may dictate the storage format on
disk, which in turn becomes a compatibility issue when reading data written by an
older version of the same cache loader.
+1 The purpose is to design the SPI contract to be future-proof,
I don't fully agree on the future-proof aspect. In 2009 we added
CacheStore.toStream/fromStream, which was supposed to solve some problems but has never
been used (though always implemented!). And of course the fromStream/toStream methods might
be useful, but making use of them has never been a priority.
Besides resolving the immediate problems, the SPI should support solving future problems
(such as 3) as well. And it does, by allowing the ContentsProxy/StoredEntry to be extensible.
What I want to avoid, though, is forcing people to implement SPI methods that we won't
use in the foreseeable future, as was the case with CacheStore.toStream/fromStream. I
don't think you'd be totally surprised if the nice memory-mapped file suggestion
you came up with only gets implemented in ISPN 8 or 9, by which time the cache store
SPI will have changed several times for other reasons :-) And of course, if we need
it in ISPN 6.1 we can simply extend the ContentsProxy/StoredEntry.
intentionally avoiding putting constraints on how a CacheStore needs to
implement this. By giving more flexibility, we can then run all the
experiments you want to figure out how this could be implemented more
effectively, while still providing a stable integration contract.
If anything, I want to put fewer constraints on the implementors: i.e. not to add
to the SPI *at this stage* the functionality defined by BinaryStoredEntry (or
wherever the binary methods end up).
The current contract prevents experimentation with *potentially*
more interesting CacheStore designs. I think the goal now is to
address the limiting aspects of the SPI, but not necessarily to dictate
how they should be implemented.
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)