[infinispan-dev] Design session today

Mircea Markus mmarkus at redhat.com
Fri Aug 9 08:55:16 EDT 2013


On 9 Aug 2013, at 12:06, Sanne Grinovero <sanne at infinispan.org> wrote:

> On 9 August 2013 11:35, Manik Surtani <msurtani at redhat.com> wrote:
>> We should actually move all of this to infinispan-dev - cc'ing infinispan-dev on my response.
>> 
>> On 9 Aug 2013, at 11:19, Mircea Markus <mmarkus at redhat.com> wrote:
>> 
>>> Hi,
>>> 
>>> I've been giving some thought last evening and here are some second-day thoughts:
>>> 
>>> 1. parallel processing is a great idea and I think it's really something that would make a difference compared to our competition
>> 
>> +1.  We should consider the JDK 8 collections APIs as a reference, as I mentioned.
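As a quick illustration of the JDK 8 collections style referred to above (purely a sketch of the programming model, not a proposed Infinispan API):

```java
import java.util.Arrays;
import java.util.List;

class ParallelSketch {
    // Counts words longer than 3 characters, letting the JDK parallelise
    // the traversal. The same fluent style could apply to cache iteration.
    static long countLongWords(List<String> words) {
        return words.parallelStream()
                    .filter(w -> w.length() > 3)
                    .count();
    }
}
```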
>> 
>>> 
>>> 2. using two interfaces (CacheLoader, CacheWriter) over one. I'm still not totally sold on the idea
>>> Pros: cleaner design (interface segregation principle[1]) which would allow users to only implement what they need
>>> Cons: the difference between a cache loader and a cache store (or writer) has been a source of confusion among users, as most users[2] only use the combined version
>>> I'll continue the discussion on the public list
>> 
>> Also, JSR 107 (and from that, most other data grids) will also follow a separate CacheLoader/CacheWriter.  I think people will get used to the separation of interfaces.
> 
> +1 to separate them, especially as people will get familiar with the
> JSR terminology.
> Not going to dig in the mailing lists, but I'm quite sure we already
> discussed this in the past, and AFAIR everyone was agreeing on this.

The only argument I found was that's the way the spec does it :-)
I think CacheLoader + CacheWriter is a nice OOP design, but it is rather theoretical. In all the external user scenarios I know of, the interaction with the store is read+write, so most people think about a store along these lines. Having a distinction between loads and stores seems unnatural and creates confusion. The few that only need a loader can simply leave store() empty - as simple as that. 
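To make that concrete, here's a minimal sketch of the combined approach with an empty store() - the interface and class names here are made up for illustration, not the actual SPI:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical combined contract, for illustration only.
interface CombinedCacheStore {
    Object load(Object key);
    void store(Object key, Object value);
}

// A loader-only implementor simply leaves store() as a no-op.
class ReadOnlyMapStore implements CombinedCacheStore {
    private final Map<Object, Object> backing;

    ReadOnlyMapStore(Map<Object, Object> backing) {
        this.backing = backing;
    }

    @Override
    public Object load(Object key) {
        return backing.get(key);
    }

    @Override
    public void store(Object key, Object value) {
        // intentionally empty: this store only loads
    }
}
```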

> 
>> 
>>> 
>>> 3. allowing the cache loader to expose unserialised data directly (ValueHolder.getBytes[]).
>> 
>> I used the name ValueHolder but this is a really poor term - how about ContentsProxy?  It is a proxy for the contents of the entry, exposing methods:
>> 
>> interface ContentsProxy {
>>  ByteBuffer getValueBuffer();
>>  ByteBuffer getInternalCacheValueBuffer();
>>  InternalCacheValue getInternalCacheValue();
>> 
>>  // Same as above except this method only deserializes timestamps and metadata.  Not the actual value.
>>  InternalCacheValue getSparseInternalCacheValue();
>> }
> 
> +1 for the name "ContentsProxy".

currently the CacheLoaderTask looks like:

interface CacheLoaderTask {
  boolean process(Object key, ContentsProxy p);
}

This assumes that the "key" is already deserialized.
It would be better to have:

interface CacheLoaderTask {
  boolean process(StoredEntry p);
}

with:

interface StoredEntry {
   Object getKey();
   Object getValue();
}

and then BinaryStoredEntry extends StoredEntry. 
wdyt?
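For illustration, a fleshed-out sketch of what I mean - the signatures are assumptions, open for debate:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Proposed SPI shape (sketch): entries are handed out as a whole,
// so the key doesn't have to be eagerly deserialised.
interface StoredEntry {
    Object getKey();
    Object getValue();
}

// Stores that keep entries in binary form could additionally expose
// the raw buffers, letting callers skip deserialisation entirely.
interface BinaryStoredEntry extends StoredEntry {
    ByteBuffer getKeyBuffer();
    ByteBuffer getValueBuffer();
}

// Minimal illustrative implementation backed by UTF-8 strings.
class SimpleBinaryEntry implements BinaryStoredEntry {
    private final String key, value;

    SimpleBinaryEntry(String key, String value) {
        this.key = key;
        this.value = value;
    }

    public Object getKey() { return key; }
    public Object getValue() { return value; }
    public ByteBuffer getKeyBuffer() {
        return ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8));
    }
    public ByteBuffer getValueBuffer() {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }
}
```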

> On exposing ByteBuffer .. I'm wondering if we shouldn't have our own
> interface, potentially shared with JGroups. For example, I'd like to
> see an appendToEnd capability and an insertToHead which don't require
> to eagerly resize the underlying arrays.
> 
>> 
>>> The use cases we had for this are:
>>> a) streaming data during rolling upgrades. This works for scenarios where the data format (user classes) haven't changed and the data is written directly to a persistent store in the destination cluster
>>> b) backups. This can be a generic and efficient (no serialisation) way of creating a backup tool.
>> 
>> There are two more:
>> c) Pre-populating a cache store from an external resource.
>> d) Exposing the underlying byte buffers directly for placement into, say, a native data container or directly onto the network stack for transmission (once JGroups has moved to JDK 7).
>> 
>>> I haven't thought a) through entirely, but it seems to me that it only applies to a rather specific rolling upgrade scenario.
>>> Re: b) there might be some more efficient ways of backing up data: take a database dump (jdbc cache store), copy the files (file cache store) etc. Also I'm not sure that the speed with which you take the dump is critical - i.e. even if you serialise/deserialise the data it might just work.
>> 
>> It's not just the performance hit we take on serialisation/de-serialisation, but also the additional CPU load we place on the system which should be running, performing transactions!
>> 
>>> Also in order to solve a) and b) I don't think ValueHolder.getBytes[] is the way to go. E.g. the bucket cache stores use an entire bucket as the read (and serialisation) unit, so forcing them to return the bytes on a per-entry basis would mean:
>>> - read the bucket as byte[]
>>> - deserialize the bucket structure
>>> - iterate over entries in the bucket and serialise them again in order to satisfy ValueHolder.getBytes[]
>> 
>> That's just the way buckets are currently designed.  If, for example, each bucket has a header with a structure that looks like: [key1][position1][key2][position2][end-of-keys marker][value1][value2], then just by reading the header part of the bucket, we can grab chunks based on the position information for the values without deserializing them.  Of course, this is an "efficient" implementation.  A naive one could do what you said above and still comply with the contract.
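To make the header idea above concrete, a toy sketch - the layout and names are invented for illustration, not a format proposal:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy bucket codec: a header of [keyLen][keyBytes][valuePos][valueLen]
// entries terminated by a -1 marker, followed by the raw value bytes.
// Values can then be sliced out of the bucket without deserialising them.
class ToyBucket {
    // Builds a bucket holding a single key/value pair.
    static byte[] write(String key, byte[] value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        int headerSize = 4 + k.length + 4 + 4 + 4; // keyLen + key + pos + len + end marker
        ByteBuffer buf = ByteBuffer.allocate(headerSize + value.length);
        buf.putInt(k.length).put(k)
           .putInt(headerSize)   // absolute offset of the value bytes
           .putInt(value.length)
           .putInt(-1)           // end-of-keys marker
           .put(value);
        return buf.array();
    }

    // Scans only the header, then returns a view over the value bytes.
    static ByteBuffer readValue(byte[] bucket, String key) {
        ByteBuffer buf = ByteBuffer.wrap(bucket);
        int keyLen;
        while ((keyLen = buf.getInt()) != -1) {
            byte[] k = new byte[keyLen];
            buf.get(k);
            int pos = buf.getInt(), len = buf.getInt();
            if (new String(k, StandardCharsets.UTF_8).equals(key)) {
                return ByteBuffer.wrap(bucket, pos, len);
            }
        }
        return null; // key not present in this bucket
    }
}
```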
>> 
>>> A better approach for this is to have toStream and fromStream methods similar to what we currently have in CacheStore, so that the whole marshalling/unmarshalling business is delegated to the CacheStore itself. Also, now that we're here: the CacheStore.toStream/fromStream API was added with the intention of solving the same problem some 4 years ago and is not used at all at this stage, though implemented by all the existing stores.
>> 
>> Yes, but I think we can do better than the toStream/fromStream API.
>> 
>>> Bottom line for 3:  I think this is a case in which we should stick to the "if you're not sure don't add it" rule. We can always add it later: a new interface StreamableCacheLoader to extend CacheLoder.
>> 
>> Not always true, since in this case, the API we choose may dictate storage format on disk, which in turn will become a compatibility issue when reading data written using an older version of the same cache loader.
> 
> +1 The purpose is to design the SPI contract to be future proof,

I don't fully agree on the future-proof aspect. In 2009 we added CacheStore.toStream/fromStream, which was supposed to solve some problems but has never been used (though always implemented!). Of course the fromStream/toStream methods might be useful, but making use of them has never been a priority.

Besides resolving the immediate problems, the SPI should support solving future problems (such as 3) as well. And it does, by allowing the ContentsProxy/StoredEntry to be extensible. What I want to avoid, though, is forcing people to implement SPI methods that we won't use in the foreseeable future, as was the case with CacheStore.toStream/fromStream. I don't think you'd be totally surprised if the nice memory-mapped file suggestion you came up with only gets implemented in ISPN 8 or 9, by which time the cache store SPI will have changed several times for other reasons :-) And of course if we need it in ISPN 6.1 we can simply extend the ContentsProxy/StoredEntry.

> intentionally avoiding to put constraints on how a CacheStore needs to
> implement this. By giving more flexibility, we can then run all
> experiments you want to figure out how this could be more effectively
> implemented, but still providing a stable integration contract.

If anything I want to put fewer constraints on the implementors: i.e. not to add into the SPI *at this stage* the functionality defined by BinaryStoredEntry (or wherever the binary methods end up).


> 
> The current contract is preventing experimentation with *potentially*
> more interesting CacheStore designs. I think the goal now is to
> address the limiting aspect of the SPI, but not necessarily to suggest
> how they should be implemented.

Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)






