[infinispan-dev] Design session today

Manik Surtani msurtani at redhat.com
Fri Aug 9 06:35:49 EDT 2013


We should actually move all of this to infinispan-dev - cc'ing infinispan-dev on my response.

On 9 Aug 2013, at 11:19, Mircea Markus <mmarkus at redhat.com> wrote:

> Hi,
> 
> I've been giving this some thought since last evening, and here are some second-day thoughts:
> 
> 1. parallel processing is a great idea and I think it's really something that would set us apart from our competition

+1.  We should consider the JDK 8 collections APIs as a reference, as I mentioned.
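For a flavour of the JDK 8 shape we'd be mirroring, here is a minimal, self-contained sketch (the map contents and class name are illustrative, not Infinispan API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ParallelProcessDemo {
    // Builds a small map and sums its values with a parallel stream.
    static int demoSum() {
        Map<String, Integer> cache = new ConcurrentHashMap<>();
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3);
        // In JDK 8 the collection itself drives the (possibly parallel)
        // traversal; a cache-level process() could mirror this shape.
        return cache.entrySet().parallelStream()
                    .mapToInt(Map.Entry::getValue)
                    .sum();
    }

    public static void main(String[] args) {
        System.out.println(demoSum()); // prints 6
    }
}
```

The appeal is that the caller expresses *what* to compute and the collection decides *how* to partition the work.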

> 
> 2. using two interfaces (CacheLoader, CacheWriter) over one. I'm still not totally sold on the idea
> Pros: cleaner design (interface segregation principle[1]) which would allow users to only implement what they need
> Cons: the difference between cache loader and cache store (or writer) has been a source of confusion among users, as most users[2] only use the combined version
> I'll continue the discussion on the public list

Also, JSR 107 (and, following it, most other data grids) uses separate CacheLoader/CacheWriter interfaces.  I think people will get used to the separation.
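To make the split concrete, here is a hedged sketch of the two-interface shape in the JSR 107 spirit - the interface and method names below are illustrative, not the final SPI:

```java
import java.util.HashMap;
import java.util.Map;

// Read side only: a store that cannot be written to implements just this.
interface Loader<K, V> {
    V load(K key);
}

// Write side only.
interface Writer<K, V> {
    void write(K key, V value);
    void delete(K key);
}

// A store that both reads and writes simply implements both interfaces,
// recovering the old combined behaviour without a combined contract.
public class MapBackedStore<K, V> implements Loader<K, V>, Writer<K, V> {
    private final Map<K, V> backing = new HashMap<>();

    public V load(K key) { return backing.get(key); }
    public void write(K key, V value) { backing.put(key, value); }
    public void delete(K key) { backing.remove(key); }

    public static void main(String[] args) {
        MapBackedStore<String, String> store = new MapBackedStore<>();
        store.write("k", "v");
        System.out.println(store.load("k")); // prints v
    }
}
```

The interface segregation point from the Pros above falls out directly: a read-only store never has to stub out write methods.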

> 
> 3. allowing the cache loader to expose unserialised data directly (ValueHolder.getBytes()).

I used the name ValueHolder but this is a really poor term - how about ContentsProxy?  It is a proxy for the contents of the entry, exposing methods:

interface ContentsProxy {
  // Raw bytes of the stored value, without deserialising.
  ByteBuffer getValueBuffer();

  // Raw bytes of the whole InternalCacheValue envelope.
  ByteBuffer getInternalCacheValueBuffer();

  // Fully deserialised entry.
  InternalCacheValue getInternalCacheValue();

  // Same as above, except this method only deserialises timestamps and
  // metadata.  Not the actual value.
  InternalCacheValue getSparseInternalCacheValue();
}
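To illustrate the intent without the real InternalCacheValue machinery, a stand-in proxy could hand out the raw bytes cheaply and only pay the deserialisation cost when the caller actually asks for the object.  In this sketch String plays the role of the value type and UTF-8 the role of the marshaller; both are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LazyContentsProxy {
    private final byte[] raw;

    public LazyContentsProxy(byte[] raw) {
        this.raw = raw;
    }

    // Cheap path: a read-only view of the stored bytes, no deserialisation.
    public ByteBuffer getValueBuffer() {
        return ByteBuffer.wrap(raw).asReadOnlyBuffer();
    }

    // Expensive path: only taken when the caller really needs the object.
    public String getValue() {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LazyContentsProxy p =
            new LazyContentsProxy("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(p.getValueBuffer().remaining()); // prints 5
        System.out.println(p.getValue());                   // prints hello
    }
}
```

A backup or rolling-upgrade consumer would only ever touch the cheap path.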

> The use cases we had for this are: 
> a) streaming data during rolling upgrades. This works for scenarios where the data format (user classes) hasn't changed and the data is written directly to a persistent store in the destination cluster
> b) backups. This can be a generic and efficient (no serialisation) way of creating a backup tool. 

There are two more:
c) Pre-populating a cache store from an external resource.
d) Exposing the underlying byte buffers directly for placement into, say, a native data container or directly onto the network stack for transmission (once JGroups has moved to JDK 7).

> I haven't thought a) through entirely, but it seems to me that it only applies to a rather specific rolling upgrade scenario.
> Re: b) there might be some more efficient ways of backing up data: take a database dump (JDBC cache store), copy the files (file cache store) etc. Also I'm not sure that the speed with which you take the dump is critical - i.e. even if you serialise/deserialise the data it might just work.

It's not just the performance hit we take on serialisation/de-serialisation, but also the additional CPU load we place on a system which should be busy performing transactions!

> Also, in order to solve a) and b), I don't think ValueHolder.getBytes() is the way to go. E.g. the bucket-based cache stores use an entire bucket as their read (and serialisation) unit, so forcing them to return bytes on a per-entry basis would mean:
> - read the bucket as a byte[]
> - deserialise the bucket structure
> - iterate over the entries in the bucket and serialise them again in order to satisfy ValueHolder.getBytes()

That's just the way buckets are currently designed.  If, for example, each bucket had a header structured like [key1][position1][key2][position2][end-of-keys marker][value1][value2], then just by reading the header part of the bucket we could grab chunks of the values, based on the position information, without deserialising them.  Of course, this is an "efficient" implementation; a naive one could do what you describe above and still comply with the contract.
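A minimal sketch of that header idea, with a toy encoding (length-prefixed keys, absolute value offsets, -1 as the end-of-keys marker - none of this is the current bucket format):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BucketSliceDemo {
    // Build a toy bucket: a header of (keyLen, keyBytes, valueOffset,
    // valueLen) tuples terminated by keyLen = -1, then the value bytes.
    static byte[] buildBucket(String[] keys, byte[][] values) {
        int headerSize = 4; // terminator
        int valuesSize = 0;
        for (int i = 0; i < keys.length; i++) {
            headerSize += 4 + keys[i].getBytes(StandardCharsets.UTF_8).length + 8;
            valuesSize += values[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(headerSize + valuesSize);
        int off = headerSize;
        for (int i = 0; i < keys.length; i++) {
            byte[] k = keys[i].getBytes(StandardCharsets.UTF_8);
            buf.putInt(k.length).put(k).putInt(off).putInt(values[i].length);
            off += values[i].length;
        }
        buf.putInt(-1); // end-of-keys marker
        for (byte[] v : values) buf.put(v);
        return buf.array();
    }

    // Read ONLY the header, then slice out one value without touching
    // (let alone deserialising) any of its neighbours.
    static ByteBuffer sliceValue(byte[] bucket, String key) {
        ByteBuffer buf = ByteBuffer.wrap(bucket);
        int keyLen;
        while ((keyLen = buf.getInt()) != -1) {
            byte[] k = new byte[keyLen];
            buf.get(k);
            int off = buf.getInt(), len = buf.getInt();
            if (new String(k, StandardCharsets.UTF_8).equals(key)) {
                return ByteBuffer.wrap(bucket, off, len).slice();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] bucket = buildBucket(
            new String[] {"a", "b"},
            new byte[][] {"xx".getBytes(StandardCharsets.UTF_8),
                          "yyy".getBytes(StandardCharsets.UTF_8)});
        ByteBuffer v = sliceValue(bucket, "b");
        System.out.println(StandardCharsets.UTF_8.decode(v)); // prints yyy
    }
}
```

The cost of fetching one entry's bytes is then one header scan plus one slice, independent of how the other values are serialised.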

> A better approach for this is to have toStream and fromStream methods similar to what we currently have in CacheStore, so that the whole marshalling/unmarshalling business is delegated to the CacheStore itself. Also, now that we're here: the CacheStore.toStream/fromStream API was added with the intention of solving the same problem some 4 years ago, and it is not used at all at this stage, though implemented by all the existing stores.

Yes, but I think we can do better than the toStream/fromStream API.

> Bottom line for 3:  I think this is a case in which we should stick to the "if you're not sure, don't add it" rule. We can always add it later: a new interface, StreamableCacheLoader, extending CacheLoader.

Not always true, since in this case, the API we choose may dictate storage format on disk, which in turn will become a compatibility issue when reading data written using an older version of the same cache loader.

> 
> [1] http://en.wikipedia.org/wiki/Interface_segregation_principle
> [2] but Sanne / joke
> 
> On 8 Aug 2013, at 17:26, Manik Surtani <msurtani at redhat.com> wrote:
> 
>> Hey guys
>> 
>> This was good fun today.
>> 
>> Regarding the parallelised "process()" method, we should also look at Java 8 collections (which are introducing similar methods) and see if there is something we can learn (API-wise) there.
>> 
>> http://www.javabeat.net/2012/05/enhanced-collections-api-in-java-8-supports-lambda-expressions/
>> http://download.java.net/jdk8/docs/api/
>> http://download.java.net/jdk8/docs/api/java/util/Map.html#compute(K, java.util.function.BiFunction)
>> 
>> Also, what we didn't chat about: exposing ByteBuffers.  I suppose instead of exposing byte[] in ValueHolder, we should provide a reference to a ByteBuffer - http://download.java.net/jdk8/docs/api/java/nio/ByteBuffer.html - and also provide similar techniques for writing ByteBuffers on AdvancedCacheWriter.
>> 
>> And then all we have to do is re-implement the DataContainer to use ByteBuffers as well, and we can take advantage of Bela's upcoming changes to JGroups!  :)
>> 
>> 
>> 
>> --
>> Manik Surtani
>> 
>> 
>> 
> 
> Cheers,
> -- 
> Mircea Markus
> Infinispan lead (www.infinispan.org)
> 
> 
> 
> 

--
Manik Surtani





