[infinispan-dev] Design session today

Mircea Markus mmarkus at redhat.com
Fri Aug 9 07:32:49 EDT 2013


On 9 Aug 2013, at 11:35, Manik Surtani <msurtani at redhat.com> wrote:

>> 
>> 3. allowing the cache loader to expose unserialised data directly (ValueHolder.getBytes[]).
> 
> I used the name ValueHolder but this is a really poor term - how about ContentsProxy?  It is a proxy for the contents of the entry, exposing methods:
> 

What about Entry, or StoredEntry?

interface StoredEntry {
   Object getKey();
   InternalCacheValue getInternalCacheValue();
}

and

interface BinaryStoredEntry extends StoredEntry {
   ByteBuffer getBinaryKey();
   ByteBuffer getBinaryCacheValue();
   ByteBuffer getBinaryInternalCacheValue();
}

with the latter added at the time we need it, if we need it.
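
To make the intent concrete, here's a rough sketch of how a consumer such as a backup tool could prefer the raw bytes when a store provides them and fall back to re-marshalling otherwise. It builds on the interfaces above; BackupWriter and marshal are made-up names, not proposed API:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

abstract class BackupWriter {
   // Writes one entry to the backup stream; when the store exposes its raw bytes
   // via BinaryStoredEntry we copy them straight through, with no deserialisation.
   void write(StoredEntry entry, OutputStream out) throws IOException {
      if (entry instanceof BinaryStoredEntry) {
         BinaryStoredEntry binary = (BinaryStoredEntry) entry;
         writeBuffer(binary.getBinaryKey(), out);
         writeBuffer(binary.getBinaryInternalCacheValue(), out);
      } else {
         // Fallback path: marshal(...) stands in for whatever marshaller the tool uses.
         writeBuffer(marshal(entry.getKey()), out);
         writeBuffer(marshal(entry.getInternalCacheValue()), out);
      }
   }

   private void writeBuffer(ByteBuffer buf, OutputStream out) throws IOException {
      byte[] bytes = new byte[buf.remaining()];
      buf.duplicate().get(bytes); // copy without disturbing the buffer's position
      out.write(bytes);
   }

   // Stand-in for the tool's own marshaller.
   abstract ByteBuffer marshal(Object o);
}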

> interface ContentsProxy {
>  ByteBuffer getValueBuffer();
>  ByteBuffer getInternalCacheValueBuffer();
>  InternalCacheValue getInternalCacheValue();
> 
>  // Same as above except this method only deserializes timestamps and metadata.  Not the actual value.
>  InternalCacheValue getSparseInternalCacheValue();
> }
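
Implementation-wise, any of these could be a thin wrapper over the raw bytes that only pays the deserialisation cost when the typed accessor is actually called. A minimal sketch, with a made-up Unmarshaller standing in for whichever marshaller the store uses (LazyContentsProxy is likewise just illustrative):

import java.nio.ByteBuffer;

// Hypothetical stand-in for the store's marshaller; not an existing interface.
interface Unmarshaller {
   Object readObject(ByteBuffer bytes) throws Exception;
}

// Sketch of a byte-backed proxy: the raw buffer is always available cheaply,
// the object form is built lazily and cached on first use.
class LazyContentsProxy {
   private final ByteBuffer valueBytes;
   private final Unmarshaller unmarshaller;
   private Object cachedValue;

   LazyContentsProxy(ByteBuffer valueBytes, Unmarshaller unmarshaller) {
      this.valueBytes = valueBytes;
      this.unmarshaller = unmarshaller;
   }

   ByteBuffer getValueBuffer() {
      return valueBytes.asReadOnlyBuffer(); // no deserialisation at all
   }

   Object getValue() throws Exception {
      if (cachedValue == null) {
         cachedValue = unmarshaller.readObject(valueBytes.duplicate());
      }
      return cachedValue;
   }
}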



> 
>> The use cases we had for this are: 
>> a) streaming data during rolling upgrades. This works for scenarios where the data format (user classes) haven't changed and the data is written directly to a persistent store in the destination cluster
>> b) backups. This can be a generic and efficient (no serialisation) way of creating a backup tool. 
> 
> There are two more:
> c) Pre-populating a cache store from an external resource.

yes, I think it's in the same category as b).

> d) Exposing the underlying byte buffers directly for placement into, say, a native data container or directly onto the network stack for transmission (once JGroups has moved to JDK 7).

+1

> 
>> I haven't thought a) through entirely, but it seems to me that it only applies to a rather specific rolling upgrade scenario.
>> Re: b) there might be some more efficient ways of backing up data: take a database dump (JDBC cache store), copy the files (file cache store) etc. Also I'm not sure that the speed with which you take the dump is critical - i.e. even if you serialise/deserialise the data it might just work.
> 
> It's not just the performance hit we take on serialisation/de-serialisation, but also the additional CPU load we place on the system which should be running, performing transactions!

It might just work in practical terms: the backup operations are normally run when the system is not under load.

> 
>> Also, in order to solve a) and b), I don't think ValueHolder.getBytes[] is the way to go. E.g. the bucket-based cache stores use an entire bucket as their read (and serialisation) unit, so forcing them to return the bytes on a per-entry basis would mean:
>> - read the bucket as byte[]
>> - deserialise the bucket structure
>> - iterate over the entries in the bucket and serialise them again in order to satisfy ValueHolder.getBytes[]
> 
> That's just the way buckets are currently designed.  If, for example, each bucket has a header with a structure that looks like: [key1][position1][key2][position2][end-of-keys marker][value1][value2], then just by reading the header part of the bucket, we can grab chunks based on the position information for the values without deserializing them.  Of course, this is an "efficient" implementation.  A naive one could do what you said above and still comply with the contract.

that would work :-)
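
For the record, a minimal sketch of what reading such a header might look like - the layout (a [keyLen][keyBytes][valueOffset] sequence terminated by a marker, followed by the value blobs) is purely illustrative, not the current bucket format:

import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

class BucketHeaderReader {
   private static final int END_OF_KEYS = -1; // assumed end-of-keys marker

   // 'bucket' holds a whole serialised bucket laid out as:
   // [keyLen][keyBytes][valueOffset] ... [END_OF_KEYS] [value1][value2]...
   // Returns each serialised key mapped to a read-only slice of its serialised
   // value, without ever deserialising the values themselves.
   static Map<ByteBuffer, ByteBuffer> sliceValues(byte[] bucket) {
      ByteBuffer buf = ByteBuffer.wrap(bucket);
      Map<ByteBuffer, Integer> offsets = new LinkedHashMap<ByteBuffer, Integer>();
      int keyLen;
      while ((keyLen = buf.getInt()) != END_OF_KEYS) {
         byte[] key = new byte[keyLen];
         buf.get(key);
         offsets.put(ByteBuffer.wrap(key), buf.getInt()); // absolute offset of this value
      }
      Map<ByteBuffer, ByteBuffer> values = new LinkedHashMap<ByteBuffer, ByteBuffer>();
      Integer[] positions = offsets.values().toArray(new Integer[offsets.size()]);
      int i = 0;
      for (Map.Entry<ByteBuffer, Integer> e : offsets.entrySet()) {
         int start = e.getValue();
         int end = (i + 1 < positions.length) ? positions[i + 1] : bucket.length;
         values.put(e.getKey(), ByteBuffer.wrap(bucket, start, end - start).slice().asReadOnlyBuffer());
         i++;
      }
      return values;
   }
}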

> 
>> A better approach for this is to have toStream and fromStream methods similar to what we currently have in CacheStore, so that the whole marshalling/unmarshalling business is delegated to the CacheStore itself. Also, now that we're here: the CacheStore.toStream/fromStream API was added with the intention of solving this same problem some 4 years ago and is not used at all at this stage, though it is implemented by all the existing stores.
> 
> Yes, but I think we can do better than the toStream/fromStream API.
> 
>> Bottom line for 3: I think this is a case in which we should stick to the "if you're not sure, don't add it" rule. We can always add it later: a new interface StreamableCacheLoader to extend CacheLoader.
> 
> Not always true, since in this case, the API we choose may dictate storage format on disk, which in turn will become a compatibility issue when reading data written using an older version of the same cache loader.

If they use rolling upgrades they don't have this problem. 
If they do upgrades with a shutdown, they can use a migration tool (quite simple to write). There are already discussions about adding a simple general-purpose migration tool - see the email titled "migrating data between the FileCacheStore -> SingleFileCacheStore".
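
For the record, such a tool doesn't need to be much more than "iterate the source store, write into the destination store". A rough sketch, reusing the StoredEntry idea from above (MigratableStore and its loadAll/store methods are stand-ins here, not the real cache store SPI):

import java.util.Set;

// Hypothetical adapter each end of the migration would have to implement.
interface MigratableStore {
   Set<StoredEntry> loadAll() throws Exception;    // read everything from the source store
   void store(StoredEntry entry) throws Exception; // write one entry into the destination store
}

class StoreMigrator {
   // Copy every entry from the old store into the new one, whatever their on-disk formats.
   void migrate(MigratableStore from, MigratableStore to) throws Exception {
      for (StoredEntry entry : from.loadAll()) {
         to.store(entry);
      }
   }
}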

Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)