We should actually move all of this to infinispan-dev - cc'ing infinispan-dev on my
response.
On 9 Aug 2013, at 11:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
Hi,
I've been giving this some thought since last evening, and here are some second-day thoughts:
1. parallel processing is a great idea and I think it's really something that would make a
difference compared to our competition
+1. We should consider the JDK 8 collections APIs as a reference, as I mentioned.
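Something along these lines, perhaps - every name here (ParallelCacheLoader, EntryTask,
processAll) is illustrative only, just to show the shape such an API could take:

   import java.util.concurrent.Executor;
   import java.util.function.Predicate;

   interface EntryTask<K, V> {
      void accept(K key, V value);
   }

   interface ParallelCacheLoader<K, V> {
      // Apply the task to every persisted entry whose key matches the
      // filter, fanning the work out over the supplied executor.
      void processAll(Predicate<K> keyFilter, EntryTask<K, V> task,
                      Executor executor);
   }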
2. using two interfaces (CacheLoader, CacheWriter) over one. I'm still not totally
sold on the idea
Pros: cleaner design (interface segregation principle[1]) which would allow users to only
implement what they need
Cons: the difference between cache loader and cache store (or writer) has been a source
of confusion among users, as most users[2] only use the combined version
I'll continue the discussion on the public list
Also, JSR 107 (and, following it, most other data grids) uses separate
CacheLoader/CacheWriter interfaces. I think people will get used to the separation.
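For illustration, the split could look something like this - method shapes in the spirit
of JSR 107's javax.cache.integration types, not exact signatures:

   interface CacheLoader<K, V> {
      V load(K key);
   }

   interface CacheWriter<K, V> {
      void write(K key, V value);
      void delete(K key);
   }

   // Anyone who wants the old combined behaviour simply implements both:
   interface CacheStore<K, V> extends CacheLoader<K, V>, CacheWriter<K, V> {
   }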
3. allowing the cache loader to expose unserialised data directly
(ValueHolder.getBytes()).
I used the name ValueHolder but this is a really poor term - how about ContentsProxy? It
is a proxy for the contents of the entry, exposing methods:
interface ContentsProxy {
   ByteBuffer getValueBuffer();
   ByteBuffer getInternalCacheValueBuffer();
   InternalCacheValue getInternalCacheValue();

   // Same as above, except this method only deserialises timestamps and
   // metadata - not the actual value.
   InternalCacheValue getSparseInternalCacheValue();
}
The use cases we had for this are:
a) streaming data during rolling upgrades. This works for scenarios where the data format
(user classes) hasn't changed and the data is written directly to a persistent store
in the destination cluster
b) backups. This can be a generic and efficient (no serialisation) way of creating a
backup tool.
There are two more:
c) Pre-populating a cache store from an external resource.
d) Exposing the underlying byte buffers directly for placement into, say, a native data
container or directly onto the network stack for transmission (once JGroups has moved to
JDK 7).
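To make b) a bit more concrete, a backup tool (purely hypothetical code) could stream an
entry's raw bytes through ContentsProxy without ever deserialising the value:

   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.channels.WritableByteChannel;

   final class BackupTool {
      // Copy the entry's raw bytes straight to the backup channel;
      // no unmarshalling happens anywhere on this path.
      static void backupEntry(ContentsProxy entry, WritableByteChannel out)
            throws IOException {
         ByteBuffer raw = entry.getInternalCacheValueBuffer();
         while (raw.hasRemaining()) {
            out.write(raw);
         }
      }
   }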
I haven't thought a) through entirely, but it seems to me that it only applies
to a rather specific rolling upgrade scenario.
Re: b), there might be more efficient ways of backing up data: take a database
dump (JDBC cache store), copy the files (file cache store), etc. Also, I'm not sure
the speed with which you take the dump is critical - i.e. even if you
serialise/deserialise the data, it might just work.
It's not just the performance hit we take on serialisation/deserialisation, but also
the additional CPU load we place on a system which should be busy performing
transactions!
Also, in order to solve a) and b), I don't think
ValueHolder.getBytes() is the way to go. E.g. the bucket-based cache stores use an
entire bucket as their read (and serialisation) unit, so forcing them to return bytes
on a per-entry basis would mean:
- read the bucket as a byte[]
- deserialise the bucket structure
- iterate over the entries in the bucket and serialise each of them again, just to
satisfy ValueHolder.getBytes()
That's just the way buckets are currently designed. If, for example, each bucket had
a header with a structure like [key1][position1][key2][position2][end-of-keys
marker][value1][value2], then just by reading the header part of the bucket we could
grab chunks for the values based on the position information, without deserialising
them. Of course, this is an "efficient" implementation; a naive one could do what you
said above and still comply with the contract.
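A rough sketch of that idea - the header layout, the int-based encoding and the
end-of-keys marker are all assumptions, purely for illustration:

   import java.nio.ByteBuffer;
   import java.util.Arrays;

   final class BucketReader {
      private static final int END_OF_KEYS = -1; // assumed marker

      // Assumed layout: repeated [keyLen:int][key bytes][valuePos:int]
      // header entries terminated by END_OF_KEYS, then the value bytes,
      // contiguous and in key order. Returns a view over the wanted
      // value's raw bytes (or null); nothing is deserialised.
      static ByteBuffer valueFor(ByteBuffer bucket, byte[] wantedKey) {
         ByteBuffer header = bucket.duplicate();
         int wantedStart = -1;
         int nextStart = -1;
         int keyLen;
         while ((keyLen = header.getInt()) != END_OF_KEYS) {
            byte[] key = new byte[keyLen];
            header.get(key);
            int valuePos = header.getInt();
            if (wantedStart >= 0) {
               nextStart = valuePos; // wanted value ends where the next starts
               break;
            }
            if (Arrays.equals(key, wantedKey)) {
               wantedStart = valuePos;
            }
         }
         if (wantedStart < 0) {
            return null;                // key not in this bucket
         }
         if (nextStart < 0) {
            nextStart = bucket.limit(); // wanted value is the last one
         }
         ByteBuffer slice = bucket.duplicate();
         slice.position(wantedStart);
         slice.limit(nextStart);
         return slice.slice();          // raw bytes only
      }
   }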
A better approach for this would be to have toStream and fromStream methods
similar to what we currently have in CacheStore, so that the whole
marshalling/unmarshalling business is delegated to the CacheStore itself. Also, while
we're here: the CacheStore.toStream/fromStream API was added to solve this very
problem some 4 years ago and is not used at all at this stage, though it is
implemented by all the existing stores.
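For reference, those signatures look roughly like this (checked exceptions and the rest
of the interface elided):

   import java.io.ObjectInput;
   import java.io.ObjectOutput;

   interface CacheStore {
      // Dump the store's entire contents to the stream, in whatever
      // format the store itself prefers.
      void toStream(ObjectOutput out);

      // Recreate the store's contents from a stream previously produced
      // by toStream() on a compatible store.
      void fromStream(ObjectInput in);
   }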
Yes, but I think we can do better than the toStream/fromStream API.
Bottom line for 3: I think this is a case in which we should stick
to the "if you're not sure, don't add it" rule. We can always add it
later as a new interface, StreamableCacheLoader, extending CacheLoader.
Not always true: in this case the API we choose may dictate the storage format on disk,
which in turn becomes a compatibility issue when reading data written by an older
version of the same cache loader.
[1]
http://en.wikipedia.org/wiki/Interface_segregation_principle
[2] all but Sanne (joke)
On 8 Aug 2013, at 17:26, Manik Surtani <msurtani(a)redhat.com> wrote:
> Hey guys
>
> This was good fun today.
>
> Regarding the parallelised "process()" method, we should also look at Java 8
> collections (which are introducing similar methods) and see if there is
> something we can learn (API-wise) there.
>
>
> http://www.javabeat.net/2012/05/enhanced-collections-api-in-java-8-suppor...
>
> http://download.java.net/jdk8/docs/api/
>
> http://download.java.net/jdk8/docs/api/java/util/Map.html#compute(K, java.util.function.BiFunction)
>
> Also, what we didn't chat about: exposing ByteBuffers. I suppose instead of
> exposing byte[] in ValueHolder, we should provide a reference to a ByteBuffer -
> http://download.java.net/jdk8/docs/api/java/nio/ByteBuffer.html - and also
> provide similar techniques for writing ByteBuffers on AdvancedCacheWriter.
>
> And then all we have to do is re-implement the DataContainer to use
> ByteBuffers as well, and we can take advantage of Bela's upcoming changes to
> JGroups! :)
>
> --
> Manik Surtani
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
--
Manik Surtani