[infinispan-dev] storing in-memory data in binary format

Manik Surtani manik at jboss.org
Thu Oct 15 09:25:09 EDT 2009


Interesting.  We should look at the lazy unmarshalling feature that we
already have, implemented using the MarshalledValueInterceptor [1] and
the MarshalledValue [2] wrapper class.  It probably provides some, if
not all, of the benefits outlined below, and where it falls short it
can probably be enhanced accordingly.
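
Roughly, the idea is something like this (a simplified sketch, not the
actual code at [2]; it uses plain Java serialization and made-up names
for brevity):

import java.io.*;

// Simplified stand-in for a lazily-marshalled value wrapper: it holds
// either the object or its serialized form and converts on demand.
public final class LazyValue {
    private Object instance;   // deserialized form, if present
    private byte[] raw;        // serialized form, if present

    public LazyValue(Object instance) { this.instance = instance; }
    public LazyValue(byte[] raw) { this.raw = raw; }

    // Serialize on demand, e.g. before shipping the value to another node.
    public synchronized byte[] getRaw() throws IOException {
        if (raw == null) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(instance);
            oos.close();
            raw = bos.toByteArray();
        }
        return raw;
    }

    // Deserialize on demand for local reads.
    public synchronized Object get() throws IOException, ClassNotFoundException {
        if (instance == null) {
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(raw));
            instance = ois.readObject();
            ois.close();
        }
        return instance;
    }

    // At the end of an invocation one form can be dropped, so both are
    // only ever held briefly (see the 2x spike discussion below).
    public synchronized void compact(boolean keepRaw) {
        if (keepRaw && raw != null) instance = null;
        else if (!keepRaw && instance != null) raw = null;
    }
}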

Cheers
Manik

[1] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/interceptors/MarshalledValueInterceptor.java?r=903
[2] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/marshall/MarshalledValue.java?r=756

On 6 Oct 2009, at 10:32, Mircea Markus wrote:

> After some more thinking, keeping data in binary format would bring
> further advantages in the case of DIST.
>
> Given a cluster with N nodes.
>
> With the current implementation, when doing a cache.get():
> i) if the data is on a remote node (probability for this is (N-1)/N [1])
>     a) send a remote get (w - network latency)
>     b) on the remote node, serialize the in-memory object and send it
> over (s - serialization duration)
>     c) the data is sent back (w)
>     d) at the receiving node, deserialize it and pass it to the
> caller (s)
> ii) if the data is on the same node, just return it (no serialization)
>
> so the average get performance for a cluster made up of N nodes is:
> AvgT1 = ((N-1)*i + 1*ii)/N = ((N-1)*(2*w + 2*s) + 0)/N
>
> On the other hand, when keeping the data in serialized format:
> i) if the data is on a remote node (probability is (N-1)/N)
>     a) send a remote get (w - network latency)
>     b) on the remote node, the already serialized data is sent
> across (w)
>     c) at the receiving node, deserialize the data (s - duration)
> ii) if the data is on the same node, it needs to be deserialized - s
> so the average get performance for a cluster made up of N nodes is:
> AvgT2 = ((N-1)*(2*w + s) + 1*s)/N
>
>
> AvgT1 - AvgT2 = ((N-1)*(2*w + 2*s) - (N-1)*(2*w + s) - s)/N
>               = ((N-1)*s - s)/N = s*(N-2)/N
>
> so for N=2 the average performance is the same: AvgT1 - AvgT2 = 0.
> For N greater than 2, AvgT1 - AvgT2 is always positive, approaching
> the value s for 'big' clusters.
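>
> For concreteness, here is a quick check of the formulas with made-up
> costs (the numbers below are only illustrative):
>
> public class GetCostCheck {
>     public static void main(String[] args) {
>         double w = 1.0;  // assumed network latency (ms)
>         double s = 0.5;  // assumed (de)serialization cost (ms)
>         for (int n : new int[] {2, 5, 10, 100}) {
>             double avgT1 = (n - 1) * (2 * w + 2 * s) / n;    // current implementation
>             double avgT2 = ((n - 1) * (2 * w + s) + s) / n;  // serialized storage
>             System.out.printf("N=%d: AvgT1=%.2f AvgT2=%.2f diff=%.2f s*(N-2)/N=%.2f%n",
>                     n, avgT1, avgT2, avgT1 - avgT2, s * (n - 2) / n);
>         }
>     }
> }
>
> With these numbers the difference is 0 for N=2 and already 0.4 ms
> (i.e. 80% of s) for N=10.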
>
> Note: the same performance benefit (i.e. s*(N-2)/N) applies in the
> case of put as well, because on the remote node the object no longer
> gets deserialized.
>
> [1] this is based on the assumption that when asking for a key that is
> not located on the current node, the cache will always do a remote get
> and not look into its local backup. This optimization might increase
> the performance of the non-serialized approach. Manik, do we have
> something like this?
>
> Cheers,
> Mircea
>
> On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:
>
>>
>> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>>
>>> LazyMarshalling (use of marshalled values internally) does precisely
>>> this.  If MarshalledValues are used, then calculating the size is
>>> quite easy, except that memory may temporarily spike to 2x the byte[]
>>> when both forms (serialized and deserialized) are held at once.  This
>>> is only a spike though; one form is always cleared out at the end of
>>> every invocation.
>> Marshalled values are different, as they keep either the serialized
>> form or the Object. The approach I am talking about is to always keep
>> only the serialized form, and to deserialize on each read. This is
>> needed in order to accurately determine the size of the cached
>> objects.
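>>
>> To make it concrete, such an entry could look roughly like this (an
>> illustrative sketch with made-up names, using plain Java
>> serialization):
>>
>> import java.io.*;
>>
>> // Stores the value only as a byte[]; its memory cost is simply
>> // raw.length, and every read pays a deserialization.
>> public final class BinaryEntry {
>>     private final byte[] raw;
>>
>>     public BinaryEntry(Serializable value) throws IOException {
>>         ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>         ObjectOutputStream oos = new ObjectOutputStream(bos);
>>         oos.writeObject(value);
>>         oos.close();
>>         this.raw = bos.toByteArray();
>>     }
>>
>>     // Exact size in bytes -- this is what enables memory-based eviction.
>>     public int sizeInBytes() {
>>         return raw.length;
>>     }
>>
>>     // Deserialize on every read (the latency cost listed in the cons below).
>>     public Object get() throws IOException, ClassNotFoundException {
>>         ObjectInputStream ois =
>>             new ObjectInputStream(new ByteArrayInputStream(raw));
>>         try {
>>             return ois.readObject();
>>         } finally {
>>             ois.close();
>>         }
>>     }
>> }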
>>>
>>>
>>>
>>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>>
>>>> Hi,
>>>>
>>>> While working on the Coherence config converter, I've seen that they
>>>> are able to specify how much memory a cache should use at a time (we
>>>> also had a thought about this in the past but abandoned the idea,
>>>> mainly due to performance). E.g.:
>>>> <backing-map-scheme>
>>>>   <local-scheme>
>>>>     <high-units>100m</high-units>
>>>>     <unit-calculator>BINARY</unit-calculator>
>>>>     <eviction-policy>LRU</eviction-policy>
>>>>   </local-scheme>
>>>> </backing-map-scheme>
>>>> When 100 MB is reached, data will start to be evicted. I know we
>>>> support marshalled values, but it's not exactly the same thing.
>>>> I've been wondering how they achieve this, and I've found this:
>>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>>> The way it works (my understanding!) is by keeping the keys and
>>>> values in serialized form within the map. So, if you have both the
>>>> key and the value as a byte[], you can easily measure the memory
>>>> footprint of the cached data.
>>>> Now what about keeping data in maps in serialized form?
>>>> Pros:
>>>> - we would be able to support memory-based eviction triggers (see
>>>> the sketch after this list).
>>>> - in DIST mode, a put would no longer need to deserialize the data
>>>> at the receiving end. That deserialization is often redundant
>>>> anyway: if another node later asks for this data, we would just
>>>> have to serialize it again.
>>>> - sync puts would be faster, as the data only gets serialized (and
>>>> is not deserialized at the other end).
>>>> - ???
>>>> Cons:
>>>> - data would be deserialized on each get request, adding latency.
>>>> This is partially compensated by the faster puts (see pros) and can
>>>> be mitigated by using L1 caches (near caches).
>>>> - ???
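>>>>
>>>> A rough idea of the memory-based trigger on top of serialized
>>>> entries (illustrative only, not a proposal for an actual API; it
>>>> just shows that byte-counting eviction becomes trivial once values
>>>> are byte[]s):
>>>>
>>>> import java.util.Iterator;
>>>> import java.util.LinkedHashMap;
>>>> import java.util.Map;
>>>>
>>>> // Evicts least-recently-used entries once the tracked byte total
>>>> // exceeds maxBytes. Assumes non-null byte[] values.
>>>> public class ByteBoundedMap<K> extends LinkedHashMap<K, byte[]> {
>>>>     private final long maxBytes;
>>>>     private long currentBytes;
>>>>
>>>>     public ByteBoundedMap(long maxBytes) {
>>>>         super(16, 0.75f, true);   // access order == LRU iteration
>>>>         this.maxBytes = maxBytes;
>>>>     }
>>>>
>>>>     public byte[] put(K key, byte[] value) {
>>>>         byte[] old = super.put(key, value);
>>>>         currentBytes += value.length - (old == null ? 0 : old.length);
>>>>         // drop eldest entries until we are back under the limit
>>>>         Iterator<Map.Entry<K, byte[]>> it = entrySet().iterator();
>>>>         while (currentBytes > maxBytes && it.hasNext()) {
>>>>             currentBytes -= it.next().getValue().length;
>>>>             it.remove();
>>>>         }
>>>>         return old;
>>>>     }
>>>> }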
>>>>
>>>> Well, I'm not even sure that this fits with our current
>>>> architecture; I just brought it up for brainstorming :)
>>>>
>>>> Cheers,
>>>> Mircea
>>>
>>>
>>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
