[infinispan-dev] storing in memory data in binary format

Mircea Markus mircea.markus at jboss.com
Thu Oct 15 10:07:34 EDT 2009


Does anyone see any functional reason for not keeping the data in
binary format?

On Oct 15, 2009, at 4:25 PM, Manik Surtani wrote:

> Interesting.  We should look at the lazy unmarshalling feature we
> already have, implemented using the MarshalledValueInterceptor [1]
> and the MarshalledValue [2] wrapper class.  It probably provides some,
> if not all, of the benefits outlined below, and if not all, it can
> probably be enhanced to do so.
>
> Cheers
> Manik
>
> [1] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/interceptors/MarshalledValueInterceptor.java?r=903
> [2] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/marshall/MarshalledValue.java?r=756
>
> On 6 Oct 2009, at 10:32, Mircea Markus wrote:
>
>> After some more thinking, keeping data in binary format will bring
>> some more advantages in the case of DIST.
>>
>> Given a cluster with N nodes.
>>
>> With the current implementation, when doing a cache.get():
>> i) if the data is on a remote node (probability for this is (N-1)/N [1]):
>>    a) send a remote get (w - network latency)
>>    b) on the remote node, serialize the in-memory object (s -
>> serialization duration)
>>    c) send the serialized data back (w)
>>    d) at the receiving node, deserialize it and pass it to the
>> caller (s)
>> ii) if the data is on the same node, just return it (no serialization)
>>
>> So the average get cost for a cluster of N nodes is:
>> AvgT1 = ((N-1)*i + 1*ii)/N = ((N-1)*(2*w + 2*s) + 0)/N
>>
>> On the other hand, when keeping the data in serialized format:
>> i) if the data is on a remote node (probability (N-1)/N):
>>    a) send a remote get (w - network latency)
>>    b) on the remote node, the already-serialized data is sent
>> across (w)
>>    c) at the receiving node, deserialize the data (s)
>> ii) if the data is on the same node, it needs to be deserialized (s)
>> So the average get cost for a cluster of N nodes is:
>> AvgT2 = ((N-1)*(2*w + s) + 1*s)/N
>>
>>
>> AvgT1 - AvgT2 = ((N-1)*s - s)/N = s*(N-2)/N
>>
>> So for N=2 the average performance is the same: AvgT1 - AvgT2 = 0.
>> For N greater than 2, AvgT1 - AvgT2 is always positive, approaching
>> the value s for large clusters.
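The two averages above can be checked numerically. Below is a small sketch (the class and method names are mine, and the values of w and s are purely illustrative) that computes both formulas and their difference for a few cluster sizes:

```java
// Sketch: average get cost under the two storage schemes discussed in
// this thread. w = network latency, s = (de)serialization cost,
// n = cluster size. AvgT1 is the current scheme, AvgT2 is the
// binary-storage scheme; their difference should equal s*(n-2)/n.
public class AvgGetLatency {
    // Current scheme: remote hit costs 2w + 2s, local hit costs 0.
    static double avgT1(int n, double w, double s) {
        return ((n - 1) * (2 * w + 2 * s)) / n;
    }

    // Binary-storage scheme: remote hit costs 2w + s, local hit costs s.
    static double avgT2(int n, double w, double s) {
        return ((n - 1) * (2 * w + s) + s) / n;
    }

    public static void main(String[] args) {
        double w = 1.0, s = 0.5; // illustrative values
        for (int n = 2; n <= 16; n *= 2) {
            double diff = avgT1(n, w, s) - avgT2(n, w, s);
            System.out.printf("n=%d diff=%.4f expected=%.4f%n",
                    n, diff, s * (n - 2) / n);
        }
    }
}
```

For n=2 the difference is 0, and as n grows it approaches s, matching the derivation above.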
>>
>> Note: the same performance benefit (i.e. s*(N-2)/N) also applies to
>> puts, because on the remote node the object is no longer
>> deserialized.
>>
>> [1] this is based on the assumption that when asking for a key that
>> is not located on the current node, the cache will always do a remote
>> get and not look into its local backup. This optimization might
>> increase the performance of the non-serialized approach. Manik, do we
>> have something like this?
>>
>> Cheers,
>> Mircea
>>
>> On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:
>>
>>>
>>> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>>>
>>>> LazyMarshalling (use of marshalled values internally) does
>>>> precisely this.  If MarshalledValues are used, then calculating the
>>>> size is quite easy, except that memory may temporarily spike to 2x
>>>> the byte[] size when both forms (serialized and deserialized) are
>>>> held.  This is only a spike, though, and one form is always cleared
>>>> out at the end of every invocation.
>>> MarshalledValue is different, as it keeps either the serialized
>>> form or the Object.
>>> The approach I am talking about is to always keep only the
>>> serialized form, and deserialize on each read. This is needed in
>>> order to accurately determine the size of the cached objects.
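As a rough illustration of the keep-only-the-serialized-form idea described above (this is a hypothetical sketch, not Infinispan's MarshalledValue API; all names are mine, and it uses plain JDK serialization rather than Infinispan's marshalling framework):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: a value wrapper that keeps ONLY the serialized
// form, deserializing on every read, so byte[].length gives an exact
// per-entry memory figure.
public class BinaryValue {
    private final byte[] bytes; // the only stored representation

    public BinaryValue(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        this.bytes = bos.toByteArray();
    }

    // Every read pays a deserialization cost; no Object form is cached.
    public Object get() throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    // Exact size of the stored payload, usable for memory-based eviction.
    public int sizeInBytes() {
        return bytes.length;
    }
}
```

Unlike MarshalledValue, which can hold either form, this wrapper never retains the deserialized Object, which is what makes size accounting exact.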
>>>>
>>>>
>>>>
>>>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> While working on the Coherence config converter, I've seen that
>>>>> they are able to specify how much memory a cache should use at a
>>>>> time (we also thought about this in the past but abandoned the
>>>>> idea, mainly due to performance). E.g.:
>>>>> <backing-map-scheme>
>>>>>   <local-scheme>
>>>>>     <high-units>100m</high-units>
>>>>>     <unit-calculator>BINARY</unit-calculator>
>>>>>     <eviction-policy>LRU</eviction-policy>
>>>>>   </local-scheme>
>>>>> </backing-map-scheme>
>>>>> When 100 MB is reached, data will start to be evicted. I know we
>>>>> support marshalled values, but it's not exactly the same thing.
>>>>> I've been wondering how they achieve this, and I've found this:
>>>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>>>> The way it works (my understanding!) is by keeping the keys and
>>>>> values in serialized form within the map. So, if you have both the
>>>>> key and the value as a byte[], you can easily measure the memory
>>>>> footprint of the cached data.
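A minimal sketch of how such a byte budget with LRU eviction could work once keys and values are byte[] (hypothetical names, not Coherence or Infinispan code; it leans on java.util.LinkedHashMap's access-order mode for the LRU behavior):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU map with a byte budget, mirroring the
// <high-units>/BINARY unit-calculator idea. Values are byte[], so the
// memory footprint is just the sum of array lengths.
public class ByteBudgetLruMap {
    private final long maxBytes;
    private long currentBytes = 0;
    private final LinkedHashMap<String, byte[]> map =
            new LinkedHashMap<>(16, 0.75f, true); // access-order = LRU

    public ByteBudgetLruMap(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public void put(String key, byte[] value) {
        byte[] old = map.put(key, value);
        if (old != null) currentBytes -= old.length;
        currentBytes += value.length;
        // Evict least-recently-used entries until back under budget.
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            currentBytes -= it.next().getValue().length;
            it.remove();
        }
    }

    public byte[] get(String key) {
        return map.get(key);
    }

    public long usedBytes() {
        return currentBytes;
    }
}
```

A production version would also count key sizes and per-entry overhead, but the point is that with serialized storage the accounting is exact rather than estimated.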
>>>>> Now what about keeping data in maps in serialized form?
>>>>> Pros:
>>>>> - we would be able to support memory-based eviction triggers.
>>>>> - in DIST mode, when doing a put we won't need to deserialize the
>>>>> data at the other end. This deserialization might be redundant:
>>>>> if another node asks for this data, we'll have to serialize it
>>>>> again anyway.
>>>>> - sync puts would be faster, as the data only gets serialized
>>>>> (and doesn't get deserialized at the other end).
>>>>> - ???
>>>>> Cons:
>>>>> - data would be deserialized for each get request, adding
>>>>> latency. This is partially compensated by faster puts (see pros)
>>>>> and can be mitigated by using L1 caches (near caches).
>>>>> - ???
>>>>>
>>>>> I'm not even sure this fits with our current architecture; I just
>>>>> brought it up for brainstorming :)
>>>>>
>>>>> Cheers,
>>>>> Mircea
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> infinispan-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>> --
>>>> Manik Surtani
>>>> manik at jboss.org
>>>> Lead, Infinispan
>>>> Lead, JBoss Cache
>>>> http://www.infinispan.org
>>>> http://www.jbosscache.org
>>>
>>
>



