[infinispan-dev] storing in-memory data in binary format

Mircea Markus mircea.markus at jboss.com
Tue Oct 6 05:32:00 EDT 2009


After some more thinking, keeping data in binary format would bring
additional advantages in the case of DIST.

Given a cluster with N nodes.

With the current implementation, when doing a cache.get():
i) if the data is on a remote node (probability for this is (N-1)/N [1]):
    a) send a remote get (w - network latency)
    b) on the remote node, serialize the in-memory object and send it
over (s - serialization duration)
    c) the data is sent back (w)
    d) at the receiving node, deserialize it and pass it to the caller (s)
ii) if the data is on the local node, just return it (no serialization)

so the average get cost for a cluster of N nodes is:
AvgT1 = ((N-1)*i + 1*ii)/N = ((N-1)*(2*w + 2*s) + 0)/N

On the other hand, when keeping the data in serialized format:
i) if the data is on a remote node (probability is (N-1)/N):
    a) send a remote get (w - network latency)
    b) on the remote node, send the already serialized data across (w)
    c) at the receiving node, deserialize the data (s)
ii) if the data is on the local node, it still needs to be deserialized (s)

so the average get cost for a cluster of N nodes is:
AvgT2 = ((N-1)*(2*w + s) + 1*s)/N
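
To make the two read paths concrete, here is a minimal, self-contained
Java sketch (class and method names are made up for illustration, and it
uses plain java.io serialization rather than our marshalling code)
showing where the s costs fall in each approach:

import java.io.*;

public class GetPathSketch {

    // serialize: the "s" cost paid by the owning node in the current
    // approach, and paid once at put time in the binary approach
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    // deserialize: the "s" cost paid by the requesting node in both approaches
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String value = "some cached value";

        // current approach, remote hit: the owner serializes (s), the bytes
        // cross the wire (2*w, not modelled here), the requester
        // deserializes (s) => two s costs per remote get
        Object viaObjectStorage = deserialize(serialize(value));

        // binary approach, remote hit: the owner already holds a byte[], so
        // only the requester's deserialization (s) remains on the get path
        byte[] storedForm = serialize(value); // paid once, at put time
        Object viaBinaryStorage = deserialize(storedForm);

        System.out.println(viaObjectStorage.equals(viaBinaryStorage)); // true
    }
}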


AvgT1 - AvgT2 = ((N-1)*s - s)/N = s*(N-2)/N

so for N=2 the average performance is the same: AvgT1 - AvgT2 = 0.
For N greater than 2, AvgT1 - AvgT2 is always positive, approaching the
value s for 'big' clusters.
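
To put some (made up, purely illustrative) numbers on it: with
w = 500us, s = 100us and N = 10:

AvgT1 = (9*(2*500 + 2*100))/10 = 1080us
AvgT2 = (9*(2*500 + 100) + 1*100)/10 = 1000us
AvgT1 - AvgT2 = 100*(10-2)/10 = 80us saved per average get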

Note: the same performance benefit (i.e. s*(N-2)/N) applies in the case
of put, because on the remote node the object no longer needs to be
deserialized.
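
Here is a minimal sketch of what such a binary container could look
like (a hypothetical class for illustration, not a proposal for the
actual API; keys are kept as objects and only values are serialized,
for brevity). Puts store the received bytes as-is, gets pay one
deserialization, and the exact memory footprint of the values is
trivially measurable - which is also what the memory-based eviction
discussed below needs:

import java.io.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BinaryStore<K> {

    private final Map<K, byte[]> data = new ConcurrentHashMap<>();

    // put: keep the already-serialized form; the owning node never deserializes
    public void put(K key, byte[] serializedValue) {
        data.put(key, serializedValue);
    }

    // get: pay one deserialization (s) per read, even for local hits
    public Object get(K key) throws IOException, ClassNotFoundException {
        byte[] bytes = data.get(key);
        if (bytes == null) {
            return null;
        }
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    // exact size of the stored values in bytes, usable as an eviction trigger
    public long footprintInBytes() {
        long total = 0;
        for (byte[] b : data.values()) {
            total += b.length;
        }
        return total;
    }
}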

[1] this is based on the assumption that, when asked for a key that is
not located on the current node, the cache will always do a remote get
and not look into its local backup. Such a backup-lookup optimization
might improve the performance of the non-serialized approach. Manik, do
we have something like this?

Cheers,
Mircea

On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:

>
> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>
>> LazyMarshalling (use of marshalled values internally) does precisely
>> this.  If MarshalledValues are used, then calculating the size is
>> quite easy, except that it may always spike to 2x the byte[] when
>> both forms (serialized and unserialized) are used.  This is a spike,
>> though, and one form is always cleared out at the end of every
>> invocation.
> Marshalled values are different, as they keep either the serialized
> form or the Object.
> The approach I am talking about is to always keep only the serialized
> form, and deserialize on each read. This is needed in order to
> accurately determine the size of the cached objects.
>>
>>
>>
>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>
>>> Hi,
>>>
>>> While working on the Coherence config converter, I've seen that they
>>> are able to specify how much memory a cache should use at a time (we
>>> also had a thought about this in the past but abandoned the idea,
>>> mainly due to performance). E.g.:
>>> <backing-map-scheme>
>>> <local-scheme>
>>>   <high-units>100m</high-units>
>>>   <unit-calculator>BINARY</unit-calculator>
>>>   <eviction-policy>LRU</eviction-policy>
>>> </local-scheme>
>>> </backing-map-scheme>
>>> When 100 MB is reached, data will start to be evicted. I know we
>>> support marshalled values, but it's not exactly the same thing.
>>> I've been wondering how they achieve this, and I've found this:
>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>> The way it works (my understanding!) is by keeping the keys and
>>> values in serialized form within the map. So, if you have both the
>>> key and the value as a byte[], you can easily measure the memory
>>> footprint of the cached data.
>>> Now what about keeping data in maps in serialized form?
>>> Pros:
>>> - we would be able to support memory-based eviction triggers.
>>> - in DIST mode, when doing a put we won't need to deserialize the
>>> data at the other end. That deserialization may be redundant anyway:
>>> if another node asks for this data, we'll have to serialize it again.
>>> - the sync puts would be faster, as the data only gets serialized
>>> (and doesn't get deserialized at the other end).
>>> - ???
>>> Cons:
>>> - data would be deserialized for each get request, adding latency.
>>> This is partially compensated by the faster puts (see pros) and can
>>> be mitigated by using L1 caches (near caches)
>>> - ???
>>>
>>> Well, I'm not even sure this fits with our current architecture;
>>> I just brought it up for brainstorming :)
>>>
>>> Cheers,
>>> Mircea
>>
>> --
>> Manik Surtani
>> manik at jboss.org
>> Lead, Infinispan
>> Lead, JBoss Cache
>> http://www.infinispan.org
>> http://www.jbosscache.org