[infinispan-dev] storing in memory data in binary format

Sanne Grinovero sanne.grinovero at gmail.com
Thu Oct 15 10:12:57 EDT 2009


This is not related to the form of the data in the "L1" local cache, right?
I'm relying on L1 to help with the Lucene indexes; it would be quite a
waste of processing if I had to deserialize the content on each segment
read, since the access pattern is read-mostly.

Sanne

2009/10/15 Mircea Markus <mircea.markus at jboss.com>:
> Does anyone see any functional reason for not keeping the data in binary
> format?
>
> On Oct 15, 2009, at 4:25 PM, Manik Surtani wrote:
>
>> Interesting.  We should look at the lazy unmarshalling feature that we
>> already have, implemented using the MarshalledValueInterceptor [1]
>> and the MarshalledValue [2] wrapper class.  It probably provides some,
>> if not all, of the benefits outlined below, and where it doesn't, it
>> can probably be enhanced to do so.
>>
>> Cheers
>> Manik
>>
>> [1] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/interceptors/MarshalledValueInterceptor.java?r=903
>> [2] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/marshall/MarshalledValue.java?r=756
>>
>> On 6 Oct 2009, at 10:32, Mircea Markus wrote:
>>
>>> After some more thinking, keeping data in binary format will bring
>>> some more advantages in the case of DIST.
>>>
>>> Given a cluster with N nodes.
>>>
>>> With the current implementation, when doing a cache.get():
>>> i) if the data is on a remote node (probability for this is (N-1)/N [1])
>>>    a) send a remote get (w - network latency)
>>>    b) on the remote node, serialize the in-memory object
>>>       (s - serialization duration)
>>>    c) the data is sent back (w)
>>>    d) at the receiving node, deserialize it and pass it to the
>>>       caller (s)
>>> ii) if the data is on the same node, just return it (no serialization)
>>>
>>> so the average get performance for a cluster made out of N nodes:
>>> AvgT1 = ((N-1)* i + 1* ii)/N = ((N-1)* (2*w + 2*s) + 0)/N
>>>
>>> On the other hand, when keeping the data in serialized format:
>>> i) if the data is on a remote node (probability is (N-1)/N)
>>>    a) send a remote get (w - network latency)
>>>    b) on the remote node, the already serialized data is sent
>>>       across (w)
>>>    c) at the receiving node, deserialize the data (s - deserialization
>>>       duration)
>>> ii) if the data is on the same node, it still needs to be
>>>     deserialized (s)
>>> so the average get performance for a cluster made out of N nodes:
>>> AvgT2 = ((N-1) * (2*w + s) + 1*s)/N
>>>
>>>
>>> AvgT1 - AvgT2 = ((N-1)*s - s)/N = s*(N-2)/N
>>>
>>> So for N=2 the average performance is the same: AvgT1 - AvgT2 = 0.
>>> For N greater than 2, AvgT1 - AvgT2 is always positive, approaching
>>> the value s for 'big' clusters.
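>>>
>>> To sanity-check the break-even point, here is a small, purely
>>> illustrative Java sketch (not Infinispan code; the values for w and s
>>> are made up) that evaluates both formulas:
>>>
>>> public class AvgGetCost {
>>>     // w = network latency, s = (de)serialization cost, arbitrary units
>>>     static double avgT1(int n, double w, double s) {
>>>         // current impl: remote get costs 2w + 2s, local get is free
>>>         return ((n - 1) * (2 * w + 2 * s)) / n;
>>>     }
>>>     static double avgT2(int n, double w, double s) {
>>>         // binary storage: remote get costs 2w + s, local get costs s
>>>         return ((n - 1) * (2 * w + s) + s) / n;
>>>     }
>>>     public static void main(String[] args) {
>>>         double w = 1.0, s = 0.5; // made-up numbers
>>>         for (int n = 2; n <= 16; n *= 2) {
>>>             double diff = avgT1(n, w, s) - avgT2(n, w, s);
>>>             System.out.printf("N=%d diff=%.3f, s*(N-2)/N=%.3f%n",
>>>                     n, diff, s * (n - 2) / n);
>>>         }
>>>     }
>>> }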
>>>
>>> Note: the same performance benefit (i.e. s*(N-2)/N) applies in the
>>> case of put, because on the remote node the object no longer needs
>>> to be deserialized.
>>>
>>> [1] This is based on the assumption that, when asking for a key that
>>> is not located on the current node, the cache will always do a remote
>>> get and not look into its local backup. That optimization might
>>> increase the performance of the non-serialized approach. Manik, do we
>>> have something like this?
>>>
>>> Cheers,
>>> Mircea
>>>
>>> On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:
>>>
>>>>
>>>> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>>>>
>>>>> LazyMarshalling (the use of marshalled values internally) does
>>>>> precisely this.  If MarshalledValues are used, then calculating the
>>>>> size is quite easy, except that it may temporarily spike to 2x the
>>>>> byte[] size when both forms (serialized and deserialized) are held.
>>>>> This is only a spike though, and one form is always cleared out at
>>>>> the end of every invocation.
>>>> MarshalledValue is different, as it keeps either the serialized form
>>>> or the Object.
>>>> The approach I am talking about is to always keep only the serialized
>>>> form, and deserialize on each read. This is needed in order to
>>>> accurately determine the size of the cached objects.
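>>>>
>>>> To make the distinction concrete, a rough sketch of what such a
>>>> wrapper could look like (a hypothetical class, not the existing
>>>> MarshalledValue; plain JDK serialization is used only to keep the
>>>> example self-contained):
>>>>
>>>> import java.io.*;
>>>>
>>>> public final class BinaryValue {
>>>>     private final byte[] bytes;
>>>>
>>>>     public BinaryValue(Object value) throws IOException {
>>>>         ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>>>         ObjectOutputStream oos = new ObjectOutputStream(bos);
>>>>         oos.writeObject(value);
>>>>         oos.close();
>>>>         this.bytes = bos.toByteArray();
>>>>     }
>>>>
>>>>     // every read pays the deserialization cost 's'
>>>>     public Object get() throws IOException, ClassNotFoundException {
>>>>         ObjectInputStream ois = new ObjectInputStream(
>>>>                 new ByteArrayInputStream(bytes));
>>>>         return ois.readObject();
>>>>     }
>>>>
>>>>     // exact memory footprint of the stored form
>>>>     public int sizeInBytes() {
>>>>         return bytes.length;
>>>>     }
>>>> }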
>>>>>
>>>>>
>>>>>
>>>>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> While working on the Coherence config converter, I've seen that
>>>>>> they are able to specify how much memory a cache should use at a
>>>>>> time (we also thought about this in the past but abandoned the
>>>>>> idea, mainly due to performance). E.g.:
>>>>>> <backing-map-scheme>
>>>>>>   <local-scheme>
>>>>>>     <high-units>100m</high-units>
>>>>>>     <unit-calculator>BINARY</unit-calculator>
>>>>>>     <eviction-policy>LRU</eviction-policy>
>>>>>>   </local-scheme>
>>>>>> </backing-map-scheme>
>>>>>> When 100 MB is reached, data will start to be evicted. I know we
>>>>>> support marshalled values, but it's not exactly the same thing.
>>>>>> I've been wondering how they achieve this, and I've found this:
>>>>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>>>>> The way it works (my understanding!) is by keeping the keys and
>>>>>> values in serialized form within the map. So, if you have both the
>>>>>> key and the value as a byte[], you can easily measure the memory
>>>>>> footprint of the cached data.
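>>>>>>
>>>>>> As a purely illustrative sketch of that idea (not Coherence's or
>>>>>> Infinispan's code; the class name and details are made up), a
>>>>>> byte-bounded map with LRU eviction could look roughly like this:
>>>>>>
>>>>>> import java.nio.ByteBuffer;
>>>>>> import java.util.Iterator;
>>>>>> import java.util.LinkedHashMap;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> public class ByteBoundedCache {
>>>>>>     private final long maxBytes;
>>>>>>     private long currentBytes;
>>>>>>     // access-ordered map gives us LRU iteration order
>>>>>>     private final LinkedHashMap<ByteBuffer, byte[]> map =
>>>>>>             new LinkedHashMap<ByteBuffer, byte[]>(16, 0.75f, true);
>>>>>>
>>>>>>     public ByteBoundedCache(long maxBytes) {
>>>>>>         this.maxBytes = maxBytes;
>>>>>>     }
>>>>>>
>>>>>>     // ByteBuffer keys give content-based equals/hashCode for byte[]
>>>>>>     public void put(byte[] key, byte[] value) {
>>>>>>         byte[] old = map.put(ByteBuffer.wrap(key), value);
>>>>>>         currentBytes += key.length + value.length;
>>>>>>         if (old != null) {
>>>>>>             currentBytes -= key.length + old.length;
>>>>>>         }
>>>>>>         evictIfNeeded();
>>>>>>     }
>>>>>>
>>>>>>     public byte[] get(byte[] key) {
>>>>>>         return map.get(ByteBuffer.wrap(key));
>>>>>>     }
>>>>>>
>>>>>>     // evict least-recently-used entries until under the byte limit
>>>>>>     private void evictIfNeeded() {
>>>>>>         Iterator<Map.Entry<ByteBuffer, byte[]>> it =
>>>>>>                 map.entrySet().iterator();
>>>>>>         while (currentBytes > maxBytes && it.hasNext()) {
>>>>>>             Map.Entry<ByteBuffer, byte[]> e = it.next();
>>>>>>             currentBytes -= e.getKey().capacity() + e.getValue().length;
>>>>>>             it.remove();
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>>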
>>>>>> Now what about keeping data in maps in serialized form?
>>>>>> Pros:
>>>>>> - we would be able to support memory based eviction triggers.
>>>>>> - in DIST mode, when doing a put we won't need to deserialize the
>>>>>> data at the other end. That deserialization might be redundant
>>>>>> anyway: if another node asks for this data, we'd have to serialize
>>>>>> it right back, etc.
>>>>>> - sync puts would be faster, as the data only gets serialized (and
>>>>>> doesn't get deserialized at the other end).
>>>>>> - ???
>>>>>> Cons:
>>>>>> - data would be deserialized for each get request, adding latency.
>>>>>> This is partially compensated by faster puts (see pros above) and
>>>>>> can be mitigated by using L1 caches (near caches)
>>>>>> - ???
>>>>>>
>>>>>> Well, I'm not even sure that this fits with our current
>>>>>> architecture; I just brought it up for brainstorming :)
>>>>>>
>>>>>> Cheers,
>>>>>> Mircea
>>>>>> _______________________________________________
>>>>>> infinispan-dev mailing list
>>>>>> infinispan-dev at lists.jboss.org
>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>
>>>>> --
>>>>> Manik Surtani
>>>>> manik at jboss.org
>>>>> Lead, Infinispan
>>>>> Lead, JBoss Cache
>>>>> http://www.infinispan.org
>>>>> http://www.jbosscache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> infinispan-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> --
>> Manik Surtani
>> manik at jboss.org
>> Lead, Infinispan
>> Lead, JBoss Cache
>> http://www.infinispan.org
>> http://www.jbosscache.org
>>
>>
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>



