[infinispan-dev] storing in memory data in binary format

Manik Surtani manik at jboss.org
Thu Oct 15 11:34:26 EDT 2009


On 15 Oct 2009, at 15:15, Mircea Markus wrote:

>
> On Oct 15, 2009, at 5:12 PM, Sanne Grinovero wrote:
>
>> This is not related to the form of data in "L1" local cache right?
> No, L1 would work with Objects.

Umm, from a container perspective, there is no difference between L1  
and the rest of the cache.

The difference between a serialized and deserialized form is  
encapsulated in the MarshalledValue object, and this is where the  
optimisations happen (in either direction).  This is what we need to  
look at and make sure it satisfies all cases.
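The idea behind MarshalledValue can be sketched roughly like this — a toy using plain Java serialization, illustrative only (the real class, linked in the quoted mail below, uses Infinispan's marshaller and is considerably more involved):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Toy sketch of lazy (un)marshalling: hold either the live object or its
// serialized byte[] form, converting on demand. Class and method names
// are made up for illustration.
public class LazyValue {
    private Object instance; // deserialized form, if currently held
    private byte[] raw;      // serialized form, if currently held

    public LazyValue(Object instance) {
        this.instance = instance;
    }

    // Produce (and cache) the serialized form lazily.
    public synchronized byte[] getRaw() throws IOException {
        if (raw == null) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(instance);
            }
            raw = bos.toByteArray();
        }
        return raw;
    }

    // Produce (and cache) the object form lazily.
    public synchronized Object get() throws IOException, ClassNotFoundException {
        if (instance == null) {
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(raw))) {
                instance = ois.readObject();
            }
        }
        return instance;
    }

    // At the end of an invocation, drop one form: both forms coexist only
    // briefly, which is the 2x memory "spike" discussed further down.
    public synchronized void compact(boolean keepRaw)
            throws IOException, ClassNotFoundException {
        if (keepRaw) { getRaw(); instance = null; }
        else         { get(); raw = null; }
    }
}
```

The point is that either form can be dropped and regenerated from the other, so the container itself never needs to care which representation it is holding.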

>> I'm relying on L1 helping for Lucene indexes; it would be quite a
>> waste of processing if I had to deserialize the content for each
>> segment read, as the access pattern is read-mostly.
>>
>> Sanne
>>
>> 2009/10/15 Mircea Markus <mircea.markus at jboss.com>:
>>> Does anyone see any functional reason for not keeping the data in
>>> binary format?
>>>
>>> On Oct 15, 2009, at 4:25 PM, Manik Surtani wrote:
>>>
>>>> Interesting.  We should look at the lazy unmarshalling feature that
>>>> we already have, implemented using the MarshalledValueInterceptor [1]
>>>> and the MarshalledValue [2] wrapper class.  It probably provides
>>>> some, if not all, of the benefits outlined below, and where it falls
>>>> short it can probably be enhanced.
>>>>
>>>> Cheers
>>>> Manik
>>>>
>>>> [1] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/interceptors/MarshalledValueInterceptor.java?r=903
>>>> [2] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/marshall/MarshalledValue.java?r=756
>>>>
>>>> On 6 Oct 2009, at 10:32, Mircea Markus wrote:
>>>>
>>>>> After some more thinking, keeping data in binary format will bring
>>>>> some more advantages in the case of DIST.
>>>>>
>>>>> Given a cluster with N nodes.
>>>>>
>>>>> With the current implementation, when doing a cache.get():
>>>>> i) if the data is on a remote node (probability for this is (N-1)/N [1]):
>>>>>    a) send a remote get (w - network latency)
>>>>>    b) on the remote node, serialize the in-memory object and send
>>>>>       it over (s - serialization duration)
>>>>>    c) the data is sent back (w)
>>>>>    d) at the receiving node, deserialize it and pass it to the
>>>>>       caller (s)
>>>>> ii) if the data is on the same node, just return it (no serialization)
>>>>>
>>>>> So the average get cost for a cluster made up of N nodes is:
>>>>> AvgT1 = ((N-1)*i + 1*ii)/N = ((N-1)*(2*w + 2*s) + 0)/N
>>>>>
>>>>> On the other hand, when keeping the data in serialized format:
>>>>> i) if the data is on a remote node (probability is (N-1)/N):
>>>>>    a) send a remote get (w - network latency)
>>>>>    b) on the remote node, the already serialized data is sent
>>>>>       across (w)
>>>>>    c) at the receiving node, deserialize the data (s)
>>>>> ii) if the data is on the same node, it needs to be deserialized (s)
>>>>> So the average get cost for a cluster made up of N nodes is:
>>>>> AvgT2 = ((N-1)*(2*w + s) + 1*s)/N
>>>>>
>>>>>
>>>>> AvgT1-AvgT2 =  ((N-1)s - s)/N = s(N-2)/N
>>>>>
>>>>> So for N=2 the average performance is the same: AvgT1-AvgT2 = 0.
>>>>> For values of N greater than 2, AvgT1-AvgT2 is always positive,
>>>>> approaching the value s for 'big' clusters.
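The difference derived above can be checked numerically. This is a quick illustrative sketch (the class and method names, and the values chosen for w and s, are all made up):

```java
// Numeric sanity check of AvgT1 - AvgT2 = s*(N-2)/N, using arbitrary
// latency units for w (network round trip) and s (serialization).
public class AvgGetCost {
    // Average get cost when entries are stored as live objects.
    static double avgT1(int n, double w, double s) {
        return ((n - 1) * (2 * w + 2 * s)) / n;
    }

    // Average get cost when entries are stored in serialized form.
    static double avgT2(int n, double w, double s) {
        return ((n - 1) * (2 * w + s) + s) / n;
    }

    public static void main(String[] args) {
        double w = 1.0, s = 0.4;
        for (int n = 2; n <= 64; n *= 2) {
            double diff = avgT1(n, w, s) - avgT2(n, w, s);
            double predicted = s * (n - 2) / n;
            System.out.printf("N=%-2d diff=%.4f predicted=%.4f%n",
                              n, diff, predicted);
        }
    }
}
```

For N=2 the two averages coincide, and as N grows the measured difference climbs toward s, matching the closed form.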
>>>>>
>>>>> Note: the same performance benefit (i.e. s(N-2)/N) applies in the
>>>>> case of put, because on the remote node the object is no longer
>>>>> deserialized.
>>>>>
>>>>> [1] this is based on the assumption that when asking for a key that
>>>>> is not located on the current node, the cache will always do a
>>>>> remote get and not look into its local backup. That optimization
>>>>> might increase the performance of the non-serialized approach.
>>>>> Manik, do we have something like this?
>>>>>
>>>>> Cheers,
>>>>> Mircea
>>>>>
>>>>> On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:
>>>>>
>>>>>>
>>>>>> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>>>>>>
>>>>>>> Lazy marshalling (the use of marshalled values internally) does
>>>>>>> precisely this.  If MarshalledValues are used, calculating the
>>>>>>> size is quite easy, except that it may temporarily spike to 2x the
>>>>>>> byte[] size when both forms (serialized and deserialized) are
>>>>>>> held.  This is only a spike, though; one form is always cleared
>>>>>>> out at the end of every invocation.
>>>>>> MarshalledValue is different, as it keeps either the serialized
>>>>>> form or the Object.
>>>>>> The approach I am talking about is to always keep only the
>>>>>> serialized form, and deserialize on each read. This is needed in
>>>>>> order to accurately determine the size of the cached objects.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> While working on the Coherence config converter, I've seen that
>>>>>>>> they are able to specify how much memory a cache may use at a
>>>>>>>> time (we also thought about this in the past but abandoned the
>>>>>>>> idea, mainly due to performance). E.g.:
>>>>>>>> <backing-map-scheme>
>>>>>>>>   <local-scheme>
>>>>>>>>     <high-units>100m</high-units>
>>>>>>>>     <unit-calculator>BINARY</unit-calculator>
>>>>>>>>     <eviction-policy>LRU</eviction-policy>
>>>>>>>>   </local-scheme>
>>>>>>>> </backing-map-scheme>
>>>>>>>> When 100 MB is reached, data will start to be evicted. I know we
>>>>>>>> support marshalled values, but it's not exactly the same thing.
>>>>>>>> I've been wondering how they achieve this, and I've found this:
>>>>>>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>>>>>>> The way it works (my understanding!) is by keeping the keys and
>>>>>>>> values in serialized form within the map. So, if you have both
>>>>>>>> the key and the value as a byte[], you can easily measure the
>>>>>>>> memory footprint of the cached data.
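The byte[]-based accounting described above can be sketched as a toy LRU cache with a byte high-water mark. The names here are illustrative only — this is neither the Coherence nor the Infinispan API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch: storing values as byte[] makes the cache footprint a plain
// sum of array lengths, so a byte-based limit with LRU eviction (cf. the
// Coherence high-units / BINARY unit-calculator config) falls out easily.
public class BinaryLruCache {
    private final long maxBytes;        // the byte high-water mark
    private long currentBytes;
    private final LinkedHashMap<String, byte[]> map =
            new LinkedHashMap<>(16, 0.75f, true); // access order = LRU

    public BinaryLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Charge each entry its key bytes plus its value bytes.
    private static long cost(String key, byte[] value) {
        return key.getBytes(StandardCharsets.UTF_8).length + value.length;
    }

    public void put(String key, byte[] value) {
        byte[] old = map.put(key, value);
        currentBytes += cost(key, value);
        if (old != null) currentBytes -= cost(key, old);
        // Evict least-recently-used entries until back under the limit.
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> lru = it.next();
            currentBytes -= cost(lru.getKey(), lru.getValue());
            it.remove();
        }
    }

    public byte[] get(String key) { return map.get(key); }
    public long sizeInBytes() { return currentBytes; }
}
```

With live objects, by contrast, the per-entry size would have to be estimated (deep object-graph walking or instrumentation), which is exactly why exact memory-based eviction is hard without the serialized form.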
>>>>>>>> Now what about keeping data in our maps in serialized form?
>>>>>>>> Pros:
>>>>>>>> - we would be able to support memory-based eviction triggers.
>>>>>>>> - in DIST mode, when doing a put we won't need to deserialize the
>>>>>>>> data at the other end. That deserialization might be redundant
>>>>>>>> anyway: if another node asks for this data, we'll have to
>>>>>>>> serialize it back, etc.
>>>>>>>> - sync puts would be faster, as the data only gets serialized
>>>>>>>> (and doesn't get deserialized at the other end).
>>>>>>>> - ???
>>>>>>>> Cons:
>>>>>>>> - data would be deserialized on each get request, adding latency.
>>>>>>>> This is partially compensated by the faster puts (see pros) and
>>>>>>>> can be mitigated by using L1 caches (near caches).
>>>>>>>> - ???
>>>>>>>>
>>>>>>>> Well, I'm not even sure that this fits with our actual
>>>>>>>> architecture; I just brought it up for brainstorming :)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Mircea
>>>>>>>> _______________________________________________
>>>>>>>> infinispan-dev mailing list
>>>>>>>> infinispan-dev at lists.jboss.org
>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>
>>>>
>>>
>>
>

--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org






