[infinispan-dev] storing in memory data in binary format

Mircea Markus mircea.markus at jboss.com
Thu Oct 15 10:15:11 EDT 2009


On Oct 15, 2009, at 5:12 PM, Sanne Grinovero wrote:

> This is not related to the form of the data in the "L1" local cache, right?
No, L1 would work with Objects.
> I'm relying on the fact that L1 helps with the Lucene indexes; it would be
> quite a waste of processing if I had to deserialize the content for each
> segment read, as it's read-mostly.
>
> Sanne
>
> 2009/10/15 Mircea Markus <mircea.markus at jboss.com>:
>> Does anyone see any functional reason for not keeping the data in binary
>> format?
>>
>> On Oct 15, 2009, at 4:25 PM, Manik Surtani wrote:
>>
>>> Interesting.  We should look at the lazy unmarshalling feature that we
>>> already have, implemented using the MarshalledValueInterceptor [1] and
>>> the MarshalledValue [2] wrapper class.  It probably provides some, if
>>> not all, of the benefits outlined below, and if not all, it can probably
>>> be enhanced to do so.
>>>
>>> Cheers
>>> Manik
>>>
>>> [1] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/interceptors/MarshalledValueInterceptor.java?r=903
>>> [2] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/marshall/MarshalledValue.java?r=756
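To make the lazy-unmarshalling idea concrete, here is a minimal sketch of a
wrapper that keeps either the object or its serialized form and converts on
demand. It is illustrative only -- the class name and fields are made up,
and the real MarshalledValue linked above is richer:

    import java.io.*;

    // Minimal sketch of a lazy-marshalling wrapper: it holds either the live
    // object or its serialized form and converts between the two on demand.
    final class LazyValue {
        private Object instance;   // object form, if materialized
        private byte[] raw;        // serialized form, if available

        LazyValue(Object instance) { this.instance = instance; }

        // Serialize on demand, e.g. before shipping the value to another node.
        byte[] toBytes() throws IOException {
            if (raw == null) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                    oos.writeObject(instance);
                }
                raw = bos.toByteArray();
            }
            return raw;
        }

        // Deserialize on demand, e.g. when a local reader needs the object.
        Object toObject() throws IOException, ClassNotFoundException {
            if (instance == null) {
                try (ObjectInputStream ois =
                        new ObjectInputStream(new ByteArrayInputStream(raw))) {
                    instance = ois.readObject();
                }
            }
            return instance;
        }
    }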
>>>
>>> On 6 Oct 2009, at 10:32, Mircea Markus wrote:
>>>
>>>> After some more thinking, keeping the data in binary format would bring
>>>> some further advantages in the case of DIST.
>>>> Given a cluster with N nodes.
>>>>
>>>> With the current implementation, when doing a cache.get():
>>>> i) if the data is on a remote node (the probability of this is (N-1)/N [1]):
>>>>      a) send a remote get (w - network latency)
>>>>      b) on the remote node, serialize the in-memory object and send it
>>>> over (s - serialization duration)
>>>>      c) the data is sent back (w)
>>>>      d) at the receiving node, deserialize it and pass it to the caller (s)
>>>> ii) if the data is on the same node, just return it (no serialization)
>>>>
>>>> So the average get time for a cluster made out of N nodes is:
>>>> AvgT1 = ((N-1)*i + 1*ii)/N = ((N-1)*(2*w + 2*s) + 0)/N
>>>>
>>>> On the other hand, when keeping the data in serialized format:
>>>> i) if the data is on a remote node (probability (N-1)/N):
>>>>    a) send a remote get (w - network latency)
>>>>    b) on the remote node, the already serialized data is sent across (w)
>>>>    c) at the receiving node, deserialize the data (s - deserialization
>>>> duration)
>>>> ii) if the data is on the same node, it still needs to be deserialized (s)
>>>> So the average get time for a cluster made out of N nodes is:
>>>> AvgT2 = ((N-1)*(2*w + s) + 1*s)/N
>>>>
>>>>
>>>> AvgT1 - AvgT2 = ((N-1)*s - s)/N = s*(N-2)/N
>>>>
>>>> So for N=2 the average performance is the same: AvgT1 - AvgT2 = 0.
>>>> For N greater than 2, AvgT1 - AvgT2 is always positive, approaching
>>>> the value s for 'big' clusters.
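To put rough numbers on the difference, here is a small throwaway snippet
(the w and s figures are invented for illustration, not measurements):

    // Illustrative only: w (network hop) and s ((de)serialization), in ms,
    // are made-up numbers, just to show how the gap grows towards s with N.
    public class AvgGetCost {
        public static void main(String[] args) {
            double w = 1.0, s = 0.5;
            for (int n = 2; n <= 16; n *= 2) {
                double avgT1 = (n - 1) * (2 * w + 2 * s) / n;     // object form
                double avgT2 = ((n - 1) * (2 * w + s) + s) / n;   // binary form
                System.out.printf("N=%2d  AvgT1=%.2f  AvgT2=%.2f  diff=%.2f%n",
                        n, avgT1, avgT2, avgT1 - avgT2);
            }
        }
    }

With these numbers the difference is 0 for N=2 and already about 0.44 ms for
N=16, i.e. close to s, matching s*(N-2)/N.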
>>>>
>>>> Note: the same performance benefit (i.e. s*(N-2)/N) applies in the case
>>>> of put, because on the remote node the object no longer needs to be
>>>> deserialized.
>>>>
>>>> [1] This is based on the assumption that when asking for a key that is
>>>> not located on the current node, the cache will always do a remote get
>>>> and not look into its local backup. This optimization might increase
>>>> the performance of the non-serialized approach. Manik, do we have
>>>> something like this?
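The optimization asked about would, very roughly, look like the sketch below.
The names (LocalFirstReader, remoteGet) are placeholders and not actual
Infinispan API; it only illustrates consulting a local backup copy before
paying the 2*w round trip:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Hypothetical sketch: check the local container (which may hold a backup
    // copy of a key we don't own) before issuing a remote get.
    final class LocalFirstReader {
        private final Map<Object, Object> localContainer = new ConcurrentHashMap<>();
        private final Function<Object, Object> remoteGet; // stands in for an RPC to an owner

        LocalFirstReader(Function<Object, Object> remoteGet) {
            this.remoteGet = remoteGet;
        }

        Object get(Object key) {
            Object local = localContainer.get(key);
            // Skip the network round trip whenever a local copy exists.
            return local != null ? local : remoteGet.apply(key);
        }
    }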
>>>>
>>>> Cheers,
>>>> Mircea
>>>>
>>>> On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:
>>>>
>>>>>
>>>>> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>>>>>
>>>>>> LazyMarshalling (use of marshalled values internally) does precisely
>>>>>> this.  If MarshalledValues are used, then calculating the size is
>>>>>> quite easy, except that it may always spike to 2x the byte[] when both
>>>>>> forms (serialized and unserialized) are used.  This is a spike though,
>>>>>> and one form is always cleared out at the end of every invocation.
>>>>> Marshalled values are different, as they keep either the serialized
>>>>> form or the Object.
>>>>> The approach I am talking about is to always keep the serialized form
>>>>> only, and to deserialize on each read. This is needed in order to
>>>>> accurately determine the size of the cached objects.
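Roughly what "always keep the serialized form only" means, as a sketch (the
class and its names are invented for illustration; plain JDK serialization is
used only to keep the example self-contained):

    import java.io.*;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Values live only as byte[]: every read pays a deserialization, and the
    // memory footprint of the cached values is just the sum of array lengths.
    final class BinaryStore {
        private final Map<Object, byte[]> store = new ConcurrentHashMap<>();
        private long bytesUsed;

        synchronized void put(Object key, Object value) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(value);
            }
            byte[] raw = bos.toByteArray();
            byte[] old = store.put(key, raw);
            bytesUsed += raw.length - (old == null ? 0 : old.length);
        }

        synchronized Object get(Object key) throws IOException, ClassNotFoundException {
            byte[] raw = store.get(key);
            if (raw == null) return null;
            try (ObjectInputStream ois =
                    new ObjectInputStream(new ByteArrayInputStream(raw))) {
                return ois.readObject();   // deserialized on every read
            }
        }

        synchronized long memoryFootprint() { return bytesUsed; }
    }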
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> While working on the Coherence config converter, I've seen that they
>>>>>>> are able to specify how much memory a cache may use at a time (we
>>>>>>> also thought about this in the past but abandoned the idea, mainly
>>>>>>> due to performance). E.g.:
>>>>>>> <backing-map-scheme>
>>>>>>>   <local-scheme>
>>>>>>>     <high-units>100m</high-units>
>>>>>>>     <unit-calculator>BINARY</unit-calculator>
>>>>>>>     <eviction-policy>LRU</eviction-policy>
>>>>>>>   </local-scheme>
>>>>>>> </backing-map-scheme>
>>>>>>> When 100 MB is reached, data will start to be evicted. I know we
>>>>>>> support marshalled values, but it's not exactly the same thing.
>>>>>>> I've been wondering how they achieve this, and I've found this:
>>>>>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>>>>>> The way it works (my understanding!) is by keeping the keys and
>>>>>>> values in serialized form within the map. So, if you have both the
>>>>>>> key and the value as a byte[], you can easily measure the memory
>>>>>>> footprint of the cached data.
>>>>>>> Now what about keeping the data in our maps in serialized form?
>>>>>>> Pros:
>>>>>>> - we would be able to support memory-based eviction triggers (see the
>>>>>>> sketch below, after the cons).
>>>>>>> - in DIST mode, when doing a put we won't need to deserialize the
>>>>>>> data at the other end. This deserialization might be redundant
>>>>>>> anyway: if another node asks for this data, we'll have to serialize
>>>>>>> it back, etc.
>>>>>>> - the sync puts would be faster, as the data only gets serialized
>>>>>>> (and doesn't get deserialized at the other end).
>>>>>>> - ???
>>>>>>> Cons:
>>>>>>> - the data would be deserialized for each get request, adding
>>>>>>> latency. This is partially compensated by the faster puts (see the
>>>>>>> pros) and can be mitigated by using L1 caches (near caches).
>>>>>>> - ???
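A memory-based eviction trigger over serialized values could look roughly
like this. It is a sketch only, mirroring Coherence's <high-units>/BINARY
idea; the class name is invented, sizes are simply the byte[] lengths, and
LRU comes from an access-ordered LinkedHashMap:

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Evicts least-recently-used entries once the total size of the stored
    // byte[] values exceeds a configured limit (e.g. 100 MB). Illustrative
    // only, and not thread-safe.
    final class BoundedBinaryMap extends LinkedHashMap<Object, byte[]> {
        private final long maxBytes;
        private long currentBytes;

        BoundedBinaryMap(long maxBytes) {
            super(16, 0.75f, true);   // access order => iteration starts at the LRU entry
            this.maxBytes = maxBytes;
        }

        @Override
        public byte[] put(Object key, byte[] value) {
            byte[] old = super.put(key, value);
            currentBytes += value.length - (old == null ? 0 : old.length);
            // Drop LRU entries until we are back under the byte limit.
            Iterator<Map.Entry<Object, byte[]>> it = entrySet().iterator();
            while (currentBytes > maxBytes && it.hasNext()) {
                currentBytes -= it.next().getValue().length;
                it.remove();
            }
            return old;
        }
    }

Calling new BoundedBinaryMap(100L * 1024 * 1024) would then roughly
correspond to the <high-units>100m</high-units> + BINARY + LRU combination
above.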
>>>>>>>
>>>>>>> Well, I'm not even sure this fits with our actual architecture; I
>>>>>>> just brought it up for brainstorming :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mircea