[infinispan-dev] storing in memory data in binary format

Michael Neale michael.neale at gmail.com
Tue Oct 6 06:05:30 EDT 2009


And here I was thinking storing things as a byte[] was silly ;) interesting...

On Tue, Oct 6, 2009 at 8:32 PM, Mircea Markus <mircea.markus at jboss.com> wrote:
> After some more thinking, keeping data in binary format will bring
> some more advantages in the case of DIST.
>
> Given a cluster with N nodes.
>
> With the current implementation, when doing a cache.get():
> i) if the data is on a remote node (probability for this is (N-1)/N [1]):
>         a) send a remote get (w - network latency)
>         b) on the remote node, serialize the in-memory object (s -
> serialization duration)
>         c) the data is sent back (w)
>         d) at the receiving node, deserialize it and pass it to the
> caller (s)
> ii) if the data is on the same node, just return it (no serialization)
>
> so the average get performance for a cluster made out of N nodes:
> AvgT1 = ((N-1)*i + 1*ii)/N = ((N-1)*(2*w + 2*s) + 0)/N
>
> On the other hand, when keeping the data in serialized format:
> i) if the data is on a remote node (probability is (N-1)/N):
>     a) send a remote get (w - network latency)
>     b) on the remote node, the already serialized data is sent across
> (w)
>     c) at the receiving node, deserialize the data (s - duration)
> ii) if the data is on the same node, it still needs to be
> deserialized - s
> so the average get performance for a cluster made out of N nodes:
> AvgT2 = ((N-1)*(2*w + s) + 1*s)/N
>
>
> AvgT1 - AvgT2 = ((N-1)*(2*w + 2*s) - ((N-1)*(2*w + s) + s))/N
>               = ((N-1)*s - s)/N = s*(N-2)/N
>
> so for N=2 the average performance is the same: AvgT1 - AvgT2 = 0.
> For values greater than 2, AvgT1 - AvgT2 is always positive,
> approaching s for 'big' clusters.
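>
> To sanity-check these formulas, here is a throwaway Java snippet that
> just evaluates the two averages with some invented numbers (w = 1.0 ms
> network latency, s = 0.5 ms serialization cost - both made up purely
> for illustration):
>
>     public class AvgGetCost {
>         public static void main(String[] args) {
>             double w = 1.0, s = 0.5; // assumed latencies, in ms
>             for (int n = 2; n <= 10; n++) {
>                 // object form: remote hit costs 2w + 2s, local hit is free
>                 double avgT1 = ((n - 1) * (2 * w + 2 * s)) / n;
>                 // binary form: remote hit costs 2w + s, local hit costs s
>                 double avgT2 = ((n - 1) * (2 * w + s) + s) / n;
>                 System.out.printf("N=%d avgT1=%.2f avgT2=%.2f diff=%.2f%n",
>                         n, avgT1, avgT2, avgT1 - avgT2);
>             }
>         }
>     }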
>
> Note: the same performance benefit (i.e. s*(N-2)/N) applies in the
> case of put, because on the remote node the object is no longer
> deserialized.
>
> [1] this is based on the assumption that when asking for a key that is
> not located on the current node, the cache will always do a remote get
> and not look into its local backup. This optimization might increase
> the performance of the non-serialized approach. Manik, do we have
> something like this?
>
> Cheers,
> Mircea
>
> On Oct 2, 2009, at 5:57 PM, Mircea Markus wrote:
>
>>
>> On Oct 2, 2009, at 5:13 PM, Manik Surtani wrote:
>>
>>> LazyMarshalling (use of marshalled values internally) does precisely
>>> this.  If MarshalledValues are used, then calculating the size is
>>> quite easy, except that it may spike to 2x the byte[] size when both
>>> forms (serialized and deserialized) are held.  This is only a
>>> transient spike, though; one form is always cleared out at the end
>>> of every invocation.
>> Marshalled values are different, as they keep either the serialized
>> form or the Object.
>> The approach I am talking about is to always keep only the serialized
>> form, and deserialize on each read. This is needed in order to
>> accurately determine the size of the cached objects.
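>>
>> Something like this toy sketch, using plain JDK serialization (not
>> our Marshaller, and with error handling and stream pooling
>> deliberately ignored):
>>
>>     import java.io.*;
>>     import java.util.concurrent.ConcurrentHashMap;
>>     import java.util.concurrent.ConcurrentMap;
>>
>>     // Toy illustration only: the map never holds anything but byte[],
>>     // so every read pays a deserialization and the cache size is just
>>     // the sum of the array lengths.
>>     public class BinaryMap<K, V extends Serializable> {
>>         private final ConcurrentMap<K, byte[]> store =
>>                 new ConcurrentHashMap<K, byte[]>();
>>
>>         public void put(K key, V value) throws IOException {
>>             ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>             ObjectOutputStream oos = new ObjectOutputStream(bos);
>>             oos.writeObject(value);
>>             oos.close();
>>             store.put(key, bos.toByteArray()); // serialized form only
>>         }
>>
>>         @SuppressWarnings("unchecked")
>>         public V get(K key) throws IOException, ClassNotFoundException {
>>             byte[] bytes = store.get(key);
>>             if (bytes == null) return null;
>>             ObjectInputStream ois =
>>                     new ObjectInputStream(new ByteArrayInputStream(bytes));
>>             return (V) ois.readObject(); // deserialize on every read
>>         }
>>
>>         public long sizeInBytes() { // exact value payload, no guessing
>>             long total = 0;
>>             for (byte[] b : store.values()) total += b.length;
>>             return total;
>>         }
>>     }
>>
>> With the Object kept in the map you can only estimate its size; here
>> sizeInBytes() is exact, which is what a memory-based eviction trigger
>> needs.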
>>>
>>>
>>>
>>> On 2 Oct 2009, at 12:44, Mircea Markus wrote:
>>>
>>>> Hi,
>>>>
>>>> While working on the Coherence config converter, I've seen that they
>>>> are able to specify how much memory a cache should use at a time (we
>>>> also thought about this in the past but abandoned the idea, mainly
>>>> due to performance). E.g.:
>>>> <backing-map-scheme>
>>>>   <local-scheme>
>>>>     <high-units>100m</high-units>
>>>>     <unit-calculator>BINARY</unit-calculator>
>>>>     <eviction-policy>LRU</eviction-policy>
>>>>   </local-scheme>
>>>> </backing-map-scheme>
>>>> When 100 MB is reached, data will start to be evicted. I know we
>>>> support marshalled values, but it's not exactly the same thing.
>>>> I've been wondering how they achieve this, and I found this:
>>>> http://coherence.oracle.com/display/COH35UG/Storage+and+Backing+Map
>>>> The way it works (my understanding!) is by keeping the keys and
>>>> values in serialized form within the map. So, if you have both the
>>>> key and the value as a byte[], you can easily measure the memory
>>>> footprint of the cached data.
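>>>>
>>>> To make the BINARY unit-calculator idea concrete, here is a toy
>>>> sketch (all names are made up; this is neither the Coherence API
>>>> nor a proposal for ours):
>>>>
>>>>     import java.util.Iterator;
>>>>     import java.util.Map;
>>>>
>>>>     public class BinaryHighUnits {
>>>>         // the equivalent of <high-units>100m</high-units>
>>>>         static final long MAX_BYTES = 100L * 1024 * 1024;
>>>>
>>>>         // exact, because the stored values are already byte[]
>>>>         // (serialized keys could be summed the same way)
>>>>         static long usedBytes(Map<?, byte[]> backingMap) {
>>>>             long total = 0;
>>>>             for (byte[] v : backingMap.values()) total += v.length;
>>>>             return total;
>>>>         }
>>>>
>>>>         // evict until back under the high-water mark; a real
>>>>         // version would pick LRU victims, this just drops the
>>>>         // first entry it finds
>>>>         static void enforceHighUnits(Map<?, byte[]> backingMap) {
>>>>             while (usedBytes(backingMap) > MAX_BYTES
>>>>                     && !backingMap.isEmpty()) {
>>>>                 Iterator<?> it = backingMap.keySet().iterator();
>>>>                 it.next();
>>>>                 it.remove();
>>>>             }
>>>>         }
>>>>     }
>>>>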
>>>> Now what about keeping data in maps in serialized form?
>>>> Pros:
>>>> - we would be able to support memory-based eviction triggers.
>>>> - in DIST mode, when doing a put we won't need to deserialize the
>>>> data at the other end. This deserialization might be redundant
>>>> anyway: if another node asks for this data, we'd have to serialize
>>>> it right back.
>>>> - the sync puts would be faster, as the data only gets serialized
>>>> (and doesn't get deserialized at the other end).
>>>> - ???
>>>> Cons:
>>>> - data would be deserialized for each get request, adding latency.
>>>> This is partially compensated by faster puts (see pros) and can be
>>>> mitigated by using L1 caches (near caches).
>>>> - ???
>>>>
>>>> Well, I'm not even sure that this fits with our current architecture;
>>>> just bringing it up for brainstorming :)
>>>>
>>>> Cheers,
>>>> Mircea
>>>
>>> --
>>> Manik Surtani
>>> manik at jboss.org
>>> Lead, Infinispan
>>> Lead, JBoss Cache
>>> http://www.infinispan.org
>>> http://www.jbosscache.org
>>>
>>>
>>>
>>>
>>
>



-- 
Michael D Neale
home: www.michaelneale.net
blog: michaelneale.blogspot.com



