Re: [infinispan-dev] Compressing Marshaller Wrapper

Friday, 26 February 2010

Agreed, no urgency at all.

But basically, you see two different types of similar API parameters here: streams (of
bytes) and byte arrays.
IMHO a simple way to refactor Marshaller would be to create StreamMarshaller and
ByteArrayMarshaller.
And then it is very possible that StreamMarshaller will simply fill a byte array (or use
the one in a ByteArray*Stream) and call ByteArrayMarshaller.
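
A rough sketch of that split, just to illustrate the idea. The interface and method names below (ByteArrayMarshaller, StreamMarshaller, toBytes, fromBytes, roundTrip) are made up for this mail, not the existing Infinispan API, and plain JDK serialization stands in for the real wire format:

```java
import java.io.*;

final class MarshallerSplitSketch {
    /** Hypothetical byte-array-only contract. */
    interface ByteArrayMarshaller {
        byte[] toBytes(Object o) throws IOException;
        Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException;
    }

    /** Hypothetical stream-only contract. */
    interface StreamMarshaller {
        void write(Object o, OutputStream out) throws IOException;
        Object read(InputStream in) throws IOException, ClassNotFoundException;
    }

    /** JDK serialization used as a placeholder for the real format. */
    static final class JdkByteArrayMarshaller implements ByteArrayMarshaller {
        public byte[] toBytes(Object o) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                oos.writeObject(o);
            }
            return buf.toByteArray();
        }

        public Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return ois.readObject();
            }
        }
    }

    /** Stream marshaller that fills a byte array and calls the byte-array one. */
    static final class DelegatingStreamMarshaller implements StreamMarshaller {
        private final ByteArrayMarshaller delegate;

        DelegatingStreamMarshaller(ByteArrayMarshaller delegate) {
            this.delegate = delegate;
        }

        public void write(Object o, OutputStream out) throws IOException {
            byte[] bytes = delegate.toBytes(o);
            DataOutputStream data = new DataOutputStream(out);
            data.writeInt(bytes.length); // length prefix so the reader knows how much to pull
            data.write(bytes);
            data.flush();
        }

        public Object read(InputStream in) throws IOException, ClassNotFoundException {
            DataInputStream data = new DataInputStream(in);
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            return delegate.fromBytes(bytes);
        }
    }

    /** Writes a value to an in-memory stream and reads it back. */
    static Object roundTrip(Object value) {
        StreamMarshaller m = new DelegatingStreamMarshaller(new JdkByteArrayMarshaller());
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            m.write(value, wire);
            return m.read(new ByteArrayInputStream(wire.toByteArray()));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The length prefix is one possible framing choice, not something the current Marshaller does as far as I know.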

But right now most users of Marshaller use this API asymmetrically, mixing byte arrays and
streams, which lowers the architecture quality ;-)

If you only offer the stream interface (and everybody wraps their byte arrays in streams
themselves) then it is very easy to add a lot of useful stream processors (like versioning
/ compressing / logging).

Ultimately, if you only use streams you improve concurrency too, since streams pull data
out of pipes (like HTTP connectors / file readers) and the thread just waits until the
stream is closed.

Then it is not so difficult to play Bob the plumber like this:

FileInputStream* > GZIPInputStream* > VersionAwareInputStream >
KeepACopyInTheCacheOfAllBytesGoingThroughThisInputStream > CacheStreamApi*
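
To make the plumbing a bit more concrete, here is a minimal sketch of part of that chain. GZIPInputStream and FilterInputStream are the real JDK classes; CopyingInputStream is a hypothetical stand-in for the "keep a copy" stage, a ByteArrayInputStream replaces the file, and the version-aware stage and the cache API are left out:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class PlumbingSketch {
    /** Hypothetical "keep a copy" stage: records every byte read through it. */
    static final class CopyingInputStream extends FilterInputStream {
        private final ByteArrayOutputStream copy = new ByteArrayOutputStream();

        CopyingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) copy.write(b);
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) copy.write(buf, off, n);
            return n;
        }

        /** Bytes seen so far; skipped bytes are not recorded in this sketch. */
        byte[] copy() {
            return copy.toByteArray();
        }
    }

    /** Compresses a byte array so the pipeline has something to read. */
    static byte[] gzip(byte[] plain) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(plain);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** source > GZIPInputStream > CopyingInputStream > consumer */
    static byte[] readThroughPipeline(byte[] compressed) {
        try (CopyingInputStream in = new CopyingInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            byte[] chunk = new byte[256];
            while (in.read(chunk) != -1) {
                // a real consumer (the cache, in the chain above) would use the chunk here
            }
            return in.copy();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each stage only knows about InputStream, so adding or removing one (versioning, logging) does not touch the others.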

cheers

phi

*already exists

On 26 Feb 2010, at 13:35, Galder Zamarreno wrote:

...
 On Fri, 26 Feb 2010 12:36:57 +0100, philippe van dyck <pvdyck(a)gmail.com>
 wrote:

> Thanks for the reentrant scenario Galder.
> 
> https://jira.jboss.org/jira/browse/ISPN-357 is now closed.
> 
> If the Marshaller is used for something other than storing cache entries,
> I don't think it is a good idea to implement compression at this level.

 To clarify, the marshaller is used for, well, marshalling (and  
 unmarshalling) :) and the marshalling is used for the following use cases:
 - Marshall/unmarshall objects to wire format for sending them to other  
 nodes in the cluster.
 - Marshall/unmarshall objects to wire format for storing them in cache  
 stores.
 - Marshall/unmarshall objects to wire format for storing them as byte[] in  
 the cache. This enables lazy deserialization.

 Now, we use the same marshaller instance for all 3 use cases, which  
 somehow explains why the API is maybe not as easy to use at first glance.  
 Some of the methods are more oriented at maybe reading from streams, such  
 as file streams, whereas others simply transform it all to byte[]. As  
 Manik said, this is a bit of legacy API coming from the JBC days. I do  
 remember looking at it and thinking whether it could be simplified  
 somehow, but didn't look into it too much since it's mostly an internal
 API. This is something that might make sense doing at some point. I don't  
 think it's urgent though.

> 
> Compression is cpu intensive, and it may be a good idea to "prepare"  
> entries in memory (with a low priority thread), like adding a  
> "compressed" flag to a cache entry.
> This way, they are ready for storage or transfer... they consume less  
> memory, but they cost much more to use (decompression time).
> 
> In fact, it is a very old tradeoff and IMO if compression should be  
> integrated in Infinispan, it is at a higher level -- and another  
> discussion.
> 
> From my point of view, S3 entries are now compressed and cost less to
> transfer and store, it was my initial goal.
> 
> cheers,
> 
> phil
> 
> 
> 
> On 26 Feb 2010, at 11:16, Galder Zamarreno wrote:
> 
>> On Thu, 25 Feb 2010 12:02:34 +0100, philippe van dyck <pvdyck(a)gmail.com>
>> wrote:
>> 
>>> Hi All,
>>> 
>>> Currently, I compress all data before sending it to the cache. Once
>>> compressed, the JSON-serialized qi4j objects shrink by 95%.
>>> 
>>> I did some profiling during the load tests and compression is taking
>>> roughly 80% of the cpu time.
>>> So I would like to compress only the data sent to the store, not in
>>> memory.
>>> 
>>> Looks like the Marshaller is my friend here, and I plan to write a
>>> compressing wrapper around it.
>>> 
>>> Now, when I look at it, I see two ways to wrap the compression process.
>>> 
>>> One way is with the ObjectInput / ObjectOutput but I am bothered by the
>>> reentrant flag.
>> 
>> As a side note, the reentrant flag is used to signal to the marshaller
>> whether several ObjectOutput/ObjectInput are open without a close, i.e.
>> --
>> marshaller.startObjectOutput(x, false)
>> marshaller.startObjectOutput(x, true) -> is reentrant, so mark it as such
>> --
>> marshaller.startObjectOutput(x, false)
>> marshaller.finishObjectOutput()
>> marshaller.startObjectOutput(x, false) -> not reentrant
>> marshaller.finishObjectOutput()
>> --
>> 
>> Why do we use this? To enable marshaller implementations to return a
>> different ObjectOutput if the call is reentrant. If you look at
>> org.infinispan.marshall.jboss.JBossMarshaller you see that the
>> ObjectOutput (or org.jboss.marshalling.Marshaller) is a ThreadLocal, but
>> JBossMarshaller does not allow for the same
>> org.jboss.marshalling.Marshaller to be opened twice. So, by using the
>> reentrant flag, we can make sure that the 2nd time that startObjectOutput
>> is called, a different one is provided.
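
A minimal sketch of the pattern described above, under simplifying assumptions: the Output class is a hypothetical stand-in for org.jboss.marshalling.Marshaller, and the method signatures are reduced compared to the real startObjectOutput/finishObjectOutput, so this is not how JBossMarshaller is actually written.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ReentrantOutputSketch {
    /** Hypothetical stand-in for a marshaller that cannot be opened twice. */
    static final class Output {
        final String label;

        Output(String label) {
            this.label = label;
        }
    }

    // One primary instance per thread, reused by every non-reentrant call.
    private final ThreadLocal<Output> primary =
            ThreadLocal.withInitial(() -> new Output("thread-local"));

    // Outputs currently open on this thread, innermost first.
    private final ThreadLocal<Deque<Output>> open =
            ThreadLocal.withInitial(ArrayDeque::new);

    /** Returns the per-thread instance, or a fresh one when the call is reentrant. */
    Output startObjectOutput(boolean isReentrant) {
        Output out = isReentrant ? new Output("reentrant") : primary.get();
        open.get().push(out);
        return out;
    }

    /** Closes the innermost open output; anything else is a usage error. */
    void finishObjectOutput(Output out) {
        Output top = open.get().pop();
        if (top != out) {
            throw new IllegalStateException("outputs must be finished innermost first");
        }
    }
}
```

With this, start(false) followed by start(true) hands out two different instances, while start(false)/finish/start(false) reuses the same per-thread one.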
>> 
>> For an example of reentrancy, see the javadoc:
>> 
>>    * <p>On the other hand, when a call is reentrant, i.e.
>>    * startObjectOutput/startObjectOutput(reentrant)...finishObjectOutput/finishObjectOutput,
>>    * the Marshaller implementation might treat it differently. An example
>>    * of reentrancy would be marshalling of {@link MarshalledValue}.
>>    * When sending or storing a MarshalledValue, a call to
>>    * startObjectOutput() would occur so that the stream is open and
>>    * following, a 2nd call could occur so that MarshalledValue's raw byte
>>    * array version is calculated and sent across.
>>    * This enables lazy deserialization on the receiver side which is a
>>    * performance gain. The Marshaller implementation could decide
>>    * that it needs a separate ObjectOutput or similar for the 2nd call
>>    * since its aim is only to get the raw byte array version
>>    * and then finish with it.</p>
>> 
>> The second reentrant call is the one to create the MarshalledValue form of
>> the in-memory data. The first call would be the stream opened to send the
>> put or get or whichever op you're sending around.
>> 
>> As a side note, using ThreadLocal is a much cleaner solution than having to
>> maintain a pool of org.jboss.marshalling.Marshaller instances.
>> 
>> Hope this clarifies further what the reentrant stuff does.
>> 
>> Cheers,
>> 
>>> The other is the ByteBuffer stuff, no concurrency problem here, but it
>>> looks like more work.
>>> 
>>> WDYT ?
>>> 
>>> Cheers,
>>> 
>>> phil
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev(a)lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> --
>> Galder Zamarreño
>> Sr. Software Engineer
>> Infinispan, JBoss Cache
>> 
> 
> 

 -- 
 Galder Zamarreño
 Sr. Software Engineer
 Infinispan, JBoss Cache

