[infinispan-dev] Compressing Marshaller Wrapper

Fri Feb 26 09:01:14 EST 2010

Agreed, no urgency at all.

But basically, you see two different types of similar API parameters here: streams (of bytes) and byte arrays.
IMHO a simple way to refactor Marshaller would be to create StreamMarshaller and ByteArrayMarshaller.
And then it is very possible that StreamMarshaller will simply fill a byte array (or use the one in a ByteArray*Stream) and call ByteArrayMarshaller.
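
To make the split concrete, here is a rough sketch of what it could look like. All type names below (StreamMarshaller, ByteArrayMarshaller, JavaByteArrayMarshaller, DelegatingStreamMarshaller) are hypothetical and do not exist in Infinispan, and plain Java serialization stands in for the real wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

// Hypothetical split of the Marshaller API; not part of Infinispan.
interface ByteArrayMarshaller {
    byte[] toBytes(Object o) throws IOException;
    Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException;
}

interface StreamMarshaller {
    void write(Object o, OutputStream out) throws IOException;
    Object read(InputStream in) throws IOException, ClassNotFoundException;
}

// Byte-array side, backed by plain Java serialization for the sake of the example.
final class JavaByteArrayMarshaller implements ByteArrayMarshaller {
    public byte[] toBytes(Object o) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(o);
        }
        return buffer.toByteArray();
    }

    public Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}

// Stream side: fills a byte array, then calls the byte-array marshaller.
final class DelegatingStreamMarshaller implements StreamMarshaller {
    private final ByteArrayMarshaller delegate;

    DelegatingStreamMarshaller(ByteArrayMarshaller delegate) {
        this.delegate = delegate;
    }

    public void write(Object o, OutputStream out) throws IOException {
        out.write(delegate.toBytes(o));
    }

    public Object read(InputStream in) throws IOException, ClassNotFoundException {
        // Drains the stream to EOF: this sketch assumes one object per stream, no framing.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return delegate.fromBytes(buffer.toByteArray());
    }
}
```

Callers that only have a byte[] would use the ByteArrayMarshaller directly, and callers that hold a stream would go through the StreamMarshaller, so neither side has to mix the two styles.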

But right now most users of Marshaller use this API asymmetrically, mixing byte arrays and streams, which lowers the architecture quality ;-)

If you only offer the stream interface (and everybody wraps their byte arrays in streams themselves) then it is very easy to add a lot of useful stream processors (like versioning / compressing / logging).

Ultimately, if you only use streams you improve concurrency too, since streams pull data out of pipes (like HTTP connectors / file readers) and the thread just waits until the stream is closed.

Then it is not so difficult to play Bob the plumber like this:

FileInputStream* > GZIPInputStream* > VersionAwareInputStream > KeepACopyInTheCacheOfAllBytesGoingThroughThisInputStream > CacheStreamApi*
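
A minimal sketch of that kind of plumbing, using only JDK streams. StreamPlumbing and CopyingInputStream are made-up names for illustration; CopyingInputStream plays the role of the "keep a copy" stage, byte array streams stand in for the file and the cache API, and the version-aware stage is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical "keep a copy" stage: records every byte read through it.
// Bytes passed over with skip() are not recorded in this sketch.
final class CopyingInputStream extends FilterInputStream {
    private final ByteArrayOutputStream copy;

    CopyingInputStream(InputStream in, ByteArrayOutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            copy.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n);
        }
        return n;
    }
}

final class StreamPlumbing {
    // Write side: ObjectOutputStream > GZIPOutputStream > sink.
    static void writeCompressed(Object value, OutputStream sink) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(sink))) {
            out.writeObject(value);
        }
    }

    // Read side: source > GZIPInputStream > CopyingInputStream > ObjectInputStream.
    static Object readCompressed(InputStream source, ByteArrayOutputStream copy)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new CopyingInputStream(new GZIPInputStream(source), copy))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeCompressed("some cache entry", sink);

        ByteArrayOutputStream seen = new ByteArrayOutputStream();
        Object value = readCompressed(new ByteArrayInputStream(sink.toByteArray()), seen);
        System.out.println(value + " (" + seen.size() + " decompressed bytes seen)");
    }
}
```

Each stage only knows about the stream it wraps, so adding or removing a processor is a one-line change at the point where the chain is built.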

cheers

phi

*already exists

On 26 Feb 2010, at 13:35, Galder Zamarreno wrote:

> On Fri, 26 Feb 2010 12:36:57 +0100, philippe van dyck <pvdyck at gmail.com>  
> wrote:
> 
>> Thanks for the reentrant scenario Galder.
>> 
>> https://jira.jboss.org/jira/browse/ISPN-357 is now closed.
>> 
>> If the Marshaller is used for something other than storing cache entries,
>> I don't think it is a good idea to implement compression at this level.
> 
> To clarify, the marshaller is used for, well, marshalling (and  
> unmarshalling) :) and the marshalling is used for the following use cases:
> - Marshall/unmarshall objects to wire format for sending them to other  
> nodes in the cluster.
> - Marshall/unmarshall objects to wire format for storing them in cache  
> stores.
> - Marshall/unmarshall objects to wire format for storing them as byte[] in  
> the cache. This enables lazy deserialization.
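
For readers unfamiliar with the third use case, lazy deserialization can be sketched roughly as below. LazyValue is a hypothetical holder, only loosely inspired by the MarshalledValue idea and not Infinispan's actual class, and plain Java serialization stands in for the real wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical holder: keeps the wire format and only builds the object on demand.
// A stored null value is not handled in this sketch.
final class LazyValue {
    private final byte[] raw;   // wire format, kept exactly as received
    private Object instance;    // filled in on the first get()

    LazyValue(byte[] raw) {
        this.raw = raw;
    }

    static LazyValue of(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return new LazyValue(buffer.toByteArray());
    }

    // The raw form can be stored or forwarded without ever deserializing.
    byte[] raw() {
        return raw;
    }

    synchronized boolean isDeserialized() {
        return instance != null;
    }

    synchronized Object get() throws IOException, ClassNotFoundException {
        if (instance == null) {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
                instance = in.readObject();
            }
        }
        return instance;
    }
}
```

A node that only relays or stores the entry can work with raw() alone and never pays the deserialization cost; only a reader calling get() does.
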
> 
> Now, we use the same marshaller instance for all 3 use cases, which  
> somehow explains why the API is maybe not as easy to use at first glance.  
> Some of the methods are more oriented at maybe reading from streams, such  
> as file streams, whereas others simply transform it all to byte[]. As  
> Manik said, this is a bit of legacy API coming from the JBC days. I do  
> remember looking at it and thinking whether it could be simplified  
> somehow, but didn't look into it too much since it's mostly an internal
> API. This is something that might make sense doing at some point. I don't  
> think it's urgent though.
> 
>> 
>> Compression is cpu intensive, and it may be a good idea to "prepare"  
>> entries in memory (with a low priority thread), like adding a  
>> "compressed" flag to a cache entry.
>> This way, they are ready for storage or transfer... they consume less  
>> memory, but they cost much more to use (decompression time).
>> 
>> In fact, it is a very old tradeoff and IMO if compression should be  
>> integrated in Infinispan, it is at a higher level -- and another  
>> discussion.
>> 
>> From my point of view, S3 entries are now compressed and cost less to
>> transfer and store, which was my initial goal.
>> 
>> cheers,
>> 
>> phil
>> 
>> 
>> 
>> On 26 Feb 2010, at 11:16, Galder Zamarreno wrote:
>> 
>>> On Thu, 25 Feb 2010 12:02:34 +0100, philippe van dyck <pvdyck at gmail.com>
>>> wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> Currently, I compress all data before sending it to the cache. Once
>>>> compressed, the JSON-serialized qi4j objects shrink by about 95%.
>>>> 
>>>> I did some profiling during the load tests and compression is taking
>>>> roughly 80% of the cpu time.
>>>> So I would like to compress only the data sent to the store, not in
>>>> memory.
>>>> 
>>>> Looks like the Marshaller is my friend here, and I plan to write a
>>>> compressing wrapper around it.
>>>> 
>>>> Now, when I look at it, I see two ways to wrap the compression process.
>>>> 
>>>> One way is with the ObjectInput / ObjectOutput but I am bothered by the
>>>> reentrant flag.
>>> 
>>> As a side note, the reentrant flag is used to signal the marshaller
>>> whether several ObjectOutput/ObjectInput are open without a close, i.e.
>>> --
>>> marshaller.startObjectOutput(x, false)
>>> marshaller.startObjectOutput(x, true) -> is reentrant, so mark it as such
>>> --
>>> marshaller.startObjectOutput(x, false)
>>> marshaller.finishObjectOutput()
>>> marshaller.startObjectOutput(x, false) -> not reentrant
>>> marshaller.finishObjectOutput()
>>> --
>>> 
>>> Why do we use this? To enable marshaller implementations to return a
>>> different ObjectOutput if the call is reentrant. If you look at
>>> org.infinispan.marshall.jboss.JBossMarshaller you see that the
>>> ObjectOutput (or org.jboss.marshalling.Marshaller) is a ThreadLocal, but
>>> JBossMarshaller does not allow for the same
>>> org.jboss.marshalling.Marshaller to be opened twice. So, by using the
>>> reentrant flag, we can make sure that the 2nd time that startObjectOutput
>>> is called, a different one is provided.
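
The mechanism described above could be sketched as follows. This is not the real JBossMarshaller: ReusableOutput is a hypothetical stand-in for org.jboss.marshalling.Marshaller (something that refuses to be opened twice), and SketchMarshaller, including its method signatures, is invented for the illustration.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical stand-in for a reusable output that cannot be opened twice.
final class ReusableOutput {
    private OutputStream target;   // non-null while open

    void start(OutputStream os) {
        if (target != null) {
            throw new IllegalStateException("already open");
        }
        target = os;
    }

    void writeByte(int b) throws IOException {
        target.write(b);
    }

    void finish() {
        target = null;
    }
}

// Hypothetical marshaller: one cached output per thread, a fresh one for reentrant calls.
final class SketchMarshaller {
    private final ThreadLocal<ReusableOutput> perThread =
            ThreadLocal.withInitial(ReusableOutput::new);

    ReusableOutput startObjectOutput(OutputStream os, boolean isReentrant) {
        // The thread-local instance may already be open, so a reentrant call gets its own.
        ReusableOutput out = isReentrant ? new ReusableOutput() : perThread.get();
        out.start(os);
        return out;
    }

    void finishObjectOutput(ReusableOutput out) {
        out.finish();
    }
}
```

Without the flag, the nested call would hit the same thread-local instance while it is still open and fail, which is exactly the case the reentrant parameter is there to avoid.
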
>>> 
>>> For an example of reentrancy, see the javadoc:
>>> 
>>>    * <p>On the other hand, when a call is reentrant, i.e.
>>>    * startObjectOutput/startObjectOutput(reentrant)...finishObjectOutput/finishObjectOutput,
>>>    * the Marshaller implementation might treat it differently. An example
>>>    * of reentrancy would be marshalling of {@link MarshalledValue}.
>>>    * When sending or storing a MarshalledValue, a call to
>>>    * startObjectOutput() would occur so that the stream is open and
>>>    * following, a 2nd call could occur so that MarshalledValue's raw byte
>>>    * array version is calculated and sent across.
>>>    * This enables lazy deserialization on the receiver side which is a
>>>    * performance gain. The Marshaller implementation could decide
>>>    * that it needs a separate ObjectOutput or similar for the 2nd call
>>>    * since its aim is only to get the raw byte array version
>>>    * and then finish with it.</p>
>>> 
>>> The second reentrant call is the one to create the MarshalledValue form of
>>> the in memory data. The first call would be the stream opened to send the
>>> put or get or whichever op you're sending around.
>>> 
>>> As a side note, using ThreadLocal is a much cleaner solution than having to
>>> maintain a pool of org.jboss.marshalling.Marshaller instances.
>>> 
>>> Hope this clarifies further what the reentrant stuff does.
>>> 
>>> Cheers,
>>> 
>>>> The other is the ByteBuffer stuff, no concurrency problem here, but it
>>>> looks like more work.
>>>> 
>>>> WDYT ?
>>>> 
>>>> Cheers,
>>>> 
>>>> phil
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>> 
>>> 
>>> --
>>> Galder Zamarreño
>>> Sr. Software Engineer
>>> Infinispan, JBoss Cache
>>> 
>> 
>> 
> 
> 
> -- 
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
> 