[infinispan-dev] Compressing Marshaller Wrapper
philippe van dyck
pvdyck at gmail.com
Fri Feb 26 09:01:14 EST 2010
Agreed, no urgency at all.
But basically, you see two different types of similar API parameters here: streams (of bytes) and byte arrays.
IMHO a simple way to refactor Marshaller would be to create StreamMarshaller and ByteArrayMarshaller.
And then it is very possible that StreamMarshaller will simply fill a byte array (or use the one in a ByteArray*Stream) and call ByteArrayMarshaller.
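To make the split concrete, here is a minimal sketch of what the two interfaces could look like. The interface names come from the suggestion above, but every method name and signature is an assumption made for illustration, not the existing Infinispan Marshaller API, and plain Java serialization stands in for the real marshalling.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

public class SplitMarshallerSketch {

    /** Hypothetical byte-array side of the split (not an Infinispan interface). */
    public interface ByteArrayMarshaller {
        byte[] toBytes(Object o) throws IOException;
        Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException;
    }

    /** Hypothetical stream side of the split (not an Infinispan interface). */
    public interface StreamMarshaller {
        void write(Object o, OutputStream out) throws IOException;
        Object read(InputStream in) throws IOException, ClassNotFoundException;
    }

    /** Byte-array marshaller backed by plain Java serialization, as a stand-in. */
    public static class JavaByteArrayMarshaller implements ByteArrayMarshaller {
        @Override
        public byte[] toBytes(Object o) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.toByteArray();
        }

        @Override
        public Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return in.readObject();
            }
        }
    }

    /** Stream marshaller that simply fills a byte array and calls the byte-array one. */
    public static class DelegatingStreamMarshaller implements StreamMarshaller {
        private final ByteArrayMarshaller delegate;

        public DelegatingStreamMarshaller(ByteArrayMarshaller delegate) {
            this.delegate = delegate;
        }

        @Override
        public void write(Object o, OutputStream out) throws IOException {
            out.write(delegate.toBytes(o));
        }

        @Override
        public Object read(InputStream in) throws IOException, ClassNotFoundException {
            // Drain the stream into a buffer, then hand the bytes to the delegate.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return delegate.fromBytes(buffer.toByteArray());
        }
    }

    public static void main(String[] args) throws Exception {
        StreamMarshaller m = new DelegatingStreamMarshaller(new JavaByteArrayMarshaller());
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        m.write("hello", sink);
        System.out.println(m.read(new ByteArrayInputStream(sink.toByteArray())));
    }
}
```

Note that read() drains the stream to its end, so this sketch assumes one object per stream; a real implementation would need some framing.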
But right now most users of Marshaller use this API asymmetrically, mixing byte arrays and streams... it lowers the architecture quality ;-)
If you only offer the stream interface (and everybody wraps their byte arrays in streams themselves) then it is very easy to add a lot of useful stream processors (like versioning / compressing / logging).
Ultimately, if you only use streams you improve concurrency too, since streams pull data out of pipes (like http connectors / file readers) and the thread waits until closed.
Then it is not so difficult to play bob the plumber like this :
FileInputStream* > GZIPInputStream* > VersionAwareInputStream > KeepACopyInTheCacheOfAllBytesGoingThroughThisInputStream > CacheStreamApi*
cheers
phi
*already exists
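A rough sketch of the plumbing above, to show how the pieces could stack. GZIPInputStream and FilterInputStream are standard JDK classes; the two filter classes, the one-byte version header and the helper methods are assumptions made up for this illustration. A ByteArrayInputStream stands in for the FileInputStream so the example is self-contained, and readAll() stands in for whatever stream API the cache would expose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamPlumbingSketch {

    /** Hypothetical filter: consumes one leading version byte and rejects other versions. */
    public static class VersionAwareInputStream extends FilterInputStream {
        public VersionAwareInputStream(InputStream in, int expectedVersion) throws IOException {
            super(in);
            int version = in.read();
            if (version != expectedVersion) {
                throw new IOException("Unexpected version: " + version);
            }
        }
    }

    /** Hypothetical filter: keeps a copy of every byte read through it. */
    public static class CopyKeepingInputStream extends FilterInputStream {
        private final ByteArrayOutputStream copy = new ByteArrayOutputStream();

        public CopyKeepingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                copy.write(b);
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                copy.write(buf, off, n);
            }
            return n;
        }

        public byte[] copy() {
            return copy.toByteArray();
        }
    }

    /** Writes the version byte followed by the payload, all gzip-compressed. */
    public static byte[] compress(int version, byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(version);
            gzip.write(payload);
        }
        return bytes.toByteArray();
    }

    /** Stand-in for the consumer at the end of the pipe: drains the stream. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] stored = compress(1, "some cache entry".getBytes(StandardCharsets.UTF_8));
        // source > gunzip > version check > copy keeper > consumer
        CopyKeepingInputStream in = new CopyKeepingInputStream(
                new VersionAwareInputStream(
                        new GZIPInputStream(new ByteArrayInputStream(stored)), 1));
        readAll(in);
        in.close();
        System.out.println(new String(in.copy(), StandardCharsets.UTF_8));
    }
}
```

Each stage only knows about the InputStream it wraps, so adding or removing a processor (logging, for instance) should not affect the others.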
On 26 Feb 2010, at 13:35, Galder Zamarreno wrote:
> On Fri, 26 Feb 2010 12:36:57 +0100, philippe van dyck <pvdyck at gmail.com>
> wrote:
>
>> Thanks for the reentrant scenario Galder.
>>
>> https://jira.jboss.org/jira/browse/ISPN-357 is now closed.
>>
>> If the Marshaller is used for something else than storing cache entries,
>> I don't think it is a good idea to implement compression at this level.
>
> To clarify, the marshaller is used for, well, marshalling (and
> unmarshalling) :) and the marshalling is used for the following use cases:
> - Marshall/unmarshall objects to wire format for sending them to other
> nodes in the cluster.
> - Marshall/unmarshall objects to wire format for storing them in cache
> stores.
> - Marshall/unmarshall objects to wire format for storing them as byte[] in
> the cache. This enables lazy deserialization.
>
> Now, we use the same marshaller instance for all 3 use cases, which
> somehow explains why the API is maybe not as easy to use at first glance.
> Some of the methods are more oriented at maybe reading from streams, such
> as file streams, whereas others simply transform it all to byte[]. As
> Manik said, this is a bit of legacy API coming from the JBC days. I do
> remember looking at it and thinking whether it could be simplified
> somehow, but didn't look into it too much since it's mostly an internal
> API. This is something that might make sense doing at some point. I don't
> think it's urgent though.
>
>>
>> Compression is cpu intensive, and it may be a good idea to "prepare"
>> entries in memory (with a low priority thread), like adding a
>> "compressed" flag to a cache entry.
>> This way, they are ready for storage or transfer... they consume less
>> memory, but they cost much more to use (decompression time).
>>
>> In fact, it is a very old tradeoff and IMO if compression should be
>> integrated in Infinispan, it is at a higher level -- and another
>> discussion.
>>
>> From my point of view, S3 entries are now compressed and cost less to
>> transfer and store; that was my initial goal.
>>
>> cheers,
>>
>> phil
>>
>>
>>
>> On 26 Feb 2010, at 11:16, Galder Zamarreno wrote:
>>
>>> On Thu, 25 Feb 2010 12:02:34 +0100, philippe van dyck <pvdyck at gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Currently, I compress all data before sending it to the cache. Once
>>>> compressed, I gain 95% of the JSonized qi4j objects.
>>>>
>>>> I did some profiling during the load tests and compression is taking
>>>> roughly 80% of the cpu time.
>>>> So I would like to compress only the data sent to the store, not in
>>>> memory.
>>>>
>>>> Looks like the Marshaller is my friend here, and I plan to write a
>>>> compressing wrapper around it.
>>>>
>>>> Now, when I look at it, I see two ways to wrap the compression process.
>>>>
>>>> One way is with the ObjectInput / ObjectOutput but I am bothered by the
>>>> reentrant flag.
>>>
>>> As a side note, the reentrant flag is used to signal the marshaller
>>> whether several ObjectOutput/ObjectInput are open without a close, i.e.
>>> --
>>> marshaller.startObjectOutput(x, false)
>>> marshaller.startObjectOutput(x, true) -> is reentrant, so mark it as such
>>> --
>>> marshaller.startObjectOutput(x, false)
>>> marshaller.finishObjectOutput()
>>> marshaller.startObjectOutput(x, false) -> not reentrant
>>> marshaller.finishObjectOutput()
>>> --
>>>
>>> Why do we use this? To enable marshaller implementations to return a
>>> different ObjectOutput if the call is reentrant. If you look at
>>> org.infinispan.marshall.jboss.JBossMarshaller you see that the
>>> ObjectOutput (or org.jboss.marshalling.Marshaller) is a ThreadLocal, but
>>> JBossMarshaller does not allow for the same
>>> org.jboss.marshalling.Marshaller to be opened twice. So, by using the
>>> reentrant flag, we can make sure that the 2nd time that
>>> startObjectOutput is called, a different one is provided.
>>>
>>> For an example of reentrancy, see the javadoc:
>>>
>>> * <p>On the other hand, when a call is reentrant, i.e.
>>> * startObjectOutput/startObjectOutput(reentrant)...finishObjectOutput/finishObjectOutput,
>>> * the Marshaller implementation might treat it differently. An example
>>> * of reentrancy would be marshalling of {@link MarshalledValue}.
>>> * When sending or storing a MarshalledValue, a call to
>>> * startObjectOutput() would occur so that the stream is open and
>>> * following, a 2nd call could occur so that MarshalledValue's raw byte
>>> * array version is calculated and sent across.
>>> * This enables lazy deserialization on the receiver side which is
>>> * a performance gain. The Marshaller implementation could decide
>>> * that it needs a separate ObjectOutput or similar for the 2nd call
>>> * since its aim is only to get the raw byte array version
>>> * and the close finish with it.</p>
>>>
>>> The second reentrant call is the one to create the MarshalledValue form
>>> of the in memory data. The first call would be the stream opened to send
>>> the put or get or whichever op you're sending around.
>>>
>>> As a side note, using ThreadLocal is a much cleaner solution than having
>>> to maintain a pool of org.jboss.marshalling.Marshaller instances.
>>>
>>> Hope this clarifies further what the reentrant stuff does.
>>>
>>> Cheers,
>>>
>>>> The other is the ByteBuffer stuff, no concurrency problem here, but it
>>>> looks like more work.
>>>>
>>>> WDYT ?
>>>>
>>>> Cheers,
>>>>
>>>> phil
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>>
>>> --
>>> Galder Zamarreño
>>> Sr. Software Engineer
>>> Infinispan, JBoss Cache
>>>
>>
>>
>
>
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>