[infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Manik Surtani manik at jboss.org
Fri Apr 15 10:48:39 EDT 2011


+1, I don't think we should over-complicate this by mandating that chunks land on different nodes.  Let the distribution code handle it.  If, at a later date, we see that the system frequently fails because too many chunks end up on certain nodes, we can revisit.  But that would be an implementation detail.

Personally, I think that with virtual nodes there is a high chance of distribution taking care of this.

On 4 Apr 2011, at 10:43, Sanne Grinovero wrote:

> I don't think you should make it too complex by looking at available
> memory; you have the same issue when storing many different keys in
> Infinispan in any mode, but we never worry about that, relying instead
> on the spreading quality of the hash function. Of course the total
> available heap size must be able to hold all the values, plus the
> replicas, plus some extra % because the hash function is not
> perfect; in effect you can always define some spill-over to
> CacheLoaders.
> 
> The fact that some nodes will have less memory available will be
> solved by the virtual nodes patch, if you refer to bigger vs. smaller
> machines in the same cluster.
> 
> If you make sure the file is split into "many" chunks, they will be
> randomly distributed, and that should be good enough for this purpose;
> the definition of "many" can be a configuration option or a
> method parameter passed during the store.
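>
> To make that concrete, a rough sketch of such a store operation (ChunkKey
> and LargeObjectMetadata are made-up names, not existing API; cache and
> chunkSize are assumed to be a field and a configuration option):
>
>    void writeToKey(Object key, InputStream largeObject) throws IOException {
>       byte[] buffer = new byte[chunkSize];
>       int chunks = 0;
>       int read;
>       // note: a real implementation would fill each buffer completely
>       while ((read = largeObject.read(buffer)) != -1) {
>          // every chunk lives under its own key, so the hash spreads them
>          cache.put(new ChunkKey(key, chunks++), Arrays.copyOf(buffer, read));
>       }
>       // record how the object was chunked - see point 1) below
>       cache.put(key, new LargeObjectMetadata(chunks, chunkSize));
>    }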
> 
> There's something similar happening in the Lucene Directory code,
> these are some issues I had to consider:
> 
> 1) Make sure you store a metadata object with the configuration details
> used, such as the number and size of chunks, so that if the chunk size
> is configurable and the cluster is restarted with a different
> configuration, you are still able to retrieve the stream correctly.
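>
> Such a metadata object can be trivial; just as an example (again a made-up
> class, stored under the large object's own key while the chunks live under
> per-chunk keys):
>
>    class LargeObjectMetadata implements Serializable {
>       final int numberOfChunks;   // how many chunk keys to fetch on read
>       final int chunkSizeInBytes; // the chunk size used at write time
>
>       LargeObjectMetadata(int numberOfChunks, int chunkSizeInBytes) {
>          this.numberOfChunks = numberOfChunks;
>          this.chunkSizeInBytes = chunkSizeInBytes;
>       }
>    }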
> 
> 2) There might be concurrency issues while one thread/node is
> streaming it, and another one is deleting or replacing it. Infinispan
> provides you with consistency at a key level, but as you're dealing
> with multiple keys, you might get a view composed of chunks from
> different transactions.
> 
> You'll have to think about how to solve 2). I guess you could store a
> version number in the metadata object mentioned in 1) and have all
> modified keys carry the version they refer to. Garbage collection
> would be tricky, as at some point you want to delete chunks no longer
> referred to by any node, including nodes that crashed without
> explicitly releasing anything.
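>
> Concretely, the ChunkKey sketched above could additionally embed that
> version, something like:
>
>    class ChunkKey implements Serializable {
>       final Object largeObjectKey;
>       final long version; // bumped on every rewrite of the large object
>       final int index;    // position of the chunk within that version
>
>       ChunkKey(Object largeObjectKey, long version, int index) {
>          this.largeObjectKey = largeObjectKey;
>          this.version = version;
>          this.index = index;
>       }
>       // equals() and hashCode() over all three fields are needed for this
>       // to work as a cache key; omitted here for brevity.
>    }
>
> Readers holding an old metadata object would then still see a consistent
> chunk set, and superseded versions stay addressable for later cleanup.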
> 
> Sanne
> 
> 
> 2011/4/4 Galder Zamarreño <galder at redhat.com>:
>> 
>> On Mar 31, 2011, at 7:46 AM, Olaf Bergner wrote:
>> 
>>> On 30.03.11 02:32, Elias Ross wrote:
>>>> I think it'd be BEST if you could support both models. I would add:
>>>> 
>>>> interface Cache {
>>>>   /**
>>>>    * Returns a new or existing LargeObject object for the given key.
>>>>    * @throws ClassCastException if the key exists and is not a LargeObject.
>>>>    */
>>>>   LargeObject largeObject(K key);
>>>> }
>>> OK, I'll keep that on my todo list, but for the time being I've opted to
>>> start with implementing void writeToKey(K key, InputStream largeObject).
>>>>> This is certainly doable but leaves me wondering where that proposed
>>>>> ChunkingInterceptor might come into play.
>>>> I would think ideally you don't need to create any new commands. Less
>>>> protocol messages is better.
>>> It is my understanding that PutKeyValueCommand will *always* attempt to
>>> read the current value stored under the given key first. I'm not sure
>>> we want that in our situation, where the current value may be several GB
>>> in size. Anyway, it should be easy to refactor if reusing
>>> PutKeyValueCommand proves viable.
>> 
>> The only reason it reads the previous value is to return it as part of the contract of "V put(K, V)" - but that can be skipped.
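>>
>> For the chunk puts that would look something like this (assuming the flag
>> keeps its current semantics):
>>
>>    // Skip fetching the potentially huge previous value that a plain put()
>>    // would retrieve only to honour the V put(K, V) return contract.
>>    cache.getAdvancedCache()
>>         .withFlags(Flag.SKIP_REMOTE_LOOKUP)
>>         .put(chunkKey, chunkBytes);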
>> 
>>>>> 3. The design suggests using a fresh UUID as the key for each new
>>>>> chunk. While this in all likelihood gives us a unique new key for each
>>>>> chunk, I currently fail to see how it guarantees that this key maps to
>>>>> a node that is different from all the nodes already used to store chunks
>>>>> of the same Large Object. But then again, I know next to nothing about
>>>>> Infinispan's consistent hashing algorithm.
>>>> I wouldn't use a UUID. I'd just store (K, #) where # is the chunk number.
>>>> 
>>> Since this is important and might reveal a fundamental misunderstanding
>>> on my part, I need to sort this out before moving on. These are my
>>> assumptions; please point out any errors:
>>> 
>>> 1. We want to partition a large object into chunks since, by definition,
>>> a large object is too big to be stored in a single node in the cluster.
>>> It follows that it is paramount that no two chunks be stored in the same
>>> node, correct?
>> 
>> No. The idea is that the whole object should not end up being stored in a single JVM, but nothing should stop you from storing two chunks of the same object in the same node.
>> 
>> What we somehow need to avoid is chunks ending up in nodes that do not have enough memory to store them, and that could complicate things.
>> 
>>> 
>>> 2. Consistent hashing guarantees that any given key maps to *some* node in
>>> the cluster. There is no way, however, for the key's creator to know
>>> exactly which node the key maps to. In other words, there is no inverse
>>> of the hash function, correct?
>> 
>> I vaguely remember a consistent hash algorithm that, given a node where data should be stored, generates a key mapping to that node (Mircea, did you create this?). It could work in conjunction with my previous point, assuming a node knew how much memory is available on the other nodes, but that would require some thinking.
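>>
>> If that is the KeyAffinityService I'm thinking of, usage would be roughly
>> along these lines (a sketch from memory, the exact signatures might differ):
>>
>>    // Generates keys that the consistent hash maps to a chosen node.
>>    KeyAffinityService<Object> affinity = KeyAffinityServiceFactory
>>          .newKeyAffinityService(cache, Executors.newSingleThreadExecutor(),
>>                                 new RndKeyGenerator(), 100);
>>    Object keyOwnedByNode = affinity.getKeyForAddress(targetNodeAddress);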
>> 
>> 
>>> 
>>> 3. The current design mandates that the existing put(key, value) be
>>> reused for storing each chunk, correct?
>>> 
>>> It follows that we have no way whatsoever of generating a set of keys
>>> that guarantees that no two keys are mapped to the same node. In the
>>> pathological case, *all* keys map to the same node, correct?
>> 
>> See my previous point.
>> 
>> 
>>>>> I would think a use case for this API would be streaming audio or
>>>>> video, maybe something like access logs even?
>>>>> 
>>>>> In which case, you would want to read while you're writing. So,
>>>>> locking shouldn't be imposed. I would say, rely on the transaction
>>>>> manager to keep a consistent view. If transactions aren't being used,
>>>>> then the user might see some unexpected behavior. The API could
>>>>> compensate for that.
>>>>> 
>>> If I understand you correctly, you propose two alternatives:
>>> 
>>> 1. Use transactions, thus delegating all consistency requirements to the
>>> transaction manager.
>>> 
>>> 2. Don't use transactions and change the API so that readers may be told
>>> that a large object they are interested in is currently being written.
>>> 
>>> Further, to support streaming use cases you propose that it should be
>>> possible to read a large object while it is being written.
>>> 
>>> Is that correct?
>>> 
>>> Hmm, I need to think about this. If I understand Manik's comment and the
>>> tx subsystem correctly, each transaction holds its *entire* associated
>>> state in memory. Thus, if we write all chunks of a given large
>>> object within the scope of a single transaction, we will blow up the
>>> originator node's heap. Correct?
>> 
>> Hmmmm, maybe what's needed here is a mix of the two. You want the metadata to be transactional: while you are writing and chunking an object and keep updating the metadata object, those updates are transactionally protected, so no one can read the metadata in the meantime. The actual chunk writes in the cache, however, could be non-transactional, so that chunks do not pile up in the transaction context.
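>>
>> As a sketch of that mix (assuming the chunks go into a separate
>> non-transactional cache, only the metadata cache is transactional, and
>> chunkKeyFor() stands for whatever per-chunk key scheme ends up being used):
>>
>>    // Chunk writes bypass the transaction so they never pile up in the
>>    // transaction context; only the small metadata update is transactional,
>>    // so a reader sees either the old or the new metadata, never a mixture.
>>    void store(Object key, List<byte[]> chunks, int chunkSize) throws Exception {
>>       for (int i = 0; i < chunks.size(); i++)
>>          chunkCache.put(chunkKeyFor(key, i), chunks.get(i)); // non-transactional
>>
>>       TransactionManager tm = metadataCache.getAdvancedCache().getTransactionManager();
>>       tm.begin();
>>       metadataCache.put(key, new LargeObjectMetadata(chunks.size(), chunkSize));
>>       tm.commit(); // rollback/error handling omitted for brevity
>>    }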
>> 
>>> 
>>> So many questions ...
>>> 
>>> Cheers,
>>> Olaf
>>> 
>> 
>> --
>> Galder Zamarreño
>> Sr. Software Engineer
>> Infinispan, JBoss Cache
>> 
>> 
> 

--
Manik Surtani
manik at jboss.org
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org





