[infinispan-dev] About size()

Thu Oct 9 11:46:44 EDT 2014

On Oct 8, 2014, at 16:19, Radim Vansa <rvansa at redhat.com> wrote:

> Users expect that size() will be constant-time (or linear to cluster 
> size), and generally fast operation. I'd prefer to keep it that way. 
> Though, even the MR way (used for HotRod size() now) needs to crawl 
> through all the entries locally.

yes, but first of all they expect size to be correct, then fast. 

> 
> 'Heretic, not very well though of and changing too many things' idea: 
> what about having data container segment-aware? Then you'd just bcast 
> SizeCommand with given topologyId and sum up sizes of primary-owned 
> segments... It's not a complete solution, but at least that would enable 
> to get the number of locally owned entries quite fast. Though, you can't 
> do that easily with cache stores (without changing SPI).

that would help and there were discussions to do this for other reasons as well: e.g. ST would migrate the data without iterating over the state in the DC.
Not doable in the scope of ISPN 7.0, though.

> 
> Regarding cache stores, IMO we're damned anyway: when calling 
> cacheStore.size(), it can report more entries as those haven't been 
> expired yet, it can report less entries as those can be expired due to 
> [1]. Or, we'll enumerate all the entries, and that's going to be slow 
> (btw., [1] reminded me that we should enumerate both datacontainer AND 
> cachestores even if passivation is not enabled).
> 
> Radim
> 
> [1] https://issues.jboss.org/browse/ISPN-3202
> 
> On 10/08/2014 04:42 PM, William Burns wrote:
>> So it seems we would want to change this for 7.0 if possible since it
>> would be a bigger change for something like 7.1 and 8.0 would be even
>> further out.  I should be able to put this together for CR2.
>> 
>> It seems that we want to implement keySet, values and entrySet methods
>> using the entry iterator approach.
>> 
>> It is however unclear for the size method if we want to use MR entry
>> counting and not worry about the rehash and passivation issues since
>> it is just an estimation anyways.  Or if we want to also use the entry
>> iterator which should be closer approximation but will require more
>> network overhead and memory usage.
>> 
>> Also we didn't really talk about the fact that these methods would
>> ignore ongoing transactions and if that is a concern or not.
>> 
>>  - Will
>> 
>> On Wed, Oct 8, 2014 at 10:13 AM, Mircea Markus <mmarkus at redhat.com> wrote:
>>> On Oct 8, 2014, at 15:11, Dan Berindei <dan.berindei at gmail.com> wrote:
>>> 
>>>> On Wed, Oct 8, 2014 at 5:03 PM, Mircea Markus <mmarkus at redhat.com> wrote:
>>>> On Oct 3, 2014, at 9:30, Radim Vansa <rvansa at redhat.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> recently we had a discussion about what size() returns, but I've
>>>>> realized there are more things that users would like to know. My
>>>>> question is whether you think that they would really appreciate it, or
>>>>> whether it's just my QA point of view where I sometimes compute the
>>>>> 'checksums' of cache to see if I didn't lost anything.
>>>>> 
>>>>> There are those sizes:
>>>>> A) number of owned entries
>>>>> B) number of entries stored locally in memory
>>>>> C) number of entries stored in each local cache store
>>>>> D) number of entries stored in each shared cache store
>>>>> E) total number of entries in cache
>>>>> 
>>>>> So far, we can get
>>>>> B via withFlags(SKIP_CACHE_LOAD).size()
>>>>> (passivation ? B : 0) + firstNonZero(C, D) via size()
>>>>> E via distributed iterators / MR
>>>>> A via data container iteration + distribution manager query, but only
>>>>> without cache store
>>>>> C or D through
>>>>> getComponentRegistry().getLocalComponent(PersistenceManager.class).getStores()
>>>>> 
>>>>> I think that it would go along with users' expectations if size()
>>>>> returned E and for the rest we should have special methods on
>>>>> AdvancedCache. That would of course change the meaning of size(), but
>>>>> I'd say that finally to something that has firm meaning.
>>>>> 
>>>>> WDYT?
>>>> There was a lot of arguments in past whether size() and other methods that operate over all the elements (keySet, values) are useful because:
>>>> - they are approximate (data changes during iteration)
>>>> - they are very resource consuming and might be miss-used (this is the reason we chosen to use size() with its current local semantic)
>>>> 
>>>> These methods (size, keys, values) are useful for people and I think we were not wise to implement them only on top of the local data: this is like preferring efficiency over correctness. This also created a lot of confusion with our users, question like size() doesn't return the correct value being asked regularly. I totally agree that size() returns E (i.e. everything that is stored within the grid, including persistence) and it's performance implications to be documented accordingly. For keySet and values - we should stop implementing them (throw exception) and point users to Will's distributed iterator which is a nicer way to achieve the desired behavior.
>>>> 
>>>> We can also implement keySet() and values() on top of the distributed entry iterator and document that using the iterator directly is better.
>>> Yes, that's what I meant as well.
>>> 
>>> Cheers,
>>> --
>>> Mircea Markus
>>> Infinispan lead (www.infinispan.org)
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> 
> 
> -- 
> Radim Vansa <rvansa at redhat.com>
> JBoss DataGrid QA
> 
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)