On Wed, Oct 8, 2014 at 6:19 PM, Radim Vansa <rvansa(a)redhat.com> wrote:
Users expect that size() will be constant-time (or linear to cluster
size), and generally fast operation. I'd prefer to keep it that way.
Though, even the MR way (used for HotRod size() now) needs to crawl
through all the entries locally.
They might expect that, but there is nothing in the Map API suggesting it.
'Heretic, not very well though of and changing too many things' idea:
what about having data container segment-aware? Then you'd just bcast
SizeCommand with given topologyId and sum up sizes of primary-owned
segments... It's not a complete solution, but at least that would enable
to get the number of locally owned entries quite fast. Though, you can't
do that easily with cache stores (without changing SPI).
We could create a separate DataContainer for each segment. But would it
really be worth the trouble? I don't know of anyone using size() for
something other than checking that their data was properly loaded into the
cache, and they don't need a super-fast size() for that.
Regarding cache stores, IMO we're damned anyway: when calling
cacheStore.size(), it can report more entries as those haven't been
expired yet, it can report less entries as those can be expired due to
[1]. Or, we'll enumerate all the entries, and that's going to be slow
(btw., [1] reminded me that we should enumerate both datacontainer AND
cachestores even if passivation is not enabled).
Exactly, we need to iterate all the entries from the stores if we want
something remotely accurate (although I had forgotten about expiration also
being a problem). Otherwise we could just leave size() as it is now, it's
pretty fast :)
Radim
[1]
https://issues.jboss.org/browse/ISPN-3202
On 10/08/2014 04:42 PM, William Burns wrote:
> So it seems we would want to change this for 7.0 if possible since it
> would be a bigger change for something like 7.1 and 8.0 would be even
> further out. I should be able to put this together for CR2.
>
> It seems that we want to implement keySet, values and entrySet methods
> using the entry iterator approach.
>
> It is however unclear for the size method if we want to use MR entry
> counting and not worry about the rehash and passivation issues since
> it is just an estimation anyways. Or if we want to also use the entry
> iterator which should be closer approximation but will require more
> network overhead and memory usage.
>
> Also we didn't really talk about the fact that these methods would
> ignore ongoing transactions and if that is a concern or not.
>
> - Will
>
> On Wed, Oct 8, 2014 at 10:13 AM, Mircea Markus <mmarkus(a)redhat.com>
wrote:
>> On Oct 8, 2014, at 15:11, Dan Berindei <dan.berindei(a)gmail.com> wrote:
>>
>>> On Wed, Oct 8, 2014 at 5:03 PM, Mircea Markus <mmarkus(a)redhat.com>
wrote:
>>> On Oct 3, 2014, at 9:30, Radim Vansa <rvansa(a)redhat.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> recently we had a discussion about what size() returns, but I've
>>>> realized there are more things that users would like to know. My
>>>> question is whether you think that they would really appreciate it, or
>>>> whether it's just my QA point of view where I sometimes compute the
>>>> 'checksums' of cache to see if I didn't lost anything.
>>>>
>>>> There are those sizes:
>>>> A) number of owned entries
>>>> B) number of entries stored locally in memory
>>>> C) number of entries stored in each local cache store
>>>> D) number of entries stored in each shared cache store
>>>> E) total number of entries in cache
>>>>
>>>> So far, we can get
>>>> B via withFlags(SKIP_CACHE_LOAD).size()
>>>> (passivation ? B : 0) + firstNonZero(C, D) via size()
>>>> E via distributed iterators / MR
>>>> A via data container iteration + distribution manager query, but only
>>>> without cache store
>>>> C or D through
>>>>
getComponentRegistry().getLocalComponent(PersistenceManager.class).getStores()
>>>>
>>>> I think that it would go along with users' expectations if size()
>>>> returned E and for the rest we should have special methods on
>>>> AdvancedCache. That would of course change the meaning of size(), but
>>>> I'd say that finally to something that has firm meaning.
>>>>
>>>> WDYT?
>>> There was a lot of arguments in past whether size() and other methods
that operate over all the elements (keySet, values) are useful because:
>>> - they are approximate (data changes during iteration)
>>> - they are very resource consuming and might be miss-used (this is the
reason we chosen to use size() with its current local semantic)
>>>
>>> These methods (size, keys, values) are useful for people and I think
we were not wise to implement them only on top of the local data: this is
like preferring efficiency over correctness. This also created a lot of
confusion with our users, question like size() doesn't return the correct
value being asked regularly. I totally agree that size() returns E (i.e.
everything that is stored within the grid, including persistence) and it's
performance implications to be documented accordingly. For keySet and
values - we should stop implementing them (throw exception) and point users
to Will's distributed iterator which is a nicer way to achieve the desired
behavior.
>>>
>>> We can also implement keySet() and values() on top of the distributed
entry iterator and document that using the iterator directly is better.
>> Yes, that's what I meant as well.
>>
>> Cheers,
>> --
>> Mircea Markus
>> Infinispan lead (
www.infinispan.org)
>>
>>
>>
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev(a)lists.jboss.org
>>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
--
Radim Vansa <rvansa(a)redhat.com>
JBoss DataGrid QA
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev