Re: [infinispan-dev] About size()

Friday, 10 October 2014

Exactly, in a monitoring application you wouldn't need the exact number of
key-value mappings in the cache.

The number of entries in memory and/or on disk should be much more
interesting, and we wouldn't have to worry about duplicated/missing/expired
entries to show that.

On Fri, Oct 10, 2014 at 5:06 PM, Tristan Tarrant <ttarrant(a)redhat.com>
wrote:

...
 What's wrong with sum(DataContainer.size())/numOwners?

 Tristan
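
As a minimal sketch of the estimate above, assuming each node can report its local in-memory entry count, the arithmetic looks like this in Java. The class and method names are hypothetical and are not Infinispan API. The sketch deliberately ignores entries that are duplicated or missing during a rehash, expired entries, and anything held only in a cache store:

```java
final class SizeEstimate {
    private SizeEstimate() {}

    /**
     * Hypothetical helper: approximates the number of distinct entries in a
     * distributed cache by summing the per-node in-memory counts and dividing
     * by the number of copies (owners) kept for each entry.
     */
    static long estimate(int numOwners, long... perNodeSizes) {
        if (numOwners <= 0) {
            throw new IllegalArgumentException("numOwners must be positive");
        }
        long total = 0;
        for (long size : perNodeSizes) {
            total += size;
        }
        // Integer division: the result is an approximation, not an exact count.
        return total / numOwners;
    }

    public static void main(String[] args) {
        // Three nodes holding 40, 35 and 45 entries, with numOwners = 2.
        System.out.println(estimate(2, 40, 35, 45)); // prints 60
    }
}
```

If the cluster has fewer nodes than numOwners, the divisor would have to be the actual number of copies per entry rather than the configured value.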

 On 10/10/14 16:03, Radim Vansa wrote:
 > On 10/10/2014 02:38 PM, William Burns wrote:
 >> On Wed, Oct 8, 2014 at 11:19 AM, Radim Vansa <rvansa(a)redhat.com> wrote:
 >>> Users expect that size() will be constant-time (or linear in cluster
 >>> size), and generally a fast operation. I'd prefer to keep it that way.
 >>> Though, even the MR way (used for HotRod size() now) needs to crawl
 >>> through all the entries locally.
 >> Many in-memory collections, such as ConcurrentLinkedQueue, need O(n)
 >> time to compute size, so I wouldn't say size should always be
 >> expected to be constant time or O(c) where c is # of nodes.  Granted a
 >> user can expect anything they want.
 > OK, I stand corrected. Moreover, I was generalizing myself to all users,
 > a common mistake :)
 >
 > Anyway, monitoring tools love nice charts, and I can imagine monitoring
 > software polling every 1 second to update that cool chart with cache
 > size. Do we want a fast but imprecise variant of this operation in some
 > statistics class?
 >
 > Radim
 >
 >>> 'Heretic, not very well thought of and changing too many things' idea:
 >>> what about making the data container segment-aware? Then you'd just bcast
 >>> SizeCommand with given topologyId and sum up sizes of primary-owned
 >>> segments... It's not a complete solution, but at least that would let us
 >>> get the number of locally owned entries quite fast. Though, you can't
 >>> do that easily with cache stores (without changing SPI).
 >>>
 >>> Regarding cache stores, IMO we're damned anyway: when calling
 >>> cacheStore.size(), it can report more entries as those haven't been
 >>> expired yet, it can report fewer entries as those can be expired due to
 >>> [1]. Or, we'll enumerate all the entries, and that's going to be slow
 >>> (btw., [1] reminded me that we should enumerate both datacontainer AND
 >>> cachestores even if passivation is not enabled).
 >> This is precisely what the distributed iterator does.  And also
 >> support for expired entries was recently integrated as I missed that
 >> in the original implementation [a]
 >>
 >> [a] https://issues.jboss.org/browse/ISPN-4643
 >>
 >>> Radim
 >>>
 >>> [1] https://issues.jboss.org/browse/ISPN-3202
 >>>
 >>> On 10/08/2014 04:42 PM, William Burns wrote:
 >>>> So it seems we would want to change this for 7.0 if possible since it
 >>>> would be a bigger change for something like 7.1 and 8.0 would be even
 >>>> further out.  I should be able to put this together for CR2.
 >>>>
 >>>> It seems that we want to implement keySet, values and entrySet methods
 >>>> using the entry iterator approach.
 >>>>
 >>>> It is however unclear for the size method if we want to use MR entry
 >>>> counting and not worry about the rehash and passivation issues since
 >>>> it is just an estimation anyway.  Or if we want to also use the entry
 >>>> iterator, which should be a closer approximation but will require more
 >>>> network overhead and memory usage.
 >>>>
 >>>> Also we didn't really talk about the fact that these methods would
 >>>> ignore ongoing transactions and if that is a concern or not.
 >>>>
 >>>>     - Will
 >>>>
 >>>> On Wed, Oct 8, 2014 at 10:13 AM, Mircea Markus <mmarkus(a)redhat.com> wrote:
 >>>>> On Oct 8, 2014, at 15:11, Dan Berindei <dan.berindei(a)gmail.com> wrote:
 >>>>>
 >>>>>> On Wed, Oct 8, 2014 at 5:03 PM, Mircea Markus <mmarkus(a)redhat.com> wrote:
 >>>>>> On Oct 3, 2014, at 9:30, Radim Vansa <rvansa(a)redhat.com> wrote:
 >>>>>>
 >>>>>>> Hi,
 >>>>>>>
 >>>>>>> recently we had a discussion about what size() returns, but I've
 >>>>>>> realized there are more things that users would like to know. My
 >>>>>>> question is whether you think that they would really appreciate it, or
 >>>>>>> whether it's just my QA point of view where I sometimes compute the
 >>>>>>> 'checksums' of cache to see if I didn't lose anything.
 >>>>>>>
 >>>>>>> There are those sizes:
 >>>>>>> A) number of owned entries
 >>>>>>> B) number of entries stored locally in memory
 >>>>>>> C) number of entries stored in each local cache store
 >>>>>>> D) number of entries stored in each shared cache store
 >>>>>>> E) total number of entries in cache
 >>>>>>>
 >>>>>>> So far, we can get
 >>>>>>> B via withFlags(SKIP_CACHE_LOAD).size()
 >>>>>>> (passivation ? B : 0) + firstNonZero(C, D) via size()
 >>>>>>> E via distributed iterators / MR
 >>>>>>> A via data container iteration + distribution manager query, but only
 >>>>>>> without cache store
 >>>>>>> C or D through
 >>>>>>> getComponentRegistry().getLocalComponent(PersistenceManager.class).getStores()
 >>>>>>>
 >>>>>>> I think that it would go along with users' expectations if size()
 >>>>>>> returned E and for the rest we should have special methods on
 >>>>>>> AdvancedCache. That would of course change the meaning of size(), but
 >>>>>>> I'd say that finally to something that has a firm meaning.
 >>>>>>>
 >>>>>>> WDYT?
 >>>>>> There were a lot of arguments in the past about whether size() and
 >>>>>> other methods that operate over all the elements (keySet, values) are
 >>>>>> useful, because:
 >>>>>> - they are approximate (data changes during iteration)
 >>>>>> - they are very resource consuming and might be misused (this is the
 >>>>>>   reason we chose to use size() with its current local semantics)
 >>>>>>
 >>>>>> These methods (size, keys, values) are useful for people and I think we
 >>>>>> were not wise to implement them only on top of the local data: this is
 >>>>>> like preferring efficiency over correctness. This also created a lot of
 >>>>>> confusion with our users, with questions like "size() doesn't return the
 >>>>>> correct value" being asked regularly. I totally agree that size() should
 >>>>>> return E (i.e. everything that is stored within the grid, including
 >>>>>> persistence) and that its performance implications should be documented
 >>>>>> accordingly. For keySet and values - we should stop implementing them
 >>>>>> (throw exception) and point users to Will's distributed iterator, which
 >>>>>> is a nicer way to achieve the desired behavior.
 >>>>>>
 >>>>>> We can also implement keySet() and values() on top of the distributed
 >>>>>> entry iterator and document that using the iterator directly is better.
 >>>>> Yes, that's what I meant as well.
 >>>>>
 >>>>> Cheers,
 >>>>> --
 >>>>> Mircea Markus
 >>>>> Infinispan lead (www.infinispan.org)
 >>>>>
 >>>>>
 >>>>>
 >>>>>
 >>>>>
 >>>>> _______________________________________________
 >>>>> infinispan-dev mailing list
 >>>>> infinispan-dev(a)lists.jboss.org
 >>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
 >>> --
 >>> Radim Vansa <rvansa(a)redhat.com>
 >>> JBoss DataGrid QA
 >>>
 >

