[infinispan-dev] About size()

Thu Oct 9 02:48:38 EDT 2014

On Wed, Oct 8, 2014 at 7:53 PM, William Burns <mudokonman at gmail.com> wrote:

> On Wed, Oct 8, 2014 at 12:41 PM, Dan Berindei <dan.berindei at gmail.com>
> wrote:
> >
> >
> > On Wed, Oct 8, 2014 at 6:19 PM, Radim Vansa <rvansa at redhat.com> wrote:
> >>
> >> Users expect that size() will be constant-time (or linear to cluster
> >> size), and generally fast operation. I'd prefer to keep it that way.
> >> Though, even the MR way (used for HotRod size() now) needs to crawl
> >> through all the entries locally.
> >
> >
> > They might expect that, but there is nothing in the Map API suggesting
> it.
> >
> >>
> >>
> >> 'Heretic, not very well though of and changing too many things' idea:
> >> what about having data container segment-aware? Then you'd just bcast
> >> SizeCommand with given topologyId and sum up sizes of primary-owned
> >> segments... It's not a complete solution, but at least that would enable
> >> to get the number of locally owned entries quite fast. Though, you can't
> >> do that easily with cache stores (without changing SPI).
> >
> >
> > We could create a separate DataContainer for each segment. But would it
> > really be worth the trouble? I don't know of anyone using size() for
> > something other than checking that their data was properly loaded into
> the
> > cache, and they don't need a super-fast size() for that.
>
> Having a DataContainer per segment would actually reduce required
> memory usage for the distributed iterator as well, since we can query
> data by segment much more efficiently and close out segments one by
> one per node instead of having to keep multiple open at once.  When I
> asked about this before it was kind of a we can deal with it later
> kind thing.  I would think this would increase ST operation time as
> well.
>

You mean it would improve ST performance, because it wouldn't have to
compute the hash of each key in the data container?

I don't think we have ever considered splitting the data container for ST,
as it didn't seem worth the trouble. OTOH we wanted to add a segment-based
query to the cache loader SPI every since we started designing NBST :)

> >
> >>
> >>
> >> Regarding cache stores, IMO we're damned anyway: when calling
> >> cacheStore.size(), it can report more entries as those haven't been
> >> expired yet, it can report less entries as those can be expired due to
> >> [1]. Or, we'll enumerate all the entries, and that's going to be slow
> >> (btw., [1] reminded me that we should enumerate both datacontainer AND
> >> cachestores even if passivation is not enabled).
> >
> >
> > Exactly, we need to iterate all the entries from the stores if we want
> > something remotely accurate (although I had forgotten about expiration
> also
> > being a problem). Otherwise we could just leave size() as it is now, it's
> > pretty fast :)
> >
> >>
> >>
> >> Radim
> >>
> >> [1] https://issues.jboss.org/browse/ISPN-3202
> >>
> >> On 10/08/2014 04:42 PM, William Burns wrote:
> >> > So it seems we would want to change this for 7.0 if possible since it
> >> > would be a bigger change for something like 7.1 and 8.0 would be even
> >> > further out.  I should be able to put this together for CR2.
> >> >
> >> > It seems that we want to implement keySet, values and entrySet methods
> >> > using the entry iterator approach.
> >> >
> >> > It is however unclear for the size method if we want to use MR entry
> >> > counting and not worry about the rehash and passivation issues since
> >> > it is just an estimation anyways.  Or if we want to also use the entry
> >> > iterator which should be closer approximation but will require more
> >> > network overhead and memory usage.
> >> >
> >> > Also we didn't really talk about the fact that these methods would
> >> > ignore ongoing transactions and if that is a concern or not.
> >> >
> >> >   - Will
> >> >
> >> > On Wed, Oct 8, 2014 at 10:13 AM, Mircea Markus <mmarkus at redhat.com>
> >> > wrote:
> >> >> On Oct 8, 2014, at 15:11, Dan Berindei <dan.berindei at gmail.com>
> wrote:
> >> >>
> >> >>> On Wed, Oct 8, 2014 at 5:03 PM, Mircea Markus <mmarkus at redhat.com>
> >> >>> wrote:
> >> >>> On Oct 3, 2014, at 9:30, Radim Vansa <rvansa at redhat.com> wrote:
> >> >>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> recently we had a discussion about what size() returns, but I've
> >> >>>> realized there are more things that users would like to know. My
> >> >>>> question is whether you think that they would really appreciate it,
> >> >>>> or
> >> >>>> whether it's just my QA point of view where I sometimes compute the
> >> >>>> 'checksums' of cache to see if I didn't lost anything.
> >> >>>>
> >> >>>> There are those sizes:
> >> >>>> A) number of owned entries
> >> >>>> B) number of entries stored locally in memory
> >> >>>> C) number of entries stored in each local cache store
> >> >>>> D) number of entries stored in each shared cache store
> >> >>>> E) total number of entries in cache
> >> >>>>
> >> >>>> So far, we can get
> >> >>>> B via withFlags(SKIP_CACHE_LOAD).size()
> >> >>>> (passivation ? B : 0) + firstNonZero(C, D) via size()
> >> >>>> E via distributed iterators / MR
> >> >>>> A via data container iteration + distribution manager query, but
> only
> >> >>>> without cache store
> >> >>>> C or D through
> >> >>>>
> >> >>>>
> getComponentRegistry().getLocalComponent(PersistenceManager.class).getStores()
> >> >>>>
> >> >>>> I think that it would go along with users' expectations if size()
> >> >>>> returned E and for the rest we should have special methods on
> >> >>>> AdvancedCache. That would of course change the meaning of size(),
> but
> >> >>>> I'd say that finally to something that has firm meaning.
> >> >>>>
> >> >>>> WDYT?
> >> >>> There was a lot of arguments in past whether size() and other
> methods
> >> >>> that operate over all the elements (keySet, values) are useful
> because:
> >> >>> - they are approximate (data changes during iteration)
> >> >>> - they are very resource consuming and might be miss-used (this is
> the
> >> >>> reason we chosen to use size() with its current local semantic)
> >> >>>
> >> >>> These methods (size, keys, values) are useful for people and I think
> >> >>> we were not wise to implement them only on top of the local data:
> this is
> >> >>> like preferring efficiency over correctness. This also created a
> lot of
> >> >>> confusion with our users, question like size() doesn't return the
> correct
> >> >>> value being asked regularly. I totally agree that size() returns E
> (i.e.
> >> >>> everything that is stored within the grid, including persistence)
> and it's
> >> >>> performance implications to be documented accordingly. For keySet
> and values
> >> >>> - we should stop implementing them (throw exception) and point
> users to
> >> >>> Will's distributed iterator which is a nicer way to achieve the
> desired
> >> >>> behavior.
> >> >>>
> >> >>> We can also implement keySet() and values() on top of the
> distributed
> >> >>> entry iterator and document that using the iterator directly is
> better.
> >> >> Yes, that's what I meant as well.
> >> >>
> >> >> Cheers,
> >> >> --
> >> >> Mircea Markus
> >> >> Infinispan lead (www.infinispan.org)
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> infinispan-dev mailing list
> >> >> infinispan-dev at lists.jboss.org
> >> >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> >> > _______________________________________________
> >> > infinispan-dev mailing list
> >> > infinispan-dev at lists.jboss.org
> >> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
> >>
> >>
> >> --
> >> Radim Vansa <rvansa at redhat.com>
> >> JBoss DataGrid QA
> >>
> >> _______________________________________________
> >> infinispan-dev mailing list
> >> infinispan-dev at lists.jboss.org
> >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> >
> >
> >
> > _______________________________________________
> > infinispan-dev mailing list
> > infinispan-dev at lists.jboss.org
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20141009/d0a0c200/attachment-0001.html