[infinispan-dev] About size()

Fri Oct 10 12:30:54 EDT 2014

On Wed, Oct 8, 2014 at 12:23 PM, Dan Berindei <dan.berindei at gmail.com> wrote:
>
>
> On Wed, Oct 8, 2014 at 6:14 PM, William Burns <mudokonman at gmail.com> wrote:
>>
>> On Wed, Oct 8, 2014 at 10:57 AM, Dan Berindei <dan.berindei at gmail.com>
>> wrote:
>> >
>> >
>> > On Wed, Oct 8, 2014 at 5:42 PM, William Burns <mudokonman at gmail.com>
>> > wrote:
>> >>
>> >> So it seems we would want to change this for 7.0 if possible since it
>> >> would be a bigger change for something like 7.1 and 8.0 would be even
>> >> further out.  I should be able to put this together for CR2.
>> >
>> >
>> > I'm not 100% convinced that we need it for 7.x. For 8.0 I would
>> > recommend
>> > removing the size() method altogether, and providing some looser
>> > "statistics" instead.
>>
>> Yeah I guess I don't know enough about the demand for these methods or
>> what people wanted to use them for to know what kind of priority they
>> should be given.
>>
>> It sounds like you are talking about decoupling from the
>> Map/ConcurrentMap interface completely then, right?  So we would also
>> eliminate the other bulk methods (keySet, values, entrySet)?
>
>
> Yes, I would base the Cache interface on JSR-107's Cache, which doesn't have
> size() or the other methods.
>
>>
>>
>> >
>> >>
>> >>
>> >> It seems that we want to implement keySet, values and entrySet methods
>> >> using the entry iterator approach.
>> >>
>> >> It is however unclear for the size method if we want to use MR entry
>> >> counting and not worry about the rehash and passivation issues since
>> >> it is just an estimation anyways.  Or if we want to also use the entry
>> >> iterator which should be closer approximation but will require more
>> >> network overhead and memory usage.
>> >
>> >
>> > +1 to use the entry iterator from me, ignoring state transfer we can get
>> > some pretty wild fluctuations in the size of the cache.
>>
>> That is personally my feeling as well, but I tend to err more on the
>> side of correctness to begin with.
>>
>> > We could use a distributed task for Cache.isEmpty() instead of size() ==
>> > 0,
>> > though.
>>
>> Yes that should be a good optimization either way.
>>
>> >
>> >>
>> >>
>> >> Also we didn't really talk about the fact that these methods would
>> >> ignore ongoing transactions and if that is a concern or not.
>> >>
>> >
>> > It might be a concern for the Hibernate 2LC impl, it was their TCK that
>> > prompted the last round of discussions about clear().
>>
>> Although I wonder how much these methods are even used since they only
>> work for Local, Replication or Invalidation caches in their current
>> state (and didn't even use loaders until 6.0).
>
>
> There is some more information about the test in the mailing list discussion
> [1]
> There's also a JIRA for clear() [2]
>
> I think 2LC almost never uses distribution, so size() being local-only
> didn't matter, but making it non-tx could cause problems - at least for that
> particular test.

I had toyed around with the following idea before, but I never thought
of it in the scope of the size method solely, but I have a solution
that would work mostly for transactional caches.  Essentially the size
method would always operate in a READ_COMMITTED like state, using
REPEATABLE_READ doesn't seem feasible since we can't keep all the
contents in memory.  Essentially the iterator would be ran and for
each key that is found it checks the context to see if it is there.
If the context entry is marked as removed it doesn't count the key, if
the key is there it marks the key as found and counts it, and if it is
not found it counts it.  Then after iteration it finds all the keys in
the context that were not found and also adds them to the count.  This
way it doesn't need to store additional memory (besides iteration
costs) as all the context information is in memory.

My original thought was to also make the EntryIterator transactional
in the same way which also means the keySet, entrySet and values
methods could do the same things.  The main reason stumbling block I
had was the fact that the iterator and various collections returned
could be used outside of the ongoing transaction which didn't seem to
make much sense to me.  But maybe these should be changed to be more
like backing maps which HashMap, ConcurrentHashMap etc use for their
methods, where instead it would pick up the transaction if there is
one in the current thread and if there is no transaction just start an
implicit one.  This however was a big change from how these
collections work currently in that they are in memory copies only.

What do you guys think?

>
> [1] http://lists.jboss.org/pipermail/infinispan-dev/2013-October/013914.html
> [2] https://issues.jboss.org/browse/ISPN-3656
>
>>
>> >
>> > We haven't talked about what size(), keySet() and values() should return
>> > for
>> > an invalidation cache either... I forget, does the distributed entry
>> > iterator work with invalidation caches?
>>
>> It works the same as a local cache so only the local node contents are
>> returned.  Replicated does the same thing, distributed is the only
>> special case.  This was the only thing that made sense to me, but if
>> you have any ideas that would  be great to hear for possibly enhancing
>> Invalidation iteration.
>
>
> Sounds good to me.
>
> cache.get(k) will search on all the nodes via ClusterLoader, so there is a
> certain appeal in making the entry iterator do the same. But invalidation
> caches are used with an external (non-CacheLoader) source of data anyway, so
> we can never return "all the entries".
>
>>
>> >
>> >
>> >>
>> >>  - Will
>> >>
>> >> On Wed, Oct 8, 2014 at 10:13 AM, Mircea Markus <mmarkus at redhat.com>
>> >> wrote:
>> >> >
>> >> > On Oct 8, 2014, at 15:11, Dan Berindei <dan.berindei at gmail.com>
>> >> > wrote:
>> >> >
>> >> >>
>> >> >> On Wed, Oct 8, 2014 at 5:03 PM, Mircea Markus <mmarkus at redhat.com>
>> >> >> wrote:
>> >> >> On Oct 3, 2014, at 9:30, Radim Vansa <rvansa at redhat.com> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > recently we had a discussion about what size() returns, but I've
>> >> >> > realized there are more things that users would like to know. My
>> >> >> > question is whether you think that they would really appreciate
>> >> >> > it,
>> >> >> > or
>> >> >> > whether it's just my QA point of view where I sometimes compute
>> >> >> > the
>> >> >> > 'checksums' of cache to see if I didn't lost anything.
>> >> >> >
>> >> >> > There are those sizes:
>> >> >> > A) number of owned entries
>> >> >> > B) number of entries stored locally in memory
>> >> >> > C) number of entries stored in each local cache store
>> >> >> > D) number of entries stored in each shared cache store
>> >> >> > E) total number of entries in cache
>> >> >> >
>> >> >> > So far, we can get
>> >> >> > B via withFlags(SKIP_CACHE_LOAD).size()
>> >> >> > (passivation ? B : 0) + firstNonZero(C, D) via size()
>> >> >> > E via distributed iterators / MR
>> >> >> > A via data container iteration + distribution manager query, but
>> >> >> > only
>> >> >> > without cache store
>> >> >> > C or D through
>> >> >> >
>> >> >> >
>> >> >> > getComponentRegistry().getLocalComponent(PersistenceManager.class).getStores()
>> >> >> >
>> >> >> > I think that it would go along with users' expectations if size()
>> >> >> > returned E and for the rest we should have special methods on
>> >> >> > AdvancedCache. That would of course change the meaning of size(),
>> >> >> > but
>> >> >> > I'd say that finally to something that has firm meaning.
>> >> >> >
>> >> >> > WDYT?
>> >> >>
>> >> >> There was a lot of arguments in past whether size() and other
>> >> >> methods
>> >> >> that operate over all the elements (keySet, values) are useful
>> >> >> because:
>> >> >> - they are approximate (data changes during iteration)
>> >> >> - they are very resource consuming and might be miss-used (this is
>> >> >> the
>> >> >> reason we chosen to use size() with its current local semantic)
>> >> >>
>> >> >> These methods (size, keys, values) are useful for people and I think
>> >> >> we
>> >> >> were not wise to implement them only on top of the local data: this
>> >> >> is like
>> >> >> preferring efficiency over correctness. This also created a lot of
>> >> >> confusion
>> >> >> with our users, question like size() doesn't return the correct
>> >> >> value being
>> >> >> asked regularly. I totally agree that size() returns E (i.e.
>> >> >> everything that
>> >> >> is stored within the grid, including persistence) and it's
>> >> >> performance
>> >> >> implications to be documented accordingly. For keySet and values -
>> >> >> we should
>> >> >> stop implementing them (throw exception) and point users to Will's
>> >> >> distributed iterator which is a nicer way to achieve the desired
>> >> >> behavior.
>> >> >>
>> >> >> We can also implement keySet() and values() on top of the
>> >> >> distributed
>> >> >> entry iterator and document that using the iterator directly is
>> >> >> better.
>> >> >
>> >> > Yes, that's what I meant as well.
>> >> >
>> >> > Cheers,
>> >> > --
>> >> > Mircea Markus
>> >> > Infinispan lead (www.infinispan.org)
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > infinispan-dev mailing list
>> >> > infinispan-dev at lists.jboss.org
>> >> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> >>
>> >> _______________________________________________
>> >> infinispan-dev mailing list
>> >> infinispan-dev at lists.jboss.org
>> >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> >
>> >
>> >
>> > _______________________________________________
>> > infinispan-dev mailing list
>> > infinispan-dev at lists.jboss.org
>> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev