[infinispan-dev] Put issues with newly joining node

Dan Berindei dan.berindei at gmail.com
Wed Dec 5 11:24:16 EST 2012


Yes, no visible impact.


On Wed, Dec 5, 2012 at 5:46 PM, Sanne Grinovero <sanne at infinispan.org> wrote:

> So to make sure I understood that, this has no visible impact on the
> functionality of API methods, correct? Like any get operation would
> successfully retrieve a remote entry if one exists somewhere?
>
> On 5 December 2012 15:42, Dan Berindei <dan.berindei at gmail.com> wrote:
> >
> > On Wed, Dec 5, 2012 at 4:20 PM, Sanne Grinovero <sanne at infinispan.org>
> > wrote:
> >>
> >> On 5 December 2012 14:01, Bela Ban <bban at redhat.com> wrote:
> >> >
> >> > On 12/5/12 1:23 PM, Sanne Grinovero wrote:
> >> >> On 5 December 2012 11:02, Galder Zamarreño <galder at redhat.com>
> >> >> wrote:
> >> >>> On Dec 4, 2012, at 10:22 AM, Sanne Grinovero <sanne at infinispan.org>
> >> >>> wrote:
> >> >>>
> >> >>>> On 4 December 2012 09:14, Galder Zamarreño <galder at redhat.com>
> >> >>>> wrote:
> >> >>>>> Hey Dan/Adrian,
> >> >>>>>
> >> >>>>> Re: https://issues.jboss.org/browse/ISPN-2541
> >> >>>>>
> >> >>>>> I'm looking at this intermittent failure, and it seems to be
> >> >>>>> caused by the fact that the test does not wait for the cluster
> >> >>>>> to be formed when the new node is started, which can lead to a
> >> >>>>> replication timeout failure from the newly joining node.
> >> >>>>>
> >> >>>>> The test can easily be fixed by waiting for the cluster to form
> >> >>>>> and then making the call.
> >> >>>>>
> >> >>>> [...]
> >> >>>>
> >> >>>> I don't think the cache should ever be in an illegal state to be
> >> >>>> used after being started. So Infinispan should not require tests
> >> >>>> to wait for a "cluster to be formed", I'd rather guarantee that
> >> >>>> after a cache is started it's usable.
> >> >>> Precisely, which is why I raised the flag instead of going down the
> >> >>> easy path.
> >> >>>
> >> >>>> If this is not possible, then any application would also need to
> >> >>>> wait for that "cluster formed" event, and we should expose an API
> >> >>>> for that.
> >> >>> The problem is considering when a cluster is formed. How many nodes
> >> >>> should you wait for?
> >> >>>
> >> >> Why can't we rely on JGroups Discovery to know that? As a user I
> >> >> already specified the expected initial group size with
> >> >> num_initial_members.
> >> >> I don't want to repeat that configuration ;-)
> >> >
> >> >
> >> > I don't understand this discussion: when a new node joins, it'll return
> >> > from JChannel.connect() when it received a JOIN response from the
> >> > coordinator, with the current view... or are you guys talking about
> >> > Infinispan's 'service views' ?
> >>
> >> +1
> >>
> >> That's why I'm confused too, and not understanding how it is possible
> >> that a Cache is returned to the application - which doesn't have a
> >> clue about number of expected nodes - in a state for which the
> >> "cluster is not formed yet". That should never happen!?
> >>
> >
> > It's simple: getCache() returns once the joiner has received ownership of
> > some segments (in distributed mode) and once it has received all the data
> > it owns (dist and repl). This does not guarantee that the other nodes see
> > the joiner as a full member at the time getCache() has returned.
> >
> > This doesn't mean that the cache is not functional, on the contrary we
> > could return even before the joiner had received the data and the cache
> > would still work. But because some nodes think state transfer is still in
> > progress, the tests do run into state transfer corner cases that aren't
> > handled properly (they're getting rarer, but we still have them).
> >
> >
> >>
> >> I never understood why the test framework in Infinispan requires this
> >> to happen in all tests - even in the cases listed by Mircea that the
> >> testsuite is looking for something very specific, I would expect the
> >> wait to be unnecessary. (or more precisely, to have been blocked
> >> already for long enough)
> >>
> >
> > getCache() only waits enough for the cache to "work", it doesn't wait
> > (and I don't think it should wait) for all the other nodes to acknowledge
> > the joiner as a full member (i.e. in the "read" consistent hash). Because
> > of this, assertions made on nodes other than the joiner can fail (in
> > addition to the aforementioned corner cases in state transfer).
> >
> > It's also possible (and it was quite likely with older JGroups versions)
> > that a joiner would actually form a new cluster by itself instead of
> > joining the existing nodes in a single cluster. When that happens,
> > getCache() definitely returns without the cluster being formed, and we
> > have to wait for the separate clusters to find each other and merge
> > before running our test.
> >
> > Cheers
> > Dan
> >
> >
> > _______________________________________________
> > infinispan-dev mailing list
> > infinispan-dev at lists.jboss.org
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
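
The kind of wait the tests perform before asserting on nodes other than the
joiner can be sketched roughly as below. This is a minimal, self-contained
illustration and not Infinispan's test code: the class ClusterWait, the method
awaitViewSize and the one-IntSupplier-per-node representation are all made up
for the example. It only checks the member count each node reports, so it
covers the "separate clusters have to merge" case; it does not check whether
the joiner is already in the "read" consistent hash.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

/** Hypothetical test helper for illustration; not part of Infinispan or JGroups. */
final class ClusterWait {

    private ClusterWait() {}

    /**
     * Polls until every node reports expectedSize members or the timeout elapses.
     * Each IntSupplier stands in for "the number of members this node currently sees".
     *
     * @return true if all nodes agreed in time, false on timeout or interruption
     */
    static boolean awaitViewSize(List<IntSupplier> viewSizes, int expectedSize,
                                 long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            boolean allAgree = true;
            for (IntSupplier viewSize : viewSizes) {
                if (viewSize.getAsInt() != expectedSize) {
                    allAgree = false;
                    break;
                }
            }
            if (allAgree) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out: some node still sees a different view
            }
            try {
                Thread.sleep(10); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

In a real test each supplier would wrap whatever the node exposes for its
current membership (an assumption here: something like the size of the cache
manager's member list), and the timeout keeps a stuck merge from hanging the
test run.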