[infinispan-dev] Put issues with newly joining node

Sanne Grinovero sanne at infinispan.org
Wed Dec 5 10:46:53 EST 2012


So to make sure I understood that, this has no visible impact on the
functionality of API methods, correct? Like any get operation would
successfully retrieve a remote entry if one exists somewhere?

On 5 December 2012 15:42, Dan Berindei <dan.berindei at gmail.com> wrote:
>
> On Wed, Dec 5, 2012 at 4:20 PM, Sanne Grinovero <sanne at infinispan.org>
> wrote:
>>
>> On 5 December 2012 14:01, Bela Ban <bban at redhat.com> wrote:
>> >
>> > On 12/5/12 1:23 PM, Sanne Grinovero wrote:
>> >> On 5 December 2012 11:02, Galder Zamarreño <galder at redhat.com> wrote:
>> >>> On Dec 4, 2012, at 10:22 AM, Sanne Grinovero <sanne at infinispan.org>
>> >>> wrote:
>> >>>
>> >>>> On 4 December 2012 09:14, Galder Zamarreño <galder at redhat.com> wrote:
>> >>>>> Hey Dan/Adrian,
>> >>>>>
>> >>>>> Re: https://issues.jboss.org/browse/ISPN-2541
>> >>>>>
>> >>>>> I'm looking at this intermittent failure, and it seems to be caused
>> >>>>> by the fact that the test does not wait for the cluster to be formed when
>> >>>>> the new node is started, which can lead to a replication timeout failure from
>> >>>>> the new joining node.
>> >>>>>
>> >>>>> The test can easily be fixed by waiting for the cluster to form and
>> >>>>> then making the call.
>> >>>>>
>> >>>> [...]
>> >>>>
>> >>>> I don't think the cache should ever be in an illegal state to be used
>> >>>> after being started. So Infinispan should not require tests to wait
>> >>>> for a "cluster to be formed", I'd rather guarantee that after a cache
>> >>>> is started it's usable.
>> >>> Precisely, which is why I raised the flag instead of going down the
>> >>> easy path.
>> >>>
>> >>>> If this is not possible, then any application would also need to wait
>> >>>> for that "cluster formed" event, and we should expose an API for
>> >>>> that.
>> >>> The problem is considering when a cluster is formed. How many nodes
>> >>> should you wait for?
>> >>>
>> >> Why can't we rely on JGroups Discovery to know that? As a user I
>> >> already specified the expected initial group size with
>> >> num_initial_members.
>> >> I don't want to repeat that configuration ;-)
>> >
>> >
>> > I don't understand this discussion: when a new node joins, it'll return
>> > from JChannel.connect() once it has received a JOIN response from the
>> > coordinator, with the current view... or are you guys talking about
>> > Infinispan's 'service views' ?
>>
>> +1
>>
>> That's why I'm confused too, and not understanding how it is possible
>> that a Cache is returned to the application - which doesn't have a
>> clue about the number of expected nodes - in a state for which the
>> "cluster is not formed yet". That should never happen!?
>>
>
> It's simple: getCache() returns once the joiner has received ownership of
> some segments (in distributed mode) and once it received all the data it
> owns (dist and repl). This does not guarantee that the other nodes see the
> joiner as a full member at the time getCache() has returned.
>
> This doesn't mean that the cache is not functional, on the contrary we could
> return even before the joiner had received the data and the cache would
> still work. But because some nodes think state transfer is still in
> progress, the tests do run into state transfer corner cases that aren't
> handled properly (they're getting rarer, but we still have them).
>
>
>>
>> I never understood why the test framework in Infinispan requires this
>> to happen in all tests - even in the cases listed by Mircea that the
>> testsuite is looking for something very specific, I would expect the
>> wait to be unnecessary. (or more precisely, to have been blocked
>> already for long enough)
>>
>
> getCache() only waits long enough for the cache to "work"; it doesn't wait (and I
> don't think it should wait) for all the other nodes to acknowledge the
> joiner as a full member (i.e. in the "read" consistent hash). Because of
> this, assertions made on nodes other than the joiner can fail (in addition
> to the aforementioned corner cases in state transfer).
>
> It's also possible (and it was quite likely with older JGroups versions)
> that a joiner would actually form a new cluster by itself instead of joining
> the existing nodes in a single cluster. When that happens, getCache()
> definitely returns without the cluster being formed, and we have to wait for
> the separate clusters to find each other and merge before running our test.
>
> Cheers
> Dan
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
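
For illustration only, the test-side wait discussed above (block until the
separate clusters have found each other and merged) could be sketched roughly
as below. This is a sketch under assumptions, not how the Infinispan testsuite
actually does it: ClusterWait and awaitClusterSize are hypothetical names, and
the size supplier stands in for whatever reports the current view size in a
real test (for example, the size of the member list exposed by the cache
manager). It only covers the view-size case; it does not tell you whether the
other nodes already have the joiner in their "read" consistent hash.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical helper, not part of Infinispan or JGroups.
public final class ClusterWait {

    private ClusterWait() {
    }

    // Polls currentSize until it reports exactly `expected` members or the
    // timeout elapses. Returns true if the expected size was seen in time,
    // false on timeout or if the waiting thread was interrupted.
    public static boolean awaitClusterSize(IntSupplier currentSize, int expected,
                                           long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (currentSize.getAsInt() == expected) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(Math.max(1L, pollMillis));
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated view: the joiner first sees only itself (size 1), and
        // the merged view (size 2) shows up on the third poll.
        AtomicInteger polls = new AtomicInteger();
        IntSupplier simulatedView = () -> polls.incrementAndGet() < 3 ? 1 : 2;

        boolean formed = awaitClusterSize(simulatedView, 2, 1000, 1);
        System.out.println("cluster formed: " + formed);
    }
}
```

The caller has to pass the expected size explicitly, which is exactly the
open question raised earlier in the thread ("How many nodes should you wait
for?"); the sketch does not answer that.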
