[infinispan-dev] again: "no physical address"

Wed Feb 1 10:18:06 EST 2012

On 2/1/12 10:25 AM, Dan Berindei wrote:

>> That's not the way it works; at startup of F, it sends its IP address
>> with the discovery request. Everybody returns its IP address with the
>> discovery response, so even though we have F only talking to A (the
>> coordinator) initially, F will also know the IP addresses of A,B,C,D and E.
>>
>
> Ok, I stand corrected... since we start all the nodes on the same
> thread, each of them should reply to the discovery request of the next nodes.

Hmm, can you reproduce this every time? If so, can you send me the
program so I can run it here?

> However, num_initial_members was set to 3 (the Infinispan default).
> Could that make PING not wait for all the responses? If it's like
> that, then I suggest we set a (much) higher num_initial_members and a
> lower timeout in the default configuration.

Yes, the discovery could return quickly, but responses received later
would still be processed, so I don't think that's the issue.

The initial discovery should discover *all* IP addresses; triggering a
discovery later because an IP address wasn't found should always be the
exceptional case!

If you start members in turn, then they should easily form a cluster and 
not even merge. Here's what can happen on a merge:
- The view is A|1={A,B}, both A and B have IP addresses for A and B
- The view splits into A|2={A} and B|2={B}
- A now marks B's IP address as removable and B marks A's IP address as 
removable
- If the cache grows to over 500 entries 
(TP.logical_addr_cache_max_size) or TP.logical_addr_cache_expiration 
milliseconds elapse (whichever comes first), the entries marked as 
removable are removed
- If, *before that* the merge view A|3={A,B} is installed, A unmarks B 
and B unmarks A, so the entries won't get removed
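
To make the marking scheme above concrete, here is a minimal sketch in
Java. It is a hypothetical illustration, not the actual JGroups cache
code: the class AddressCacheSketch, its methods, and the addresses are
made up, and the limits passed to the constructor (500 entries, 120000
ms) are only the figures mentioned in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a lazily cleaned address cache; not JGroups code.
class AddressCacheSketch {
    private static final class Entry {
        final String physicalAddress;
        boolean removable;      // set when the member drops out of the view
        long markedAtMillis;    // when the entry was marked removable
        Entry(String physicalAddress) { this.physicalAddress = physicalAddress; }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final int maxSize;
    private final long expirationMillis;

    AddressCacheSketch(int maxSize, long expirationMillis) {
        this.maxSize = maxSize;
        this.expirationMillis = expirationMillis;
    }

    void put(String member, String physicalAddress) {
        entries.put(member, new Entry(physicalAddress));
    }

    String get(String member) {
        Entry e = entries.get(member);
        return e == null ? null : e.physicalAddress;
    }

    // Installing a view marks members outside it and unmarks members inside it.
    void installView(Set<String> view, long nowMillis) {
        for (Map.Entry<String, Entry> me : entries.entrySet()) {
            Entry e = me.getValue();
            if (view.contains(me.getKey())) {
                e.removable = false;
            } else if (!e.removable) {
                e.removable = true;
                e.markedAtMillis = nowMillis;
            }
        }
    }

    // Marked entries are dropped once the cache is over maxSize or they have
    // been marked for at least expirationMillis, whichever comes first.
    void evict(long nowMillis) {
        boolean overSize = entries.size() > maxSize;
        entries.values().removeIf(e -> e.removable
                && (overSize || nowMillis - e.markedAtMillis >= expirationMillis));
    }

    public static void main(String[] args) {
        AddressCacheSketch a = new AddressCacheSketch(500, 120_000);
        a.put("A", "10.0.0.1:7800");
        a.put("B", "10.0.0.2:7800");
        a.installView(Set.of("A"), 0);            // split: B is marked
        a.installView(Set.of("A", "B"), 60_000);  // merge: B is unmarked
        a.evict(200_000);
        System.out.println("B after merge: " + a.get("B"));
    }
}
```

In this sketch B's address only disappears if the merge view arrives
after the expiration (or the cache is over its size limit), which is
the scenario the hypothesis below depends on.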

So one hypothesis for how those IP addresses get removed is that the
cluster had a couple of merges that didn't heal for 2 minutes (?).
That's hard to believe, though...

We have to get to the bottom of this, so it would be great if you had a
program that reproduced this and that I could run myself. The main
question is why the IP address for the target is gone and/or why the IP
address wasn't received in the first place.

In any case, replacing MERGE2 with MERGE3 might help a bit, as MERGE3
[1] periodically broadcasts each member's logical address, logical name
and IP address: "An INFO message carries the logical name and physical
address of a member. Compared to MERGE2, this allows us to immediately
send messages to newly merged members, and not have to solicit this
information first." (copied from the documentation)

>>>   Note that everything is blocked at this point, we
>>> won't send another message in the entire cluster until we got the physical address.

Understood. Let me see if I can block sending of the message for a
maximum time (say 2 seconds) until I get the IP address. That's not very
nice, and I'd prefer a different approach (plus we need to see why this
happens in the first place anyway)...
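
A rough sketch of what such bounded blocking could look like, in Java.
This is only an illustration under assumptions, not how JGroups
implements or will implement it: the class AddressWaitSketch and its
methods are hypothetical, and members and addresses are plain strings
to keep the example self-contained.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a sender waits a bounded time for a physical address.
class AddressWaitSketch {
    private final Map<String, CompletableFuture<String>> pending =
            new ConcurrentHashMap<>();

    // Called when a discovery response delivers a member's physical address.
    void addressReceived(String member, String physicalAddress) {
        pending.computeIfAbsent(member, m -> new CompletableFuture<>())
               .complete(physicalAddress);
    }

    // Blocks until the address is known or timeoutMillis elapses.
    // An empty result means the caller still has no address for the member.
    Optional<String> awaitAddress(String member, long timeoutMillis) {
        CompletableFuture<String> f =
                pending.computeIfAbsent(member, m -> new CompletableFuture<>());
        try {
            return Optional.of(f.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        AddressWaitSketch waiter = new AddressWaitSketch();
        // Simulate a discovery response arriving on another thread.
        new Thread(() -> waiter.addressReceived("C", "10.0.0.3:7800")).start();
        // The sender waits at most 2 seconds before giving up.
        System.out.println(waiter.awaitAddress("C", 2000));
    }
}
```

The sender would call awaitAddress() before handing the message to the
transport and fall back to the current behaviour (drop and rely on
retransmission) if the result is empty.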

>> As I said; this is an exceptional case, probably caused by Sanne
>> starting 12 channels inside the same JVM, at the same time, therefore
>> causing a traffic spike, which results in dropped discovery requests or
>> responses.
>>
>
> Bela, we create the caches on a single thread, so we never have more
> than one node joining at the same time.
> At most we could have some extra activity if one node can't join the
> existing cluster and starts a separate partition, but hardly enough to
> cause congestion.

Hmm, that indeed doesn't sound like an issue...

>> After that, when F wants to talk to C, it asks the cluster for C's IP
>> address, and that should be a few ms at most.
>>
>
> Ok, so when F wanted to send the ClusteredGetCommand request to C,
> PING got the physical address right away. But the ClusteredGetCommand
> had to wait for STABLE to kick in and for C to ask for retransmission
> (because we didn't send any other messages).

Yep. Before I implement some blocking until we have the IP address or a
timeout elapses, I'd like to try to get to the bottom of this problem
first!

> Maybe *we* should use RSVP for our ClusteredGetCommands, since those
> can never block... Actually, we don't want to retransmit the request
> if we already got a response from another node, so it would be best if
> we could ask for retransmission of a particular request explicitly ;-)

I'd rather implement the blocking approach above! :-)

> I wonder if we could also decrease desired_avg_gossip and
> stability_delay in STABLE. After all, an extra STABLE round can't slow
> us when we're not doing anything, and when we are busy we're going to
> hit the max_bytes limit much sooner than the desired_avg_gossip time
> limit anyway.

I don't think this is a good idea as it will generate more traffic. The 
stable task is not skipped when we have a lot of traffic, so this will 
compound the issue.

>>> I'm also not sure what to make of these lines:
>>>
>>>>>> [org.jgroups.protocols.UDP] sanne-55119: no physical address for
>>>>>> sanne-53650, dropping message
>>>>>> [org.jgroups.protocols.pbcast.GMS] JOIN(sanne-55119) sent to
>>>>>> sanne-53650 timed out (after 3000 ms), retrying
>>>
>>> It appears that sanne-55119 knows the logical name of sanne-53650, and
>>> the fact that it's the coordinator, but not its physical address.
>>> Shouldn't all of this information have arrived at the same time?
>>
>> Hmm, correct. However, the logical names are kept in (a static)
>> UUID.cache and the IP addresses in TP.logical_addr_cache.
>>
>
> Ah, so if we have 12 nodes in the same VM they automatically know each
> other's logical name - they don't need PING at all!

Yes. Note that logical names are not the problem; even if we evict some 
logical name from the cache (and we do this only for removed members), 
JGroups will still work as it only needs UUIDs and IP addresses.

> Does the logical cache get cleared on channel stop? I think that would
> explain another weird thing I was seeing in the test suite logs,
> sometimes everyone in a cluster would suddenly forget everyone else's
> logical name and start logging UUIDs.

On a view change, we remove all entries which are *not* in the new view.
However, 'removing' is again simply marking those members as
'removable', and only if the cache grows beyond 500 entries
(-Djgroups.uuid_cache.max_entries=500) will all entries older than
5 seconds (-Djgroups.uuid_cache.max_age=5000) be removed. (There is no
separate reaper task running for this.)
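
For illustration, the removal rule just described can be written as a
small Java predicate. The two system property names and their values
are the ones quoted above; the class UuidCacheLimitsSketch and the
method shouldRemove are hypothetical and only restate the rule, they
are not taken from the JGroups sources.

```java
// Hypothetical sketch of the size-triggered, age-based removal rule.
class UuidCacheLimitsSketch {
    // Read the limits from system properties, falling back to the values
    // mentioned in this message when the properties are not set.
    static final int MAX_ENTRIES =
            Integer.getInteger("jgroups.uuid_cache.max_entries", 500);
    static final long MAX_AGE_MILLIS =
            Long.getLong("jgroups.uuid_cache.max_age", 5000L);

    // An entry is dropped only if it was marked removable, the cache has
    // grown beyond MAX_ENTRIES, and the entry is older than MAX_AGE_MILLIS.
    static boolean shouldRemove(int cacheSize, boolean markedRemovable,
                                long ageMillis) {
        return markedRemovable
                && cacheSize > MAX_ENTRIES
                && ageMillis > MAX_AGE_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println("max entries = " + MAX_ENTRIES
                + ", max age (ms) = " + MAX_AGE_MILLIS);
        // A small cache keeps even old, marked entries.
        System.out.println(shouldRemove(12, true, 60_000));
    }
}
```

With only 12 members the size limit is never exceeded, so under this
rule nothing would be removed no matter how old the marked entries are.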

So, yes, this can happen, but on the next discovery round, we'll have 
the correct values. Again, as I said, UUID.cache is not as important as 
TP.logical_addr_cache.

> This is running the Transactional benchmark, so it would be simpler if
> we enabled PING trace in the configuration and disabled it before the
> actual benchmark starts. I'm going to try it myself :)

How do you run 12 instances? Did you change something in the config?
I'd be interested in trying the *exact* same config you're running, to
see what's going on!

[1] http://www.jgroups.org/manual-3.x/html/protlist.html#MERGE3

-- 
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat