On 2/1/12 10:25 AM, Dan Berindei wrote:
> That's not the way it works; at startup of F, it sends its IP address
> with the discovery request. Everybody returns its IP address with the
> discovery response, so even though we have F only talking to A (the
> coordinator) initially, F will also know the IP addresses of A,B,C,D and E.
>
Ok, I stand corrected... since we start all the nodes on the same
thread, each of them should reply to the discovery requests of the nodes
that start after it.
Hmm, can you reproduce this every time? If so, can you send me the
program so I can run it here?
However, num_initial_members was set to 3 (the Infinispan default).
Could that make PING not wait for all the responses? If so, I suggest we
set a (much) higher num_initial_members and a lower timeout in the
default configuration.
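Something along these lines in the PING element of the stack config (the
values below are just placeholders, not a recommendation):

    <PING timeout="1000" num_initial_members="10"/>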
Yes, the discovery could return quickly, but the responses would still be
processed even if they were received later, so I don't think that's the
issue. The initial discovery should discover *all* IP addresses; having
to trigger a discovery later because an IP address wasn't found should
always be the exceptional case!
If you start members in turn, then they should easily form a cluster and
not even merge. Here's what can happen on a merge:
- The view is A|1={A,B}, both A and B have IP addresses for A and B
- The view splits into A|2={A} and B|2={B}
- A now marks B's IP address as removable and B marks A's IP address as
removable
- If the cache grows to over 500 entries
(TP.logical_addr_cache_max_size) or TP.logical_addr_cache_expiration
milliseconds elapse (whichever comes first), the entries marked as
removable are removed
- If, *before that*, the merge view A|3={A,B} is installed, A unmarks B
and B unmarks A, so the entries won't get removed
So one hypothesis for how those IP addresses got removed is that the
cluster had a couple of merges that didn't heal for 2 minutes (?). Hard
to believe, though...
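For reference, both knobs live on the transport element; roughly like
this, with the defaults quoted above (exact defaults may differ per
version):

    <UDP ...
         logical_addr_cache_max_size="500"
         logical_addr_cache_expiration="120000"/>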
We have to get to the bottom of this, so it would be great if you had a
program that reproduced this, that I could run myself. The main
question is why the IP address for the target is gone and/or why the IP
address wasn't received in the first place.
In any case, replacing MERGE2 with MERGE3 might help a bit, as MERGE3
[1] periodically broadcasts each member's logical name and physical (IP)
address:
"An INFO message carries the logical name and physical address of a
member. Compared to MERGE2, this allows us to immediately send messages
to newly merged members, and not have to solicit this information first."
(copied from the documentation)
>> Note that everything is blocked at this point, we
>> won't send another message in the entire cluster until we got the
>> physical address.
Understood. Let me see if I can block sending of the message for a max
time (say 2 seconds) until I get the IP address. Not very nice, and I'd
prefer a different approach (plus we need to see why this happens in the
first place anyway)...
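Roughly what I have in mind, as an illustration only (not actual JGroups
code; the two helper methods are hypothetical, Address/PhysicalAddress
are the org.jgroups types):

    // Wait up to maxWaitMs for the physical address of 'dest' to show up
    // in the cache instead of dropping the message right away.
    PhysicalAddress waitForPhysicalAddress(Address dest, long maxWaitMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        PhysicalAddress phys = lookupPhysicalAddress(dest); // hypothetical cache lookup
        while (phys == null && System.currentTimeMillis() < deadline) {
            triggerDiscoveryFor(dest);   // hypothetical: ask the cluster for the mapping
            Thread.sleep(100);           // small poll interval
            phys = lookupPhysicalAddress(dest);
        }
        return phys; // null means the caller still has to drop (or queue) the message
    }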
> As I said; this is an exceptional case, probably caused by Sanne
> starting 12 channels inside the same JVM, at the same time, therefore
> causing a traffic spike, which results in dropped discovery requests or
> responses.
>
Bela, we create the caches on a single thread, so we never have more
than one node joining at the same time.
At most we could have some extra activity if one node can't join the
existing cluster and starts a separate partition, but hardly enough to
cause congestion.
Hmm, that indeed doesn't sound like it would be an issue...
> After that, when F wants to talk to C, it asks the cluster for C's IP
> address, and that should be a few ms at most.
>
Ok, so when F wanted to send the ClusteredGetCommand request to C,
PING got the physical address right away. But the ClusteredGetCommand
had to wait for STABLE to kick in and for C to ask for retransmission
(because we didn't send any other messages).
Yep. Before I implement some blocking until we have the IP address, or a
timeout elapses, I'd like to try to get to the bottom of this problem
first!
Maybe *we* should use RSVP for our ClusteredGetCommands, since those
can never block... Actually, we don't want to retransmit the request
if we already got a response from another node, so it would be best if
we could ask for retransmission of a particular request explicitly ;-)
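Something like this on the Infinispan side, assuming the stack also gets
an RSVP protocol; a hedged sketch only, and the exact flag-setting API
differs a bit across JGroups 3.x versions:

    // Mark the request so the sender blocks until the message has been
    // received (requires an <RSVP/> protocol in the stack).
    // 'channel', 'target' and 'requestBuffer' are assumed to be in scope.
    Message msg = new Message(target, null, requestBuffer);
    msg.setFlag(Message.Flag.RSVP);
    channel.send(msg);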
I'd rather implement the blocking approach above! :-)
I wonder if we could also decrease desired_avg_gossip and
stability_delay in STABLE. After all, an extra STABLE round can't slow
us down when we're not doing anything, and when we are busy we're going
to hit the max_bytes limit much sooner than the desired_avg_gossip time
limit anyway.
I don't think this is a good idea as it will generate more traffic. The
stable task is not skipped when we have a lot of traffic, so this will
compound the issue.
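For context, these are the STABLE knobs in question; the values below are
illustrative only, not necessarily the shipped defaults:

    <pbcast.STABLE desired_avg_gossip="50000"
                   stability_delay="1000"
                   max_bytes="400000"/>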
>> I'm also not sure what to make of these lines:
>>
>>>>> [org.jgroups.protocols.UDP] sanne-55119: no physical address for
>>>>> sanne-53650, dropping message
>>>>> [org.jgroups.protocols.pbcast.GMS] JOIN(sanne-55119) sent to
>>>>> sanne-53650 timed out (after 3000 ms), retrying
>>
>> It appears that sanne-55119 knows the logical name of sanne-53650, and
>> the fact that it's coordinator, but not its physical address.
>> Shouldn't all of this information have arrived at the same time?
>
> Hmm, correct. However, the logical names are kept in (a static)
> UUID.cache and the IP addresses in TP.logical_addr_cache.
>
Ah, so if we have 12 nodes in the same VM they automatically know each
other's logical name - they don't need PING at all!
Yes. Note that logical names are not the problem; even if we evict some
logical name from the cache (and we do this only for removed members),
JGroups will still work as it only needs UUIDs and IP addresses.
Does the logical cache get cleared on channel stop? I think that would
explain another weird thing I was seeing in the test suite logs:
sometimes everyone in a cluster would suddenly forget everyone else's
logical name and start logging UUIDs.
On a view change, we remove all entries which are *not* in the new view.
However, 'removing' is again simply marking those members as
'removable', and only if the cache grows beyond 500
(-Djgroups.uuid_cache.max_entries=500) entries will all entries older
than 5 seconds (-Djgroups.uuid_cache.max_age=5000) be removed. (There is
no separate reaper task running for this).
So, yes, this can happen, but on the next discovery round, we'll have
the correct values. Again, as I said, UUID.cache is not as important as
TP.logical_addr_cache.
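If the evictions bother the test suite, those two properties can simply
be raised; a hedged example with arbitrary values:

    // Likely needs to be set before the first channel is created, since
    // the UUID cache is static.
    System.setProperty("jgroups.uuid_cache.max_entries", "2000"); // default 500
    System.setProperty("jgroups.uuid_cache.max_age", "60000");    // default 5000 ms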
This is running the Transactional benchmark, so it would be simpler if we
enabled PING trace in the configuration and disabled it before the actual
benchmark starts. I'm going to try it myself :)
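Assuming log4j is the logging backend of the benchmark, a rough sketch of
what I mean:

    // Enable PING tracing while the cluster forms, then turn it off again
    // before the measured part of the run.
    org.apache.log4j.Logger ping =
        org.apache.log4j.Logger.getLogger("org.jgroups.protocols.PING");
    ping.setLevel(org.apache.log4j.Level.TRACE);  // during cache/channel startup
    // ... start the caches, wait for the cluster to form ...
    ping.setLevel(org.apache.log4j.Level.WARN);   // before the actual benchmark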
How do you run 12 instances? Did you change something in the config?
I'd be interested in trying the *exact* same config you're running, to
see what's going on!
[1] http://www.jgroups.org/manual-3.x/html/protlist.html#MERGE3
--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat