On Wed, Feb 1, 2012 at 9:48 AM, Bela Ban <bban(a)redhat.com> wrote:
On 1/31/12 10:55 PM, Dan Berindei wrote:
> Hi Bela
>
> I guess it's pretty clear now... In Sanne's thread dump the main
> thread is blocked in a cache.put() call after the cluster has
> supposedly already formed:
>
> "org.infinispan.benchmark.Transactional.main()" prio=10
> tid=0x00007ff4045de000 nid=0x7c92 in Object.wait()
> [0x00007ff40919d000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on<0x00000007f61997d0> (a
> org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$FutureCollator)
> at
org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$FutureCollator.getResponseList(CommandAwareRpcDispatcher.java:372)
> ...
> at
org.infinispan.distribution.DistributionManagerImpl.retrieveFromRemoteSource(DistributionManagerImpl.java:169)
> ...
> at org.infinispan.CacheSupport.put(CacheSupport.java:52)
> at org.infinispan.benchmark.Transactional.start(Transactional.java:110)
> at org.infinispan.benchmark.Transactional.main(Transactional.java:70)
>
> State transfer was disabled, so during cluster startup the nodes
> only had to communicate with the coordinator and not with each other.
> The put command had to get the old value from another node, so it
> needed the physical address and had to block until PING retrieved it.
That's not the way it works: at startup, F sends its IP address with
the discovery request. Everybody returns its own IP address with the
discovery response, so even though F initially talks only to A (the
coordinator), F will also know the IP addresses of A, B, C, D and E.
Ok, I stand corrected... since we start all the nodes on the same
thread, each node should reply to the discovery requests of the nodes
started after it.
However, num_initial_members was set to 3 (the Infinispan default).
Could that make PING not wait for all the responses? If so, I suggest
we set a (much) higher num_initial_members and a lower timeout in the
default configuration.
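For concreteness, here's roughly what I have in mind (untested sketch;
the values 12 and 1000 are only examples, and I'm assuming
Protocol.setValue() accepts these attribute names - normally we'd just
change the XML):

import org.jgroups.JChannel;
import org.jgroups.protocols.PING;

public class DiscoveryTuning {
    public static void main(String[] args) throws Exception {
        // example config file name, whatever the benchmark actually uses
        JChannel ch = new JChannel("jgroups-udp.xml");
        PING ping = (PING) ch.getProtocolStack().findProtocol(PING.class);
        ping.setValue("num_initial_members", 12); // example: one per benchmark node
        ping.setValue("timeout", 1000L);          // example: wait at most 1s for discovery
        ch.connect("benchmark-cluster");
    }
}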
> Does PING use RSVP
No: (1) I don't want a dependency of Discovery on RSVP, and (2)
discovery is unreliable; discovery requests or responses can get dropped.
Right, I keep forgetting that every protocol is optional!
> or does it wait for the normal STABLE timeout for retransmission?
> Note that everything is blocked at this point; we won't send another
> message in the entire cluster until we get the physical address.
As I said, this is an exceptional case, probably caused by Sanne
starting 12 channels inside the same JVM at the same time, causing a
traffic spike which results in dropped discovery requests or responses.
Bela, we create the caches on a single thread, so we never have more
than one node joining at the same time.
At most we could have some extra activity if one node can't join the
existing cluster and starts a separate partition, but hardly enough to
cause congestion.
After that, when F wants to talk to C, it asks the cluster for C's IP
address, and that should take a few ms at most.
Ok, so when F wanted to send the ClusteredGetCommand request to C,
PING got the physical address right away. But the ClusteredGetCommand
had to wait for STABLE to kick in and for C to ask for retransmission
(because we didn't send any other messages).
Maybe *we* should use RSVP for our ClusteredGetCommands, since those
can never block... Actually, we don't want to retransmit the request
if we already got a response from another node, so it would be best if
we could ask for retransmission of a particular request explicitly ;-)
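If we ever tried that, I imagine it would look something like this
(sketch only; it assumes the RSVP protocol is actually in the stack and
that Message.Flag.RSVP is what triggers it - in the real code it would
go through the RpcDispatcher's request options rather than a raw
Message, but the idea is the same):

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;

public class RsvpSketch {
    // Send a request flagged with RSVP so it gets acked/retransmitted on
    // its own, without waiting for a STABLE round to trigger retransmission.
    static void sendFlagged(JChannel ch, Address target, byte[] request) throws Exception {
        Message msg = new Message(target, null, request);
        msg.setFlag(Message.Flag.RSVP);
        ch.send(msg);
    }
}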
I wonder if we could also decrease desired_avg_gossip and
stability_delay in STABLE. After all, an extra STABLE round can't slow
us down when we're not doing anything, and when we are busy we're going
to hit the max_bytes limit much sooner than the desired_avg_gossip time
limit anyway.
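Something like this is what I mean (example values only, untested; same
setValue() assumption as the discovery sketch above):

import org.jgroups.JChannel;
import org.jgroups.protocols.pbcast.STABLE;

public class StableTuning {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("jgroups-udp.xml"); // example config file name
        STABLE stable = (STABLE) ch.getProtocolStack().findProtocol(STABLE.class);
        // example values: gossip more often and shorten the randomized delay,
        // so retransmission kicks in sooner when the cluster is otherwise idle
        stable.setValue("desired_avg_gossip", 5000L);
        stable.setValue("stability_delay", 500L);
        // max_bytes is left alone: under load it should trigger stability first
        ch.connect("benchmark-cluster");
    }
}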
> I'm sure you've already considered it before, but why not make the
> physical addresses a part of the view installation message? This
> should ensure that every node can communicate with every other node by
> the time the view is installed.
There are a few reasons:
- I don't want to make GMS dependent on physical addresses. GMS is
completely independent and shouldn't know about physical addresses
- At the time GMS kicks in, it's already too late. Remember, F needs to
send a unicast JOIN request to A, but at this point it doesn't yet know
A's address
- MERGE{2,3} also use discovery to detect sub-partitions to be merged,
so discovery needs to be a separate piece of functionality
- A View is already big as it is, and although I've managed to reduce
its size, adding physical addresses would blow it up even more,
especially in large clusters
Thanks for the explanation.
> I'm also not sure what to make of these lines:
>
>>>> [org.jgroups.protocols.UDP] sanne-55119: no physical address for
>>>> sanne-53650, dropping message
>>>> [org.jgroups.protocols.pbcast.GMS] JOIN(sanne-55119) sent to
>>>> sanne-53650 timed out (after 3000 ms), retrying
>
> It appears that sanne-55119 knows the logical name of sanne-53650, and
> the fact that it's coordinator, but not its physical address.
> Shouldn't all of this information have arrived at the same time?
Hmm, correct. However, the logical names are kept in (a static)
UUID.cache and the IP addresses in TP.logical_addr_cache.
Ah, so if we have 12 nodes in the same VM they automatically know each
other's logical name - they don't need PING at all!
Does the logical cache get cleared on channel stop? I think that would
explain another weird thing I was seeing in the test suite logs:
sometimes everyone in a cluster would suddenly forget everyone else's
logical name and start logging UUIDs.
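To illustrate what I mean about the shared static cache, something like
this should print the logical names without any discovery traffic
(sketch; I'm assuming UUID.get() and UUID.printCache() are the public
accessors for that static map):

import org.jgroups.JChannel;
import org.jgroups.util.UUID;

public class NameCachePeek {
    public static void main(String[] args) throws Exception {
        JChannel a = new JChannel("jgroups-udp.xml"); // example config
        JChannel b = new JChannel("jgroups-udp.xml");
        a.setName("node-a");
        b.setName("node-b");
        a.connect("demo");
        b.connect("demo");
        // Both channels live in the same JVM, so the static name cache in
        // UUID already maps b's address to "node-b" without asking PING.
        System.out.println(UUID.get(b.getAddress()));
        System.out.println(UUID.printCache()); // assumed dump of the whole cache
        b.close();
        a.close();
    }
}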
I suggest doing the following when this happens (can you reproduce
this?):
- Before: set enable_diagnostics=true in UDP
- probe.sh op=UDP.printLogicalAddressCache (you can replace probe.sh
with java -cp jgroups.jar org.jgroups.tests.Probe)
Here you can dump the logical caches, to see whether this information is
absent.
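If probe.sh isn't handy, the same request can also be sent from code,
roughly like this (assuming jgroups.jar is on the classpath):

public class DumpLogicalCache {
    public static void main(String[] args) throws Exception {
        // Same as "probe.sh op=UDP.printLogicalAddressCache": the request goes
        // over the diagnostics socket, so enable_diagnostics=true is required.
        org.jgroups.tests.Probe.main(new String[]{"op=UDP.printLogicalAddressCache"});
    }
}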
You could also enable tracing for PING:
probe.sh op=PING.setLevel["trace"]
This is running the Transactional benchmark, so it would be simpler if
we enabled PING trace in the configuration and disabled it before the
actual benchmark starts. I'm going to try it myself :)
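Something along these lines, toggling the level around cluster
formation (sketch; I'm assuming setLevel() is the same method probe
invokes with op=PING.setLevel["trace"]):

import org.jgroups.JChannel;
import org.jgroups.protocols.PING;

public class TraceToggle {
    // Turn PING trace on while the cluster forms, then back down before
    // the timed part of the benchmark starts.
    static void connectWithDiscoveryTrace(JChannel ch, String clusterName) throws Exception {
        PING ping = (PING) ch.getProtocolStack().findProtocol(PING.class);
        ping.setLevel("trace");
        ch.connect(clusterName);
        // ... wait until all expected members have joined ...
        ping.setLevel("warn");
    }
}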
Cheers
Dan