[infinispan-dev] again: "no physical address"

Dan Berindei dan.berindei at gmail.com
Thu Feb 2 01:46:17 EST 2012


Hi Bela

I think I found why you weren't seeing the warnings. The
bench-log4j.xml in GitHub master is configured to log only to the log
file (benchmark.log). If you add an <appender-ref ref="CONSOLE"/>
you'll see the warnings on the console as well.
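
For reference, the change is just an extra appender-ref on the root
logger of bench-log4j.xml, something like this (a minimal sketch; I'm
assuming the file already defines appenders named CONSOLE and FILE, so
adjust the names to whatever it actually declares):

   <root>
      <priority value="INFO"/>
      <appender-ref ref="FILE"/>
      <appender-ref ref="CONSOLE"/>
   </root>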

I am now able to reproduce it pretty reliably, even from IntelliJ. I
created a run configuration for the Transactional class, set the
module to infinispan-5.1-SNAPSHOT, and added these JVM arguments:

-Dbench.loops=3000
-Dbench.writerThreads=33
-Dbench.readerThreads=10
-Dbench.dist=true
-Dbench.transactional=true
-Dbench.numkeys=500
-Dbench.payloadsize=10240
-Dbench.nodes=12
-Djava.net.preferIPv4Stack=true
-Dlog4j.configuration=bench-log4j.xml


In Sanne's thread dump it was a ClusteredGetCommand that was waiting
for a physical address, but in my test runs it's usually a JOIN that
fails. It seems to happen because we get discovery responses from 3
nodes (3 being our num_initial_members), but none of them is the
coordinator:

2012-02-02 08:04:41,672 TRACE [org.jgroups.protocols.PING] (main)
discovery took 8 ms: responses: 3 total (2 servers (0 coord), 1
clients)
2012-02-02 08:04:41,672 TRACE [org.jgroups.protocols.pbcast.GMS]
(main) localhost-47135: initial_mbrs are localhost-36539
localhost-30921
2012-02-02 08:04:41,672 DEBUG [org.jgroups.protocols.pbcast.GMS]
(main) election results: {localhost-32436=2}
2012-02-02 08:04:41,672 DEBUG [org.jgroups.protocols.pbcast.GMS]
(main) sending JOIN(localhost-47135) to localhost-32436
2012-02-02 08:04:41,673 TRACE [org.jgroups.protocols.UNICAST2] (main)
localhost-47135: created connection to localhost-32436 (conn_id=0)
2012-02-02 08:04:41,673 TRACE [org.jgroups.protocols.UNICAST2] (main)
localhost-47135 --> DATA(localhost-32436: #1, conn_id=0, first)
2012-02-02 08:04:41,673 TRACE [org.jgroups.protocols.UDP] (main)
sending msg to localhost-32436, src=localhost-47135, headers are GMS:
GmsHeader[JOIN_REQ]: mbr=localhost-47135, UNICAST2: DATA, seqno=1,
first, UDP: [channel_name=ISPN]
2012-02-02 08:04:41,673 WARN  [org.jgroups.protocols.UDP] (main)
localhost-47135: no physical address for localhost-32436, dropping
message
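
If bumping num_initial_members turns out to be the fix, the change would
go on the PING element of the UDP stack, roughly like this (just a
sketch, with values matching my 12-node run, not a tested
recommendation):

   <PING timeout="3000" num_initial_members="12"/>

i.e. wait for responses from as many members as we actually expect,
instead of stopping after the first 3 even when none of them is the
coordinator.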


Cheers
Dan


On Wed, Feb 1, 2012 at 7:12 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
> On 1 February 2012 16:40, Bela Ban <bban at redhat.com> wrote:
>> Your benchmark is giving me the creeps !
>
> Manik was the original author; I've only been adapting it slightly to
> identify performance issues. I wouldn't have used Maven either, but
> it's serving me well, especially since it turns out I have to change
> the JGroups version frequently ;-)
>
>> First, which version of JGroups / Infinispan does this use ? Second, is
>> there a way to start this in an IDE rather than through maven ? Third, I
>> don't think bench-jgroups.xml is picked up at all ! How do I make a change
>> to bench-jgroups.xml and have Transactional *use* it ?
>
> 1) The intention is to compare Infinispan versions and see how we
> make progress, so each Infinispan version runs with the JGroups
> version it was released with, or, in the case of a SNAPSHOT, the
> version it currently depends on.
>
> So, assuming you have a fresh snapshot of Infinispan 5.1.0-SNAPSHOT,
> that's JGroups 3.0.4.Final.
>
> 2) As I pointed out in the previous mail, I've reconfigured it for you
> to not pick up bench-jgroups.xml but to go with Infinispan's default
> UDP configuration.
> You have to comment out line 15 of bench.sh to use bench-jgroups.xml,
> but then I can't reproduce the issue anymore.
>
>> This maven crap hides so much, I never know what's going on under the covers
>> !@$@ Do I have to do mvn install or mvn package when I make a change to
>> bench-jgroups.xml ?
>
> Aren't you using bench.sh, as I pointed out in my previous mail? It
> runs mvn install before running the benchmark.
> But it's also configured NOT to use bench-jgroups.xml, but rather
> jgroups-udp.xml.
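>
> (For reference, in a plain Infinispan 5.1 XML configuration the JGroups
> file is normally selected through the transport properties, something
> like the sketch below; the benchmark may set the equivalent
> programmatically instead, so treat this only as an illustration.)
>
>    <global>
>       <!-- the cluster name here is arbitrary -->
>       <transport clusterName="benchmark-cluster">
>          <properties>
>             <property name="configurationFile" value="jgroups-udp.xml"/>
>          </properties>
>       </transport>
>    </global>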
>
>> I don't think I'll ever switch to this f*cking piece of shit !
>
> I don't think Maven is to be blamed today! What's wrong?
>
> Anyway, one of the nice things about this little benchmark is exactly
> that it's a single class with a main method, so you can just import it
> into your IDE and run it. Any IDE will pick up the correct dependencies,
> thanks to Maven. Just note that if you do, it will use the default test
> properties hardcoded in the test class org.infinispan.benchmark.Transactional:
> please set the same system properties that bench.sh sets, unless you
> want to run a different configuration from mine.
>
> Just ping me if you have any doubts.
>
> Cheers,
> Sanne
>
>>
>> Please ping me as soon as you've calmed down... :-)
>> Cheers,
>>
>>
>>
>>
>> On 2/1/12 4:52 PM, Sanne Grinovero wrote:
>>>
>>> On 1 February 2012 15:18, Bela Ban<bban at redhat.com>  wrote:
>>>>
>>>>
>>>>
>>>> On 2/1/12 10:25 AM, Dan Berindei wrote:
>>>>
>>>>>> That's not the way it works; at startup of F, it sends its IP address
>>>>>> with the discovery request. Everybody returns its IP address with the
>>>>>> discovery response, so even though we have F only talking to A (the
>>>>>> coordinator) initially, F will also know the IP addresses of A,B,C,D
>>>>>> and E.
>>>>>>
>>>>>
>>>>> Ok, I stand corrected... since we start all the nodes on the same
>>>>> thread, each of them should reply to the discovery request of the next
>>>>> nodes.
>>>>
>>>>
>>>>
>>>> Hmm, can you reproduce this every time ? If so, can you send me the
>>>> program so I can run it here ?
>>>>
>>>>
>>>>> However, num_initial_members was set to 3 (the Infinispan default).
>>>>> Could that make PING not wait for all the responses? If that's the
>>>>> case, then I suggest we set a (much) higher num_initial_members and a
>>>>> lower timeout in the default configuration.
>>>>
>>>>
>>>>
>>>> Yes, the discovery could return quickly, but the responses would still be
>>>> processed even if they were received later, so I don't think that's the issue.
>>>>
>>>> The initial discovery should discover *all* IP addresses; triggering a
>>>> discovery later because an IP address wasn't found should always be
>>>> the exceptional case !
>>>>
>>>> If you start members in turn, then they should easily form a cluster and
>>>> not even merge. Here's what can happen on a merge:
>>>> - The view is A|1={A,B}, both A and B have IP addresses for A and B
>>>> - The view splits into A|2={A} and B|2={B}
>>>> - A now marks B's IP address as removable and B marks A's IP address as
>>>> removable
>>>> - If the cache grows to over 500 entries
>>>> (TP.logical_addr_cache_max_size) or TP.logical_addr_cache_expiration
>>>> milliseconds elapse (whichever comes first), the entries marked as
>>>> removable are removed
>>>> - If, *before that*, the merge view A|3={A,B} is installed, A unmarks B
>>>> and B unmarks A, so the entries won't get removed
>>>>
>>>> So one hypothesis for how those IP addresses got removed is that the
>>>> cluster had a couple of merges that didn't heal for 2 minutes (?).
>>>> Hard to believe, though...
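>>>>
>>>> For completeness, both of those knobs are plain attributes on the
>>>> transport, so raising them for a test run would look roughly like this
>>>> (arbitrary values, just a sketch):
>>>>
>>>>    <!-- other UDP attributes unchanged -->
>>>>    <UDP logical_addr_cache_max_size="2000"
>>>>         logical_addr_cache_expiration="300000"/>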
>>>>
>>>> We have to get to the bottom of this, so it would be great if you had a
>>>> program that reproduced this, that I could run myself. The main
>>>> question is why the IP address for the target is gone and/or why the IP
>>>> address wasn't received in the first place.
>>>>
>>>> In any case, replacing MERGE2 with MERGE3 might help a bit, as MERGE3
>>>> [1] periodically broadcasts each member's logical address, logical name
>>>> and IP address: "An INFO message carries the logical name and physical
>>>> address of a member. Compared to MERGE2, this allows us to immediately
>>>> send messages to newly merged members, and not have to solicit this
>>>> information first." (copied from the documentation)
>>>>
>>>>
>>>>
>>>>>>> Note that everything is blocked at this point: we
>>>>>>> won't send another message in the entire cluster until we get the
>>>>>>> physical address.
>>>>
>>>>
>>>>
>>>> Understood. Let me see if I can block sending of the message for a max
>>>> time (say 2 seconds) until I get the IP address. Not very nice, and I
>>>> prefer a different approach (plus we need to see why this happens in the
>>>> first place anyway)...
>>>>
>>>>
>>>>>> As I said, this is an exceptional case, probably caused by Sanne
>>>>>> starting 12 channels inside the same JVM at the same time, therefore
>>>>>> causing a traffic spike, which results in dropped discovery requests or
>>>>>> responses.
>>>>>>
>>>>>
>>>>> Bela, we create the caches on a single thread, so we never have more
>>>>> than one node joining at the same time.
>>>>> At most we could have some extra activity if one node can't join the
>>>>> existing cluster and starts a separate partition, but hardly enough to
>>>>> cause congestion.
>>>>
>>>>
>>>>
>>>> Hmm, that indeed doesn't sound like an issue...
>>>>
>>>>
>>>>>> After that, when F wants to talk to C, it asks the cluster for C's IP
>>>>>> address, and that should take a few ms at most.
>>>>>>
>>>>>
>>>>> Ok, so when F wanted to send the ClusteredGetCommand request to C,
>>>>> PING got the physical address right away. But the ClusteredGetCommand
>>>>> had to wait for STABLE to kick in and for C to ask for retransmission
>>>>> (because we didn't send any other messages).
>>>>
>>>>
>>>>
>>>> Yep. Before I implement some blocking until we have the IP address, or a
>>>> timeout elapses, I'd like to try to get to the bottom of this problem
>>>> first !
>>>>
>>>>
>>>>> Maybe *we* should use RSVP for our ClusteredGetCommands, since those
>>>>> can never block... Actually, we don't want to retransmit the request
>>>>> if we already got a response from another node, so it would be best if
>>>>> we could ask for retransmission of a particular request explicitly ;-)
>>>>
>>>>
>>>>
>>>> I'd rather implement the blocking approach above ! :-)
>>>>
>>>>
>>>>> I wonder if we could also decrease desired_avg_gossip and
>>>>> stability_delay in STABLE. After all, an extra STABLE round can't slow
>>>>> us down when we're not doing anything, and when we are busy we're going
>>>>> to hit the max_bytes limit much sooner than the desired_avg_gossip time
>>>>> limit anyway.
>>>>
>>>>
>>>>
>>>> I don't think this is a good idea as it will generate more traffic. The
>>>> stable task is not skipped when we have a lot of traffic, so this will
>>>> compound the issue.
>>>>
>>>>
>>>>>>> I'm also not sure what to make of these lines:
>>>>>>>
>>>>>>>>>> [org.jgroups.protocols.UDP] sanne-55119: no physical address for
>>>>>>>>>> sanne-53650, dropping message
>>>>>>>>>> [org.jgroups.protocols.pbcast.GMS] JOIN(sanne-55119) sent to
>>>>>>>>>> sanne-53650 timed out (after 3000 ms), retrying
>>>>>>>
>>>>>>>
>>>>>>> It appears that sanne-55119 knows the logical name of sanne-53650, and
>>>>>>> the fact that it's the coordinator, but not its physical address.
>>>>>>> Shouldn't all of this information have arrived at the same time?
>>>>>>
>>>>>>
>>>>>> Hmm, correct. However, the logical names are kept in (a static)
>>>>>> UUID.cache and the IP addresses in TP.logical_addr_cache.
>>>>>>
>>>>>
>>>>> Ah, so if we have 12 nodes in the same VM they automatically know each
>>>>> other's logical name - they don't need PING at all!
>>>>
>>>>
>>>>
>>>> Yes. Note that logical names are not the problem; even if we evict some
>>>> logical name from the cache (and we do this only for removed members),
>>>> JGroups will still work as it only needs UUIDs and IP addresses.
>>>>
>>>>
>>>>> Does the logical cache get cleared on channel stop? I think that would
>>>>> explain another weird thing I was seeing in the test suite logs:
>>>>> sometimes everyone in a cluster would suddenly forget everyone else's
>>>>> logical name and start logging UUIDs.
>>>>
>>>>
>>>>
>>>> On a view change, we remove all entries which are *not* in the new view.
>>>> However, 'removing' is again simply marking those members as
>>>> 'removable', and only if the cache grows beyond 500
>>>> (-Djgroups.uuid_cache.max_entries=500) entries will all entries older
>>>> than 5 seconds (-Djgroups.uuid_cache.max_age=5000) be removed. (There is
>>>> no separate reaper task running for this).
>>>>
>>>> So, yes, this can happen, but on the next discovery round, we'll have
>>>> the correct values. Again, as I said, UUID.cache is not as important as
>>>> TP.logical_addr_cache.
>>>>
>>>>
>>>>> This is running the Transactional benchmark, so it would be simpler if
>>>>> we enabled PING trace in the configuration and disabled it before the
>>>>> actual benchmark starts. I'm going to try it myself :)
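>>>>>
>>>>> Something like this in bench-log4j.xml should do for the "enable"
>>>>> part (a sketch; turning it off again before the measured part of the
>>>>> run would have to be done programmatically):
>>>>>
>>>>>    <category name="org.jgroups.protocols.PING">
>>>>>       <priority value="TRACE"/>
>>>>>    </category>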
>>>>
>>>>
>>>>
>>>> How do you run 12 instances ? Did you change something in the config ?
>>>> I'd be interested in trying the *exact* same config you're running, to
>>>> see what's going on !
>>>
>>>
>>> Pushed it for you, committing the exact configuration changes as well:
>>>
>>> git clone git://github.com/Sanne/InfinispanStartupBenchmark.git
>>> cd InfinispanStartupBenchmark
>>> git checkout my
>>> sh bench.sh
>>>
>>> If you look at bench.sh
>>> (as it is at
>>> https://github.com/Sanne/InfinispanStartupBenchmark/blob/my/bench.sh
>>> ),
>>> lines 9, 15 and 28 should be the most interesting.
>>>
>>> You need to run with Infinispan's default configuration to reproduce the
>>> issue,
>>> but I wouldn't mind you commenting on
>>>
>>> https://github.com/Sanne/InfinispanStartupBenchmark/blob/my/benchmark/src/main/resources/bench-jgroups.xml
>>> as well. That's what I'm now using as default for further performance
>>> tests.
>>>
>>> Cheers,
>>> Sanne
>>>
>>>>
>>>>
>>>> [1] http://www.jgroups.org/manual-3.x/html/protlist.html#MERGE3
>>>>
>>>> --
>>>> Bela Ban
>>>> Lead JGroups (http://www.jgroups.org)
>>>> JBoss / Red Hat
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>>
>> --
>> Bela Ban
>> Lead JGroups (http://www.jgroups.org)
>> JBoss / Red Hat
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


