[jboss-jira] [JBoss JIRA] Created: (JGRP-1006) Shared transport: shunned multiple channels can not reconnect

Vladimir Blagojevic (JIRA) jira-events at lists.jboss.org
Wed Jul 1 05:12:31 EDT 2009


Shared transport: shunned multiple channels can not reconnect 
--------------------------------------------------------------

                 Key: JGRP-1006
                 URL: https://jira.jboss.org/jira/browse/JGRP-1006
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.6
            Reporter: Vladimir Blagojevic
            Assignee: Vladimir Blagojevic
             Fix For: 2.6.11


Brian Stansberry said:

One of the EJB3/AS integration testsuite runs hit a JGroups problem last 
night. It looks like some issue between the two AS nodes in the test 
(127.0.0.1 and 127.0.0.2) caused the channels on one node to be 
suspected and then to try to reconnect. That reconnect didn't go well, 
which is the subject of this thread.

I have the server logs from both, which record at INFO or above; if 
anyone wants them, ping me. Here are the highlights:

1) Both nodes connect 4 channels as part of AS startup: 3 using a 
shared-transport UDP stack (for JBC, HAPartition and JBM) and one using 
a TCP stack for JBM.

2) A bunch of test cases then execute. Each deploys an EJB3 jar on each 
node in turn, which causes another JBC channel to be created and 
connected at test start and then disconnected and closed at test end. 
This channel uses the same shared transport as the other 3, so while a 
test is running there are four channels using that shared transport.
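
For reference, here is a minimal sketch of what "several channels on one 
shared transport" looks like in JGroups 2.6, assuming the shared 
transport is enabled via the transport's singleton_name attribute. The 
property string, class name and the JBM/EJB3 cluster names are 
abbreviated and illustrative; this is not the actual AS channel factory 
configuration:

import org.jgroups.JChannel;

public class SharedTransportSketch {
    // Abbreviated stack string; the real AS configs are longer. Every channel
    // whose transport declares the same singleton_name reuses one UDP
    // transport instance (sockets and threads) inside the JVM.
    static final String PROPS =
        "UDP(singleton_name=shared_udp;mcast_port=45588):" +
        "PING:MERGE2:FD_SOCK:FD(timeout=10000):VERIFY_SUSPECT:" +
        "pbcast.NAKACK:UNICAST:pbcast.STABLE:pbcast.GMS";

    public static void main(String[] args) throws Exception {
        // The three long-lived channels created at AS startup.
        JChannel jbc = new JChannel(PROPS);
        JChannel haPartition = new JChannel(PROPS);
        JChannel jbm = new JChannel(PROPS);
        jbc.connect("DefaultPartition-SFSBCache");
        haPartition.connect("DefaultPartition");
        jbm.connect("MessagingPostOffice");        // illustrative cluster name

        // The per-test JBC channel from step 2: connects at test start and
        // closes at test end, against the same shared transport.
        JChannel perTest = new JChannel(PROPS);
        perTest.connect("EJB3EntityCache");        // illustrative cluster name
        perTest.close();
    }
}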

3) This create/close cycle continues for a while, until some issue on 
one node results in one of the 4 UDP channels suspecting its peer and 
all 4 UDP channels getting a new view:

13:23:58,150 INFO  [DefaultPartition] Suspected member: 127.0.0.1:40089
13:23:58,257 INFO  [GroupMember] 
org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener at 4cfcb4de 
got new view [127.0.0.2:42088|2] [127.0.0.2:42088], old view is 
[127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088]
13:23:58,257 INFO  [GroupMember] I am (127.0.0.2:42088)
13:23:58,265 INFO  [DefaultPartition] New cluster view for partition 
DefaultPartition (id: 2, delta: -1) : [127.0.0.2:1099]
13:23:58,275 INFO  [MessagingPostOffice] JBoss Messaging is failing over 
for failed node 0. If there are many messages to reload this may take 
some time...
13:23:58,442 INFO  [DefaultPartition] I am (127.0.0.2:1099) received 
membershipChanged event:
13:23:58,442 INFO  [DefaultPartition] Dead members: 1 ([127.0.0.1:1099])
13:23:58,442 INFO  [DefaultPartition] New Members : 0 ([])
13:23:58,442 INFO  [DefaultPartition] All Members : 1 ([127.0.0.2:1099])
13:23:58,581 INFO  [RPCManagerImpl] Received new cluster view: 
[127.0.0.2:42088|2] [127.0.0.2:42088]
13:23:58,620 INFO  [RPCManagerImpl] Received new cluster view: 
[127.0.0.2:42088|2] [127.0.0.2:42088]
13:24:00,260 WARN  [GMS] 127.0.0.2:42088 failed to collect all ACKs (1) 
for mcasted view [127.0.0.2:42088|2] [127.0.0.2:42088] after 2000ms, 
missing ACKs from [127.0.0.2:42088] (received=[]), 
local_addr=127.0.0.2:42088
13:24:00,661 INFO  [MessagingPostOffice] JBoss Messaging failover completed
13:24:00,662 INFO  [GroupMember] Dead members: 1 ([127.0.0.1:40089])
13:24:00,662 INFO  [GroupMember] All Members : 1 ([127.0.0.2:42088])

4) At some point the suspected node realizes it has been suspected, and 
the 4 UDP channels decide to reconnect: two at about the same time and 
the other two within 55 ms of the first two (a sketch of this 
shun-and-reconnect path follows the log excerpt):

13:24:39,294 WARN  [FD] I was suspected by 127.0.0.2:42088; ignoring the 
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,295 WARN  [FD] I was suspected by 127.0.0.2:42088; ignoring the 
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,315 ERROR [ConnectionTable] failed sending data to 
127.0.0.2:7900: java.net.SocketException: Broken pipe
13:24:39,325 WARN  [GMS] I (127.0.0.1:40089) am not a member of view 
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the 
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view 
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
13:24:39,325 WARN  [GMS] I (127.0.0.1:40089) am not a member of view 
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the 
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view 
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
13:24:39,339 WARN  [GMS] I (127.0.0.1:40089) am not a member of view 
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the 
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view 
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
13:24:39,359 WARN  [FD] I was suspected by 127.0.0.2:42088; ignoring the 
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,360 WARN  [FD] I was suspected by 127.0.0.2:42088; ignoring the 
SUSPECT message and sending back a HEARTBEAT_ACK
13:24:39,380 WARN  [GMS] I (127.0.0.1:40089) am not a member of view 
[127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the 
group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view 
is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
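
For context on what "decide to reconnect" means here: a minimal sketch 
of the shun-and-reconnect path, assuming the channels are opened with 
JGroups' AUTO_RECONNECT option. That option is what hands the reconnect 
to the JChannel$CloserThread seen in the stack traces further down; the 
setup code itself is illustrative, not the actual AS code:

import org.jgroups.Channel;
import org.jgroups.JChannel;

public class AutoReconnectSketch {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();   // default stack; the AS channels use their own configs
        // With AUTO_RECONNECT set, being shunned (the "shunning myself and
        // leaving the group" warnings above) makes the CloserThread close the
        // channel and call connect() again with the same cluster name -- the
        // connect() that later fails with "local_addr is null".
        ch.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
        ch.connect("DefaultPartition");
    }
}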


5) This is where things break down. It looks like one of the 4 channels 
is able to get a local_addr, but the others get null. Perhaps a 
concurrency bug in Configurator.startProtocolStack? (Note that the first 
2 entries below relate to the independent TCP channel used by JBM; I 
include them for completeness.)

13:24:39,568 ERROR [ConnectionTable] failed sending data to 
127.0.0.2:7900: java.net.SocketException: Broken pipe
13:24:39,959 ERROR [ConnectionTable] failed sending data to 
127.0.0.2:7900: java.net.SocketException: Broken pipe
13:24:40,833 WARN  [GMS]
13:24:40,834 WARN  [GMS]
13:24:41,362 INFO  [STDOUT]
---------------------------------------------------------
GMS: address is 127.0.0.1:56000 (cluster=DefaultPartition-SFSBCache)
---------------------------------------------------------
13:24:41,362 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
	at org.jgroups.JChannel.startStack(JChannel.java:1631)
	at org.jgroups.JChannel.connect(JChannel.java:366)
	at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
	at org.jgroups.JChannel.startStack(JChannel.java:1631)
	at org.jgroups.JChannel.connect(JChannel.java:366)
	at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
13:24:41,488 INFO  [RPCManagerImpl] Received new cluster view: 
[127.0.0.2:42088|3] [127.0.0.2:42088, 127.0.0.1:56000]
13:24:41,564 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,564 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
	at org.jgroups.JChannel.startStack(JChannel.java:1631)
	at org.jgroups.JChannel.connect(JChannel.java:366)
	at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
13:24:41,927 INFO  [LegacyStateTransferIntegrator] Using version 4096
13:24:45,941 WARN  [GMS] Merge aborted. Merge leader did not get 
MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
13:24:45,993 WARN  [GMS] Merge aborted. Merge leader did not get 
MergeData from all subgroup coordinators [127.0.0.2:7900, 127.0.0.1:7900]
13:24:45,995 WARN  [GMS] merge was supposed to be cancelled at merge 
participant 127.0.0.1:7900 (merge_id=[127.0.0.1:7900|1246037080987]), 
but it is not since merge ids do not match
13:24:46,029 WARN  [GMS] Merge aborted. Merge leader did not get 
MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
13:24:46,029 WARN  [GMS] resume([127.0.0.1:40089|1246037080833]) does 
not match null

6) Thereafter, the CloserThread loops continually: it tries to connect a 
channel, fails because local_addr == null, waits 1 sec and tries again, 
and this goes on for hours (a rough sketch of the loop follows the log 
excerpt):

13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
	at org.jgroups.JChannel.startStack(JChannel.java:1631)
	at org.jgroups.JChannel.connect(JChannel.java:366)
	at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
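
Roughly, the cycle looks like the following sketch (a plain 
connect/catch/sleep loop written for illustration; it is not the actual 
CloserThread source):

import org.jgroups.Channel;
import org.jgroups.ChannelException;

public class ReconnectLoopSketch {
    // Each pass calls connect(), which throws from startStack() because
    // local_addr is still null, then pauses and tries again -- indefinitely.
    static void keepReconnecting(Channel channel, String cluster) throws InterruptedException {
        while (!channel.isConnected()) {
            try {
                channel.connect(cluster);
            }
            catch (ChannelException e) {
                Thread.sleep(1000);   // nominal 1-second pause; see the note below on
                                      // why the observed period is actually ~30 seconds
            }
        }
    }
}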


As I said, I suspect a concurrency issue that results in one channel 
getting a local_addr while the other channels on the same shared 
transport get null.
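
To make that suspicion concrete, here is a purely hypothetical 
check-then-act sketch of the kind of race that could produce this 
pattern. The map, method and key are made up for illustration and are 
not taken from Configurator.startProtocolStack; the idea is just that if 
the lookup-and-create of the shared transport isn't atomic, a channel 
can end up attached to a transport instance that never delivers it a 
local address:

import java.util.HashMap;
import java.util.Map;

public class SharedTransportRaceSketch {
    private final Map<String, Object> transports = new HashMap<String, Object>();

    // Called by each connecting channel. The get() and put() are not atomic,
    // so two channels reconnecting at nearly the same time (as in step 4)
    // can both see null and each create a transport; only the last put() wins.
    Object getOrCreateTransport(String singletonName) {
        Object tp = transports.get(singletonName);   // check
        if (tp == null) {
            tp = new Object();                       // stand-in for starting a UDP transport
            transports.put(singletonName, tp);       // act
        }
        return tp;
    }
}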

Hmm, just noticed: the looping in #6 above repeats every 30 secs, not 
every second. That matches the timeout in JChannel.startStack():

/* Wait LOCAL_ADDR_TIMEOUT milliseconds for local_addr to have a 
non-null value (set by SET_LOCAL_ADDRESS) */
local_addr=local_addr_promise.getResult(LOCAL_ADDR_TIMEOUT);
if(local_addr == null) {
     log.fatal("local_addr is null; cannot connect");
     throw new ChannelException("local_addr is null");
}

So it seems the recurring problem is that local_addr_promise isn't 
getting set at all, not that it is getting set with a null value. 
*However*, the *first* set of FATAL logging occurs just a couple of 
seconds after GMS logs that it is going to reconnect, so that first 
local_addr == null failure cannot be due to the timeout.
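
To see why a never-set promise produces exactly this behaviour, here is 
a minimal sketch against org.jgroups.util.Promise, relying on 
getResult(timeout) returning null on timeout as the null check in the 
snippet above implies. The 30000 ms value is an assumption that matches 
the observed 30-second period; the actual LOCAL_ADDR_TIMEOUT value isn't 
visible in the log:

import org.jgroups.util.Promise;

public class LocalAddrTimeoutSketch {
    public static void main(String[] args) {
        // Nothing ever calls setResult() here, standing in for the case where
        // SET_LOCAL_ADDRESS never reaches this channel. getResult() blocks for
        // the full timeout and then returns null, so startStack() would throw
        // the "local_addr is null" ChannelException seen in the log.
        Promise local_addr_promise = new Promise();
        Object local_addr = local_addr_promise.getResult(30000);
        System.out.println("local_addr = " + local_addr);   // prints "local_addr = null"
    }
}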



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira