[jboss-jira] [JBoss JIRA] Commented: (JGRP-1006) Shared transport: shunned multiple channels can not reconnect
Vladimir Blagojevic (JIRA)
jira-events@lists.jboss.org
Wed Jul 1 10:15:52 EDT 2009
[ https://jira.jboss.org/jira/browse/JGRP-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12474459#action_12474459 ]
Vladimir Blagojevic commented on JGRP-1006:
-------------------------------------------
Added test method: SharedTransportTest#testShunReconnect()
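The commit itself is not reproduced here; the following is only a rough
sketch of the scenario such a test has to cover, against the JGroups 2.6
API (class name, stack string, and timings are illustrative, not the
actual test). Injecting Event.EXIT mimics a shun, since that is the event
GMS sends up when a channel is shunned, and it drives the same
CloserThread close-and-reconnect path described below:

import org.jgroups.Channel;
import org.jgroups.Event;
import org.jgroups.JChannel;
import org.jgroups.util.Util;

public class ShunReconnectSketch {
    // Both channels share one UDP transport via singleton_name.
    static final String PROPS =
        "UDP(singleton_name=shared):PING:FD:pbcast.NAKACK:UNICAST:pbcast.STABLE:pbcast.GMS";

    public static void main(String[] args) throws Exception {
        JChannel a = new JChannel(PROPS);
        JChannel b = new JChannel(PROPS);
        a.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
        b.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
        a.connect("A");
        b.connect("B");

        // Shun both channels at once, as happened to the 4 UDP channels
        // in the report below.
        a.up(new Event(Event.EXIT));
        b.up(new Event(Event.EXIT));

        Util.sleep(5000); // give the CloserThreads time to reconnect

        // Before the fix, one channel could come back while the others
        // looped forever on "local_addr is null".
        assert a.isConnected() && a.getLocalAddress() != null;
        assert b.isConnected() && b.getLocalAddress() != null;
    }
}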
> Shared transport: shunned multiple channels can not reconnect
> --------------------------------------------------------------
>
> Key: JGRP-1006
> URL: https://jira.jboss.org/jira/browse/JGRP-1006
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6
> Reporter: Vladimir Blagojevic
> Assignee: Vladimir Blagojevic
> Fix For: 2.6.11
>
>
> Brian Stansberry said:
> One of the EJB3/AS integration testsuite runs hit a JGroups problem last
> night. It *looks like* some issue between the two AS nodes (127.0.0.1 and
> 127.0.0.2) in the test caused the channels on one node to be suspected
> and to attempt a reconnect. That reconnect did not go well -- and is the
> subject of this thread.
> I have the server logs from both, which record at INFO or above; if
> anyone wants them, ping me. Here are the highlights:
> 1) Both nodes connect 4 channels as part of AS startup: 3 using a shared
> transport UDP stack (for JBC, HAPartition, JBM) and one using a separate
> TCP stack for JBM.
> 2) A series of test cases then executes. Each deploys an EJB3 jar on each
> node in turn, which causes another JBC channel to be created and connected
> at test start, then disconnected and closed at test end. This channel uses
> the same shared transport as the other 3, so during each test there are
> four channels using that shared transport.
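As an aside on the setup in points 1 and 2: since JGroups 2.5, channels
share a transport when their transport protocol carries the same
singleton_name. A minimal sketch of that arrangement; the stack string is
illustrative, not the actual AS configuration:

import org.jgroups.JChannel;

public class SharedTransportSetup {
    // All channels built from this string share one UDP transport
    // instance, keyed by singleton_name="shared".
    static final String PROPS =
        "UDP(singleton_name=shared):PING:FD:pbcast.NAKACK:UNICAST:pbcast.STABLE:pbcast.GMS";

    public static void main(String[] args) throws Exception {
        JChannel jbc = new JChannel(PROPS); // JBC
        JChannel ha  = new JChannel(PROPS); // HAPartition
        JChannel jbm = new JChannel(PROPS); // JBM control
        // Different cluster names, one shared transport underneath.
        jbc.connect("jbc");
        ha.connect("ha-partition");
        jbm.connect("jbm-ctrl");
    }
}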
> 3) This create/close cycle continues for a while, until on one node some
> issue results in one of the 4 UDP channels suspecting its peer and all 4
> UDP channels getting a new view:
> 13:23:58,150 INFO [DefaultPartition] Suspected member: 127.0.0.1:40089
> 13:23:58,257 INFO [GroupMember]
> org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@4cfcb4de
> got new view [127.0.0.2:42088|2] [127.0.0.2:42088], old view is
> [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088]
> 13:23:58,257 INFO [GroupMember] I am (127.0.0.2:42088)
> 13:23:58,265 INFO [DefaultPartition] New cluster view for partition
> DefaultPartition (id: 2, delta: -1) : [127.0.0.2:1099]
> 13:23:58,275 INFO [MessagingPostOffice] JBoss Messaging is failing over
> for failed node 0. If there are many messages to reload this may take
> some time...
> 13:23:58,442 INFO [DefaultPartition] I am (127.0.0.2:1099) received
> membershipChanged event:
> 13:23:58,442 INFO [DefaultPartition] Dead members: 1 ([127.0.0.1:1099])
> 13:23:58,442 INFO [DefaultPartition] New Members : 0 ([])
> 13:23:58,442 INFO [DefaultPartition] All Members : 1 ([127.0.0.2:1099])
> 13:23:58,581 INFO [RPCManagerImpl] Received new cluster view:
> [127.0.0.2:42088|2] [127.0.0.2:42088]
> 13:23:58,620 INFO [RPCManagerImpl] Received new cluster view:
> [127.0.0.2:42088|2] [127.0.0.2:42088]
> 13:24:00,260 WARN [GMS] 127.0.0.2:42088 failed to collect all ACKs (1)
> for mcasted view [127.0.0.2:42088|2] [127.0.0.2:42088] after 2000ms,
> missing ACKs from [127.0.0.2:42088] (received=[]),
> local_addr=127.0.0.2:42088
> 13:24:00,661 INFO [MessagingPostOffice] JBoss Messaging failover completed
> 13:24:00,662 INFO [GroupMember] Dead members: 1 ([127.0.0.1:40089])
> 13:24:00,662 INFO [GroupMember] All Members : 1 ([127.0.0.2:42088])
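Each of the lines above comes from an application-level membership
callback. For reference, this is roughly how a channel user observes
suspicions and new views in the JGroups 2.6 API; the class name is
invented:

import org.jgroups.Address;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ViewWatcher extends ReceiverAdapter {
    public void suspect(Address suspected) {
        // Corresponds to "Suspected member: 127.0.0.1:40089" above
        System.out.println("Suspected member: " + suspected);
    }

    public void viewAccepted(View new_view) {
        // Corresponds to the "got new view [...]" lines above
        System.out.println("got new view " + new_view);
    }
}
// usage: channel.setReceiver(new ViewWatcher());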
> 4) At some point the suspected node realizes it is suspected and the 4
> UDP channels decide to reconnect: two at about the same time, and the
> other two within 55 ms of the first two:
> 13:24:39,294 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,295 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,315 ERROR [ConnectionTable] failed sending data to
> 127.0.0.2:7900: java.net.SocketException: Broken pipe
> 13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
> 13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
> 13:24:39,339 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
> 13:24:39,359 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,360 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,380 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
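The shun itself is visible to the application through the JGroups 2.6
ChannelListener callbacks (channelShunned()/channelReconnected() were
later removed along with shunning). A sketch with an invented class name:

import org.jgroups.Address;
import org.jgroups.Channel;
import org.jgroups.ChannelListener;

public class ShunWatcher implements ChannelListener {
    public void channelConnected(Channel ch)     { System.out.println("connected"); }
    public void channelDisconnected(Channel ch)  { System.out.println("disconnected"); }
    public void channelClosed(Channel ch)        { System.out.println("closed"); }
    public void channelShunned()                 { System.out.println("shunned; CloserThread will reconnect"); }
    public void channelReconnected(Address addr) { System.out.println("reconnected as " + addr); }
}
// usage: channel.addChannelListener(new ShunWatcher());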
> 5) This is where things break down. It looks like one of the 4 channels
> is able to get a local_addr, but the others get null. Perhaps a
> concurrency bug in Configurator.startProtocolStack? (Note that the first
> 2 lines below relate to the independent TCP channel used by JBM; I
> include them for completeness.)
> 13:24:39,568 ERROR [ConnectionTable] failed sending data to
> 127.0.0.2:7900: java.net.SocketException: Broken pipe
> 13:24:39,959 ERROR [ConnectionTable] failed sending data to
> 127.0.0.2:7900: java.net.SocketException: Broken pipe
> 13:24:40,833 WARN [GMS]
> 13:24:40,834 WARN [GMS]
> 13:24:41,362 INFO [STDOUT]
> ---------------------------------------------------------
> GMS: address is 127.0.0.1:56000 (cluster=DefaultPartition-SFSBCache)
> ---------------------------------------------------------
> 13:24:41,362 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> 13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> 13:24:41,488 INFO [RPCManagerImpl] Received new cluster view:
> [127.0.0.2:42088|3] [127.0.0.2:42088, 127.0.0.1:56000]
> 13:24:41,564 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,564 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> 13:24:41,927 INFO [LegacyStateTransferIntegrator] Using version 4096
> 13:24:45,941 WARN [GMS] Merge aborted. Merge leader did not get
> MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
> 13:24:45,993 WARN [GMS] Merge aborted. Merge leader did not get
> MergeData from all subgroup coordinators [127.0.0.2:7900, 127.0.0.1:7900]
> 13:24:45,995 WARN [GMS] merge was supposed to be cancelled at merge
> participant 127.0.0.1:7900 (merge_id=[127.0.0.1:7900|1246037080987]),
> but it is not since merge ids do not match
> 13:24:46,029 WARN [GMS] Merge aborted. Merge leader did not get
> MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
> 13:24:46,029 WARN [GMS] resume([127.0.0.1:40089|1246037080833]) does
> not match null
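To make the suspected race concrete: if four stacks reconnect over one
shared transport but the local address is only handed out once, three
waiters come up empty. A toy illustration of that shape only, not JGroups
code:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class LostAddressRace {
    public static void main(String[] args) {
        final AtomicBoolean addressDelivered = new AtomicBoolean(false);
        final CountDownLatch go = new CountDownLatch(1);
        for (int i = 0; i < 4; i++) {            // 4 channels reconnecting
            final int id = i;
            new Thread(new Runnable() {
                public void run() {
                    try { go.await(); } catch (InterruptedException e) { return; }
                    // Only one thread "wins" the address; the rest see
                    // null, analogous to the FATAL lines above.
                    boolean won = addressDelivered.compareAndSet(false, true);
                    System.out.println("channel-" + id + ": local_addr is " +
                                       (won ? "set" : "null"));
                }
            }).start();
        }
        go.countDown();                          // reconnect all at once
    }
}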
> 6) Thereafter, the CloserThread loops continually: it tries to connect a
> channel, fails because local_addr == null, waits 1 sec, and tries again;
> this goes on for hours:
> 13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> As I said, I suspect a concurrency issue that results in one channel
> getting a local_addr while the others on the same shared transport
> thereafter get null.
> Hmm, just noticed: the looping in #6 above repeats every 30 secs, not
> every second. That interval matches the timeout in JChannel.startStack():
>     /* Wait LOCAL_ADDR_TIMEOUT milliseconds for local_addr to have a
>        non-null value (set by SET_LOCAL_ADDRESS) */
>     local_addr=local_addr_promise.getResult(LOCAL_ADDR_TIMEOUT);
>     if(local_addr == null) {
>         log.fatal("local_addr is null; cannot connect");
>         throw new ChannelException("local_addr is null");
>     }
> So it seems the recurring problem is that local_addr_promise isn't
> getting set at all, not that it is getting set with a null value.
> *However*, the *first* round of FATAL logging occurs just a couple of
> seconds after GMS logs that it is going to reconnect, so that first
> occurrence of local_addr == null is not due to a timeout.
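To see why a promise that is never set produces exactly this 30-second
cadence, here is a simplified stand-in for org.jgroups.util.Promise (the
real class does more): getResult(timeout) blocks until setResult() is
called or the timeout elapses, and returns null in the latter case, which
startStack() then turns into the ChannelException seen above.

public class PromiseSketch {
    private Object result;

    // Called when the transport pushes SET_LOCAL_ADDRESS up the stack.
    public synchronized void setResult(Object r) {
        result = r;
        notifyAll();
    }

    // Returns the result, or null if the timeout elapses first.
    public synchronized Object getResult(long timeout) {
        long deadline = System.currentTimeMillis() + timeout;
        while (result == null) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0)
                return null;   // timed out: local_addr stays null
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                return null;
            }
        }
        return result;
    }
}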
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira