[jboss-jira] [JBoss JIRA] Commented: (JGRP-1006) Shared transport: shunned multiple channels can not reconnect
Vladimir Blagojevic (JIRA)
jira-events@lists.jboss.org
Wed Jul 1 10:15:52 EDT 2009
[ https://jira.jboss.org/jira/browse/JGRP-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12474459#action_12474459 ]
Vladimir Blagojevic commented on JGRP-1006:
-------------------------------------------
Added test method: SharedTransportTest#testShunReconnect()
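The commit itself is not reproduced here; the following is only a rough
sketch of the scenario such a test has to cover, against the JGroups 2.6
API (class name, stack string, and timings are illustrative, not the
actual test). Injecting Event.EXIT mimics a shun, since that is the event
GMS sends up when a channel is shunned, and it drives the same
CloserThread close-and-reconnect path described below:

import org.jgroups.Channel;
import org.jgroups.Event;
import org.jgroups.JChannel;
import org.jgroups.util.Util;

public class ShunReconnectSketch {
    // Both channels share one UDP transport via singleton_name.
    static final String PROPS =
        "UDP(singleton_name=shared):PING:FD:pbcast.NAKACK:UNICAST:pbcast.STABLE:pbcast.GMS";

    public static void main(String[] args) throws Exception {
        JChannel a = new JChannel(PROPS);
        JChannel b = new JChannel(PROPS);
        a.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
        b.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
        a.connect("A");
        b.connect("B");

        // Shun both channels at once, as happened to the 4 UDP channels
        // in the report below.
        a.up(new Event(Event.EXIT));
        b.up(new Event(Event.EXIT));

        Util.sleep(5000); // give the CloserThreads time to reconnect

        // Before the fix, one channel could come back while the others
        // looped forever on "local_addr is null".
        assert a.isConnected() && a.getLocalAddress() != null;
        assert b.isConnected() && b.getLocalAddress() != null;
    }
}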
> Shared transport: shunned multiple channels can not reconnect
> --------------------------------------------------------------
>
> Key: JGRP-1006
> URL: https://jira.jboss.org/jira/browse/JGRP-1006
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6
> Reporter: Vladimir Blagojevic
> Assignee: Vladimir Blagojevic
> Fix For: 2.6.11
>
>
> Brian Stansberry said:
> One of the EJB3/AS integration testsuite runs hit a JGroups problem last
> night. It *looks like* some issue between the two AS nodes (127.0.0.1 and
> 127.0.0.2) in the test caused the channels on one node to be suspected
> and to attempt a reconnect. That reconnect did not go well -- and is the
> subject of this thread.
> I have the server logs from both, which record at INFO or above; if
> anyone wants them, ping me. Here are the highlights:
> 1) Both nodes connect 4 channels as part of AS startup: 3 using a shared
> transport UDP stack (for JBC, HAPartition, JBM) and one using a separate
> TCP stack for JBM.
> 2) A series of test cases then executes. Each deploys an EJB3 jar on each
> node in turn, which causes another JBC channel to be created and connected
> at test start, then disconnected and closed at test end. This channel uses
> the same shared transport as the other 3, so during each test there are
> four channels using that shared transport.
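As an aside on the setup in points 1 and 2: since JGroups 2.5, channels
share a transport when their transport protocol carries the same
singleton_name. A minimal sketch of that arrangement; the stack string is
illustrative, not the actual AS configuration:

import org.jgroups.JChannel;

public class SharedTransportSetup {
    // All channels built from this string share one UDP transport
    // instance, keyed by singleton_name="shared".
    static final String PROPS =
        "UDP(singleton_name=shared):PING:FD:pbcast.NAKACK:UNICAST:pbcast.STABLE:pbcast.GMS";

    public static void main(String[] args) throws Exception {
        JChannel jbc = new JChannel(PROPS); // JBC
        JChannel ha  = new JChannel(PROPS); // HAPartition
        JChannel jbm = new JChannel(PROPS); // JBM control
        // Different cluster names, one shared transport underneath.
        jbc.connect("jbc");
        ha.connect("ha-partition");
        jbm.connect("jbm-ctrl");
    }
}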
> 3) This create/close cycle continues for a while, until on one node some
> issue results in one of the 4 UDP channels suspecting its peer and all 4
> UDP channels getting a new view:
> 13:23:58,150 INFO [DefaultPartition] Suspected member: 127.0.0.1:40089
> 13:23:58,257 INFO [GroupMember]
> org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@4cfcb4de
> got new view [127.0.0.2:42088|2] [127.0.0.2:42088], old view is
> [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088]
> 13:23:58,257 INFO [GroupMember] I am (127.0.0.2:42088)
> 13:23:58,265 INFO [DefaultPartition] New cluster view for partition
> DefaultPartition (id: 2, delta: -1) : [127.0.0.2:1099]
> 13:23:58,275 INFO [MessagingPostOffice] JBoss Messaging is failing over
> for failed node 0. If there are many messages to reload this may take
> some time...
> 13:23:58,442 INFO [DefaultPartition] I am (127.0.0.2:1099) received
> membershipChanged event:
> 13:23:58,442 INFO [DefaultPartition] Dead members: 1 ([127.0.0.1:1099])
> 13:23:58,442 INFO [DefaultPartition] New Members : 0 ([])
> 13:23:58,442 INFO [DefaultPartition] All Members : 1 ([127.0.0.2:1099])
> 13:23:58,581 INFO [RPCManagerImpl] Received new cluster view:
> [127.0.0.2:42088|2] [127.0.0.2:42088]
> 13:23:58,620 INFO [RPCManagerImpl] Received new cluster view:
> [127.0.0.2:42088|2] [127.0.0.2:42088]
> 13:24:00,260 WARN [GMS] 127.0.0.2:42088 failed to collect all ACKs (1)
> for mcasted view [127.0.0.2:42088|2] [127.0.0.2:42088] after 2000ms,
> missing ACKs from [127.0.0.2:42088] (received=[]),
> local_addr=127.0.0.2:42088
> 13:24:00,661 INFO [MessagingPostOffice] JBoss Messaging failover completed
> 13:24:00,662 INFO [GroupMember] Dead members: 1 ([127.0.0.1:40089])
> 13:24:00,662 INFO [GroupMember] All Members : 1 ([127.0.0.2:42088])
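Each of the lines above comes from an application-level membership
callback. For reference, this is roughly how a channel user observes
suspicions and new views in the JGroups 2.6 API; the class name is
invented:

import org.jgroups.Address;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ViewWatcher extends ReceiverAdapter {
    public void suspect(Address suspected) {
        // Corresponds to "Suspected member: 127.0.0.1:40089" above
        System.out.println("Suspected member: " + suspected);
    }

    public void viewAccepted(View new_view) {
        // Corresponds to the "got new view [...]" lines above
        System.out.println("got new view " + new_view);
    }
}
// usage: channel.setReceiver(new ViewWatcher());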
> 4) At some point the suspected node realizes it is suspected and the 4
> UDP channels decide to reconnect: two at about the same time, and the
> other two within 55 ms of the first two:
> 13:24:39,294 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,295 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,315 ERROR [ConnectionTable] failed sending data to
> 127.0.0.2:7900: java.net.SocketException: Broken pipe
> 13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
> 13:24:39,325 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
> 13:24:39,339 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
> 13:24:39,359 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,360 WARN [FD] I was suspected by 127.0.0.2:42088; ignoring the
> SUSPECT message and sending back a HEARTBEAT_ACK
> 13:24:39,380 WARN [GMS] I (127.0.0.1:40089) am not a member of view
> [127.0.0.2:42088|2] [127.0.0.2:42088], shunning myself and leaving the
> group (prev_members are [127.0.0.1:40089, 127.0.0.2:42088], current view
> is [127.0.0.1:40089|1] [127.0.0.1:40089, 127.0.0.2:42088])
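The shun itself is visible to the application through the JGroups 2.6
ChannelListener callbacks (channelShunned()/channelReconnected() were
later removed along with shunning). A sketch with an invented class name:

import org.jgroups.Address;
import org.jgroups.Channel;
import org.jgroups.ChannelListener;

public class ShunWatcher implements ChannelListener {
    public void channelConnected(Channel ch)     { System.out.println("connected"); }
    public void channelDisconnected(Channel ch)  { System.out.println("disconnected"); }
    public void channelClosed(Channel ch)        { System.out.println("closed"); }
    public void channelShunned()                 { System.out.println("shunned; CloserThread will reconnect"); }
    public void channelReconnected(Address addr) { System.out.println("reconnected as " + addr); }
}
// usage: channel.addChannelListener(new ShunWatcher());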
> 5) This is where things break down. It looks like one of the 4 channels
> is able to get a local_addr, but the others get null. Perhaps a
> concurrency bug in Configurator.startProtocolStack? (Note that the first
> 2 lines below relate to the independent TCP channel used by JBM; I
> include them for completeness.)
> 13:24:39,568 ERROR [ConnectionTable] failed sending data to
> 127.0.0.2:7900: java.net.SocketException: Broken pipe
> 13:24:39,959 ERROR [ConnectionTable] failed sending data to
> 127.0.0.2:7900: java.net.SocketException: Broken pipe
> 13:24:40,833 WARN [GMS]
> 13:24:40,834 WARN [GMS]
> 13:24:41,362 INFO [STDOUT]
> ---------------------------------------------------------
> GMS: address is 127.0.0.1:56000 (cluster=DefaultPartition-SFSBCache)
> ---------------------------------------------------------
> 13:24:41,362 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> 13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> 13:24:41,488 INFO [RPCManagerImpl] Received new cluster view:
> [127.0.0.2:42088|3] [127.0.0.2:42088, 127.0.0.1:56000]
> 13:24:41,564 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,564 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> 13:24:41,927 INFO [LegacyStateTransferIntegrator] Using version 4096
> 13:24:45,941 WARN [GMS] Merge aborted. Merge leader did not get
> MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
> 13:24:45,993 WARN [GMS] Merge aborted. Merge leader did not get
> MergeData from all subgroup coordinators [127.0.0.2:7900, 127.0.0.1:7900]
> 13:24:45,995 WARN [GMS] merge was supposed to be cancelled at merge
> participant 127.0.0.1:7900 (merge_id=[127.0.0.1:7900|1246037080987]),
> but it is not since merge ids do not match
> 13:24:46,029 WARN [GMS] Merge aborted. Merge leader did not get
> MergeData from all subgroup coordinators [127.0.0.2:42088, 127.0.0.1:40089]
> 13:24:46,029 WARN [GMS] resume([127.0.0.1:40089|1246037080833]) does
> not match null
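To make the suspected race concrete: if four stacks reconnect over one
shared transport but the local address is only handed out once, three
waiters come up empty. A toy illustration of that shape only, not JGroups
code:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class LostAddressRace {
    public static void main(String[] args) {
        final AtomicBoolean addressDelivered = new AtomicBoolean(false);
        final CountDownLatch go = new CountDownLatch(1);
        for (int i = 0; i < 4; i++) {            // 4 channels reconnecting
            final int id = i;
            new Thread(new Runnable() {
                public void run() {
                    try { go.await(); } catch (InterruptedException e) { return; }
                    // Only one thread "wins" the address; the rest see
                    // null, analogous to the FATAL lines above.
                    boolean won = addressDelivered.compareAndSet(false, true);
                    System.out.println("channel-" + id + ": local_addr is " +
                                       (won ? "set" : "null"));
                }
            }).start();
        }
        go.countDown();                          // reconnect all at once
    }
}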
> 6) Thereafter, the CloserThread loops continually: it tries to connect a
> channel, fails because local_addr == null, waits 1 sec, and tries again;
> this goes on for hours:
> 13:24:41,363 FATAL [JChannel] local_addr is null; cannot connect
> 13:24:41,365 ERROR [JChannel] failure reconnecting to channel, retrying
> org.jgroups.ChannelException: local_addr is null
> at org.jgroups.JChannel.startStack(JChannel.java:1631)
> at org.jgroups.JChannel.connect(JChannel.java:366)
> at org.jgroups.JChannel$CloserThread.run(JChannel.java:2046)
> As I said, I suspect a concurrency issue that results in one channel
> getting a local_addr while the others on the same shared transport
> thereafter get null.
> Hmm, just noticed: the looping in #6 above repeats every 30 secs, not
> every second. That interval matches the timeout in JChannel.startStack():
>     /* Wait LOCAL_ADDR_TIMEOUT milliseconds for local_addr to have a
>        non-null value (set by SET_LOCAL_ADDRESS) */
>     local_addr=local_addr_promise.getResult(LOCAL_ADDR_TIMEOUT);
>     if(local_addr == null) {
>         log.fatal("local_addr is null; cannot connect");
>         throw new ChannelException("local_addr is null");
>     }
> So it seems the recurring problem is that local_addr_promise isn't
> getting set at all, not that it is getting set with a null value.
> *However*, the *first* round of FATAL logging occurs just a couple of
> seconds after GMS logs that it is going to reconnect, so that first
> occurrence of local_addr == null is not due to a timeout.
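To see why a promise that is never set produces exactly this 30-second
cadence, here is a simplified stand-in for org.jgroups.util.Promise (the
real class does more): getResult(timeout) blocks until setResult() is
called or the timeout elapses, and returns null in the latter case, which
startStack() then turns into the ChannelException seen above.

public class PromiseSketch {
    private Object result;

    // Called when the transport pushes SET_LOCAL_ADDRESS up the stack.
    public synchronized void setResult(Object r) {
        result = r;
        notifyAll();
    }

    // Returns the result, or null if the timeout elapses first.
    public synchronized Object getResult(long timeout) {
        long deadline = System.currentTimeMillis() + timeout;
        while (result == null) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0)
                return null;   // timed out: local_addr stays null
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                return null;
            }
        }
        return result;
    }
}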
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira