[JBoss JIRA] (JGRP-1427) Race conditions during TCP start up cause broken connections and cluster-wide slow down
by Jay Guidos (JIRA)
Jay Guidos created JGRP-1427:
--------------------------------
Summary: Race conditions during TCP start up cause broken connections and cluster-wide slow down
Key: JGRP-1427
URL: https://issues.jboss.org/browse/JGRP-1427
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.5
Reporter: Jay Guidos
Assignee: Bela Ban
I actually found this on 2.4.1, but I can reproduce it on the Git master.
When starting our cluster of 18 nodes using TCP, roughly one start in 20 leaves a node with very slow cluster response. Trace logging eventually revealed that, on random occasions, one of the TCP connections to a sister node would start and then immediately terminate. From that point on, every cluster broadcast was subject to the response timeout (60 seconds for us).
The problem is in org.jgroups.blocks.BasicConnectionTable. There are a number of non-thread-safe field reference updates during startup; in particular, in BasicConnectionTable.Sender.start() the 'senderThread' field is updated in one thread and immediately referenced in the daughter thread's invocation of BasicConnectionTable.Sender.run().
If the timing works out wrong, 'senderThread' is set in the L2 cache of the PingSender thread but not yet updated in the L2 cache used by ConnectionTable.Connection.Sender, and hence the run() method exits immediately. Here is the smoking gun from a trace log:
{noformat}
22:18:22,985 INFO [org.jgroups.blocks.ConnectionTable] {main} server socket created on 127.0.0.1:7800
22:18:22,997 DEBUG [org.jgroups.blocks.MessageDispatcher$ProtocolAdapter] {main} setting local_addr (null) to 127.0.0.1:7800
22:18:23,013 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Receiver started
22:18:23,013 INFO [org.jgroups.blocks.ConnectionTable] {PingSender} created socket to 127.0.0.1:7801
22:18:23,015 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Sender thread started
22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Receiver started
22:18:23,018 INFO [org.jgroups.blocks.ConnectionTable] {PingSender} created socket to 127.0.0.1:7802
22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {ConnectionTable.Connection.Sender [127.0.0.1:44710 - 127.0.0.1:7802]} ConnectionTable.Connection.Sender thread terminated
22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Sender thread started
22:18:23,019 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Receiver started
2
{noformat}
You can see that at 22:18:23,018 the PingSender thread created a Sender, and in the same millisecond the ConnectionTable.Connection.Sender thread killed it off.
The fix is easy: just convert 'senderThread' to an AtomicReference. This is very old code, and I am not sure why we were the first to report this. Perhaps it is because our application is heavily multithreaded and we run on servers with 16 physical CPU cores. There is a very high chance that PingSender and ConnectionTable.Connection.Sender were not bound to the same core, and hence their L2 caches had significant time intervals where they were out of sync. There were a number of other sloppy references and updates to variables used for thread control in BasicConnectionTable; I think this code could benefit from a review for that kind of thing.
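To make the proposed change concrete, here is a minimal sketch of a sender loop whose controlling thread reference is held in an AtomicReference. It is not the JGroups source: the class name SenderSketch, the iterations counter, and the runOnce() helper are all hypothetical, and the real Sender does socket I/O that is only hinted at in a comment here.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for BasicConnectionTable.Sender; not the JGroups source.
class SenderSketch implements Runnable {
    // Thread that currently owns the send loop; null means "stopped".
    private final AtomicReference<Thread> senderThread = new AtomicReference<>();
    private final AtomicInteger iterations = new AtomicInteger();

    void start() {
        Thread t = new Thread(this, "SenderSketch");
        // Publish the reference before starting, and only if no sender is running.
        if (senderThread.compareAndSet(null, t)) {
            t.start();
        }
    }

    void stop() {
        Thread t = senderThread.getAndSet(null);
        if (t == null) return;
        t.interrupt();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }

    @Override
    public void run() {
        // Keep looping while this thread is still the published sender.
        while (senderThread.get() == Thread.currentThread()) {
            iterations.incrementAndGet(); // placeholder for "send one queued message"
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                // Interrupted by stop(); the loop condition decides whether to exit.
            }
        }
    }

    int iterations() { return iterations.get(); }

    // Starts a sender, waits until its loop has run at least once, then stops it.
    static int runOnce() {
        SenderSketch s = new SenderSketch();
        s.start();
        while (s.iterations() == 0) Thread.yield();
        s.stop();
        return s.iterations();
    }
}
```

The point of the sketch is that run() compares against a reference that is read and written atomically, so the new thread cannot observe a stale value and fall straight out of its loop.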
Cheers!
Jay Guidos
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[JBoss JIRA] (JGRP-1431) GossipRouter drops members from the routing table
by Guy Golan (JIRA)
Guy Golan created JGRP-1431:
-------------------------------
Summary: GossipRouter drops members from the routing table
Key: JGRP-1431
URL: https://issues.jboss.org/browse/JGRP-1431
Project: JGroups
Issue Type: Bug
Affects Versions: 3.1
Reporter: Guy Golan
Assignee: Bela Ban
I have looked at GossipRouter.java and it seems like it has a bug that sometimes drops new members from the routing table.
The problem is caused by the interaction of two methods, "removeEntry" and "handleConnect".
Specifically, "removeEntry" contains the following code fragment:
if(map.isEmpty()) {
    routingTable.remove(group);
So, if "removeEntry" is executing and a concurrent "handleConnect" adds a member to the same group between "map.isEmpty()" and "routingTable.remove(group)", that member will be dropped from the routing table.
Note that in this case "handleConnect" still sends "CONNECT_OK" to the client.
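One way to close this kind of check-then-act window is to perform the emptiness check and the removal as a single atomic step per group. The following is a rough sketch under that assumption, not the GossipRouter code: the class RoutingTableSketch, its String-based member and address types, and the method addEntry (standing in for the part of "handleConnect" that registers a member) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simplification of a group -> (member -> address) routing table.
class RoutingTableSketch {
    private final ConcurrentHashMap<String, Map<String, String>> routingTable =
            new ConcurrentHashMap<>();

    // Analogue of handleConnect: add a member, creating the group map if needed.
    void addEntry(String group, String member, String address) {
        routingTable.compute(group, (g, map) -> {
            if (map == null) map = new ConcurrentHashMap<>();
            map.put(member, address);
            return map;
        });
    }

    // Analogue of removeEntry: remove a member and drop the group if it is now empty.
    // The check and the removal run inside one atomic update for this key, so a
    // concurrent addEntry on the same group runs entirely before or entirely after it.
    void removeEntry(String group, String member) {
        routingTable.computeIfPresent(group, (g, map) -> {
            map.remove(member);
            return map.isEmpty() ? null : map; // returning null removes the group
        });
    }

    boolean contains(String group, String member) {
        Map<String, String> map = routingTable.get(group);
        return map != null && map.containsKey(member);
    }

    boolean hasGroup(String group) {
        return routingTable.containsKey(group);
    }
}
```

ConcurrentHashMap.compute and computeIfPresent apply their function atomically for a given key, which is what removes the gap between the isEmpty() check and the remove() call. A coarser alternative would be to guard both methods with the same lock.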
[JBoss JIRA] Created: (AS7-1722) dropped message WARNING on node leaving a group
by Radoslav Husar (JIRA)
dropped message WARNING on node leaving a group
-----------------------------------------------
Key: AS7-1722
URL: https://issues.jboss.org/browse/AS7-1722
Project: Application Server 7
Issue Type: Bug
Components: Clustering
Affects Versions: 7.0.1.Final, 7.0.0.Final
Reporter: Radoslav Husar
Assignee: Paul Ferraro
When a node leaves the group, a warning is logged:
{code}
17:57:31,340 INFO [org.jboss.as.clustering.CoreGroupCommunicationService.lifecycle.web] (Incoming-13,web,rhusar-7808) New cluster view for partition web (id: 4, delta: -1, merge: false) : [rhusar-7808]
17:57:31,340 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-13,web,rhusar-7808) ISPN000094: Received new cluster view: [rhusar-7808|4] [rhusar-7808]
17:57:31,345 WARNING [org.jgroups.protocols.pbcast.NAKACK] (Incoming-15,web,rhusar-7808) rhusar-7808: dropped message from rhusar-41106 (not in table [rhusar-7808]), view=[rhusar-7808|4] [rhusar-7808]
{code}
because the member is removed before its last message is received:
{code}
17:57:31,286 INFO [org.jboss.as.clustering.infinispan.subsystem] Stopped repl cache from web container
17:57:31,290 INFO [jacorb.orb] prepare ORB for shutdown...
17:57:31,290 INFO [jacorb.orb] ORB going down...
17:57:31,305 INFO [jacorb.orb.iiop] Listener exited
17:57:31,310 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000080: Disconnecting and closing JGroups Channel
17:57:31,312 INFO [org.hornetq.core.server.impl.HornetQServerImpl] HornetQ Server version 2.2.7.Final (HQ_2_2_7_FINAL_AS7, 121) [8ee2c584-d740-11e0-ad66-0022fabb0b50] stopped
17:57:31,310 INFO [jacorb.orb] ORB shutdown complete
17:57:31,313 INFO [jacorb.orb] ORB run, exit
17:57:31,349 INFO [org.jboss.as.server.deployment] Stopped deployment SessionTest-2.0-SNAPSHOT.war in 197ms
17:57:31,636 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000082: Stopping the RpcDispatcher
17:57:31,637 INFO [com.arjuna.ats.jbossatx] ARJUNA32018: Destroying TransactionManagerService
17:57:31,638 INFO [com.arjuna.ats.jbossatx] ARJUNA32014: Stopping transaction recovery manager
{code}
[JBoss JIRA] (AS7-2407) Cannot connect to domain controller - more detail needed
by Kevin Barfield (Created) (JIRA)
Cannot connect to domain controller - more detail needed
--------------------------------------------------------
Key: AS7-2407
URL: https://issues.jboss.org/browse/AS7-2407
Project: Application Server 7
Issue Type: Bug
Components: Domain Management
Reporter: Kevin Barfield
Assignee: Brian Stansberry
More detail is needed when a host controller cannot connect to a domain controller. Several times we saw the "cannot connect to the domain controller" message when we knew the two servers could see each other. The host tried to connect 5-6 times and then shut down. There were no other messages with more detail on either the host or the domain controller, even at DEBUG logging level. If we stopped the domain controller while the host was trying to connect, we saw an immediate error message on the host.
[JBoss JIRA] Created: (JASSIST-44) Maven profiles for tools.jar should take account of Mac OS
by Martin Burger (JIRA)
Maven profiles for tools.jar should take account of Mac OS
----------------------------------------------------------
Key: JASSIST-44
URL: http://jira.jboss.com/jira/browse/JASSIST-44
Project: Javassist
Issue Type: Bug
Environment: Maven version: 2.0.8
Java version: 1.5.0_13
OS name: "mac os x" version: "10.4.11" arch: "i386" Family: "unix"
Reporter: Martin Burger
Assigned To: Shigeru Chiba
The pom.xml uses different profiles to add tools.jar to the dependencies. However, on Mac OS X and some free JDKs its contents are already included in the runtime, and no separate 'tools.jar' file exists. See: http://maven.apache.org/general.html#tools-jar-dependency
As soon as http://jira.codehaus.org/browse/MNG-3106 gets fixed, activation should look like the following example:
<activation>
  <jdk>1.6</jdk>
  <property>
    <name>java.vendor</name>
    <value>Sun Microsystems Inc.</value>
  </property>
</activation>
In the meantime
<profiles>
  <profile>
    <id>tools.jar</id>
    <activation>
      <property>
        <name>java.vendor</name>
        <value>Sun Microsystems Inc.</value>
      </property>
    </activation>
    <dependencies>
      <dependency>
        <groupId>com.sun</groupId>
        <artifactId>tools</artifactId>
        <version>1.6</version>
        <scope>system</scope>
        <optional>true</optional>
        <systemPath>${java.home}/../lib/tools.jar</systemPath>
      </dependency>
    </dependencies>
  </profile>
</profiles>
should do the job. It could be a permanent solution because the different profiles differ only in the version element.