[JBoss JIRA] Created: (JGRP-1361) NAKACK: use bigger timeouts for big retransmission tasks
by Bela Ban (JIRA)
NAKACK: use bigger timeouts for big retransmission tasks
--------------------------------------------------------
Key: JGRP-1361
URL: https://issues.jboss.org/browse/JGRP-1361
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.1
Oftentimes we receive messages out of order, e.g. because an OOB message follows a range of regular messages, but since the regular messages are bundled, they may arrive later than the OOB message. Say the regular messages are [10-30] and the OOB message is 31. Receiving #31 before #10-30 triggers the addition of a retransmission task for [10..30] to the retransmitter. The task usually goes off after some initial delay, say 500ms. If we receive [10..30] before the task goes off, it is cancelled. If we receive most of the messages in [10..30], only the still-missing messages are retransmitted when the task fires.
So in most cases, retransmission tasks never fire; they are cancelled beforehand.
However, sometimes we can receive a seqno which is far larger than the current highest_received seqno, e.g. highest_received=15000, seqno=20000. This means we now add a retransmission request for [15000-20000]. It is likely that the seqno was just received out of order, but it may trigger the (unneeded) retransmission of 5000 messages!
The suggested solution is therefore to increase the initial delay for large retransmission tasks, such that they execute a bit later. Of course, the underlying assumption is that most of the missing messages will arrive before the timeout goes off. If the 5000 messages are really lost, e.g. dropped by the IP stack or a switch, they will need to be retransmitted.
If we have an exponential_backoff of 500, the initial delay is 500ms. We could define a 'large' retransmission task as any task which asks for retransmission of more than 10% of the current retransmission table's size. From the delta and the current delay we could then compute an offset which is added to the initial delay (only the first time).
Alternatively, we could add this delta not just the first time a retransmission task is scheduled, but to every scheduling of the task, as long as the task is large.
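The scheme above can be sketched in Java. This is a minimal illustration, not the actual JGroups code; the class and parameter names (XmitDelayCalculator, largeThreshold) are hypothetical, and the exact formula for the offset is one plausible choice among several:

```java
// Hypothetical sketch: scale the initial retransmission delay by the size of
// the requested range relative to the current retransmission table's size.
public class XmitDelayCalculator {
    private final long initialDelay;      // e.g. 500 ms (first exponential_backoff stage)
    private final double largeThreshold;  // fraction of table size above which a task is 'large', e.g. 0.1

    public XmitDelayCalculator(long initialDelay, double largeThreshold) {
        this.initialDelay = initialDelay;
        this.largeThreshold = largeThreshold;
    }

    /** Delay for the first scheduling of a task covering rangeSize seqnos. */
    public long firstDelay(int rangeSize, int tableSize) {
        if (tableSize == 0 || rangeSize <= largeThreshold * tableSize)
            return initialDelay; // small task: normal initial delay
        // Large task: add an offset proportional to how far the requested
        // range exceeds the 'large' threshold, so the task fires later.
        double delta = rangeSize / (largeThreshold * tableSize); // > 1 for large tasks
        return (long) (initialDelay * (1 + delta));
    }
}
```

With these numbers, a request for 5000 seqnos against a 15000-entry table would get roughly a 4x initial delay, giving the out-of-order messages more time to arrive before any retransmission is requested.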
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[JBoss JIRA] Created: (JGRP-1362) NAKACK: second line of defense for requested retransmissions that are not found
by Bela Ban (JIRA)
NAKACK: second line of defense for requested retransmissions that are not found
-------------------------------------------------------------------------------
Key: JGRP-1362
URL: https://issues.jboss.org/browse/JGRP-1362
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 2.12.2, 3.1
When the original sender B is asked by A to retransmit message M, but doesn't have M in its retransmission table anymore, it should tell A; otherwise A will keep sending retransmission requests to B until A or B leaves.
This problem should have been fixed by JGRP-1251, but if it turns out it wasn't, then this JIRA is (1) a second line of defense to stop the endless retransmission requests and (2) will give us valuable diagnostic information to fix the underlying problem (should there still be one).
Problem:
- A has a NakReceiverWindow (NRW) of 50 (highest_delivered seqno) for B
- B's NRW, however, is 200. B garbage collected messages up to 150.
- When B sends message 201, A will ask B for retransmission of [51-200]
- B will retransmit messages [150-200], but it cannot send messages 51-149, as it doesn't have them anymore!
- A will add messages [150-200], but its NRW is still 50 (highest_delivered)
- A will continue asking B for messages [51-149] (it does have [150-201])
- This will go on forever, or until B or A leaves
SOLUTION:
- When the *original sender* B of message M receives a retransmission request for M (from A), and it doesn't have M in its retransmission table, it should send back a MSG_NOT_FOUND message to A including B's digest
- When A receives the MSG_NOT_FOUND message, it does the following:
- It logs its own NRW for B
- It logs B's digest
- It logs its digest history
(This information is valuable for investigating the underlying issue)
- Then A's NRW for B is adjusted:
- The highest_delivered seqno is set to B.digest.highest_delivered
- All messages in xmit_table below B.digest.highest_delivered are removed
- All retransmission tasks in the retransmitter <= B.digest.highest_delivered are cancelled and removed
(This will stop the retransmission)
Again, this is a second line of defense, which should never be used. If the underlying problem does occur, however, we'll have valuable information in the logs to diagnose what went wrong.
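The requester-side adjustment described above can be sketched as follows. All names here are hypothetical (this is not the actual NakReceiverWindow code), the logging of digests is omitted, and cancelling the retransmission tasks is left out because it depends on the Retransmitter implementation:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of A's window for B, adjusted on MSG_NOT_FOUND.
public class NakReceiverWindowSketch {
    private final ConcurrentSkipListMap<Long,Object> xmitTable = new ConcurrentSkipListMap<>();
    private long highestDelivered; // highest seqno delivered to the application

    public synchronized void add(long seqno, Object msg) {
        xmitTable.put(seqno, msg);
    }

    /** Called when the original sender replied MSG_NOT_FOUND with its digest. */
    public synchronized void handleMsgNotFound(long senderHighestDelivered) {
        if (senderHighestDelivered <= highestDelivered)
            return; // nothing to adjust
        // 1. set highest_delivered to the sender's value
        highestDelivered = senderHighestDelivered;
        // 2. remove all messages in xmit_table below that seqno
        //    (headMap() is exclusive, i.e. keys strictly below the bound)
        xmitTable.headMap(senderHighestDelivered).clear();
        // 3. cancelling retransmission tasks <= senderHighestDelivered is
        //    omitted here; it depends on the Retransmitter implementation
    }

    public synchronized long highestDelivered() { return highestDelivered; }
    public synchronized int size()             { return xmitTable.size(); }
}
```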
[JBoss JIRA] (JGRP-1396) Merge NakReceiverWindow and Retransmitter
by Bela Ban (Created) (JIRA)
Merge NakReceiverWindow and Retransmitter
-----------------------------------------
Key: JGRP-1396
URL: https://issues.jboss.org/browse/JGRP-1396
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.2
Both NakReceiverWindow and Retransmitter use their own data structures to keep a list of received messages (NRW) and of seqnos to be retransmitted (Retransmitter). This is redundant and costly memory-wise.
I suggest we merge the two classes, or at least let them share the data structure which keeps track of received messages.
Suggestion II: create a ring buffer with a (changeable) capacity that keeps track of received messages and messages to be retransmitted.
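Suggestion II might look roughly like the following. This is a hypothetical, simplified sketch (not the actual implementation, fixed capacity instead of a changeable one, no locking): one seqno-indexed array serves both as the receive window and as the source of missing seqnos for retransmission, so no second data structure is needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a seqno-indexed ring buffer combining the roles of
// NakReceiverWindow (store received messages) and Retransmitter (find gaps).
public class SeqnoRingBuffer<T> {
    private final Object[] buf;
    private final long low; // lowest seqno in the window; slot = seqno % capacity

    public SeqnoRingBuffer(int capacity, long low) {
        this.buf = new Object[capacity];
        this.low = low;
    }

    /** Stores msg at seqno; returns false if seqno is outside the window. */
    public boolean add(long seqno, T msg) {
        if (seqno < low || seqno >= low + buf.length)
            return false;
        buf[(int) (seqno % buf.length)] = msg;
        return true;
    }

    @SuppressWarnings("unchecked")
    public T get(long seqno) {
        if (seqno < low || seqno >= low + buf.length)
            return null;
        return (T) buf[(int) (seqno % buf.length)];
    }

    /** Seqnos in [from..to] with no message yet: these need retransmission. */
    public List<Long> missing(long from, long to) {
        List<Long> result = new ArrayList<>();
        for (long s = from; s <= to; s++)
            if (get(s) == null)
                result.add(s);
        return result;
    }
}
```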
[JBoss JIRA] (JGRP-1402) NAKACK: too much lock contention between sending and receiving messages
by Bela Ban (Created) (JIRA)
NAKACK: too much lock contention between sending and receiving messages
-----------------------------------------------------------------------
Key: JGRP-1402
URL: https://issues.jboss.org/browse/JGRP-1402
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.2
When we have only 1 node in a cluster, sending and receiving messages creates a lot of contention in NakReceiverWindow (NRW). To reproduce:
- Start MPerf
- Press '1' to send 1 million messages
- The throughput is ca. 20-30 MB/sec, compared to ca. 140 MB/sec when running multiple instances of MPerf on the same box!
In the profiler, we can see that the write lock in NRW accounts for ca. 99% of all blocking! Roughly half is caused by NRW.add(), the other half by NRW.removeMany().
The reason is that, when we send a message, it is added to the NRW (add()). The incoming thread then tries to remove as many messages as possible (removeMany()) and thereby blocks the sender's add() calls; vice versa, removeMany() is blocked from accessing the NRW by the many add()s.
SOLUTION 1:
- If we only have 1 member in the cluster, call removeMany() immediately after NRW.add() on the sender. There is no need for a message to be processed by the incoming thread pool if we're the only member in the cluster
- The downside is that we don't reduce the contention on NRW if we have more than 1 member: the lock contention may even slow down clusters with more than 1 member!
SOLUTION 2:
- Make NRW.add() and remove() more efficient, and contend less on the same lock.
- [1] should help.
[1] https://issues.jboss.org/browse/JGRP-1396
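SOLUTION 1 can be sketched as below. This is a hypothetical, toy illustration (not the actual NAKACK code): the window is a plain sorted map standing in for the NRW, and with a single member the sender drains the window itself right after add(), so no incoming thread ever contends for the lock:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the single-member fast path (SOLUTION 1).
public class SingleMemberFastPath {
    private final ConcurrentSkipListMap<Long,String> window = new ConcurrentSkipListMap<>();
    private final List<String> delivered = new ArrayList<>();
    private final int members;
    private long highestDelivered;
    private long seqno;

    public SingleMemberFastPath(int members) { this.members = members; }

    public synchronized void send(String msg) {
        window.put(++seqno, msg); // NRW.add()
        if (members == 1)
            removeMany(); // fast path: no contention with an incoming thread
        // else: the message goes on the wire and the incoming thread calls
        // removeMany() on receipt (omitted here)
    }

    /** Removes and delivers as many consecutive messages as possible. */
    private void removeMany() {
        String m;
        while ((m = window.remove(highestDelivered + 1)) != null) {
            highestDelivered++;
            delivered.add(m);
        }
    }

    public synchronized List<String> delivered() { return delivered; }
}
```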
[JBoss JIRA] (AS7-3410) CLONE - Classloading issue with HornetQLoggerFactory - problem with failback
by Miroslav Novak (JIRA)
Miroslav Novak created AS7-3410:
-----------------------------------
Summary: CLONE - Classloading issue with HornetQLoggerFactory - problem with failback
Key: AS7-3410
URL: https://issues.jboss.org/browse/AS7-3410
Project: Application Server 7
Issue Type: Bug
Components: JMS
Affects Versions: 7.1.0.CR1
Reporter: Miroslav Novak
Assignee: Andy Taylor
Priority: Critical
Fix For: 7.1.0.Final
Attachments: console-log-backup-server.txt, console-log-live-server.txt, reproducer.zip
Test scenario:
1. Start two AS7/EAP6 servers - a live server and its backup in dedicated topology - each on a different machine
2. Kill live server using "kill -9 ..."
3. Start live server again
In step 3 there are unexpected messages in the console logs of the live and backup servers.
From the backup server:
{code}
12:07:26,165 INFO [org.hornetq.core.server.impl.HornetQServerImpl] (Thread-78) HornetQ Server version 2.2.7.Final (HQ_2_2_7_FINAL_AS7, 121) [17700d86-45b2-11e1-a575-d48564b8e1e7] stopped
12:07:26,165 INFO [org.hornetq.core.server.impl.HornetQServerImpl] (Thread-78) unable to restart server, please kill and restart manually: java.lang.IllegalArgumentException: Could not find class org.jboss.as.messaging.HornetQLoggerFactory
at org.hornetq.utils.ClassloadingUtil$1.run(ClassloadingUtil.java:42) [hornetq-core-2.2.7.Final.jar:]
at java.security.AccessController.doPrivileged(Native Method) [:1.6.0_22]
at org.hornetq.utils.ClassloadingUtil.safeInitNewInstance(ClassloadingUtil.java:16) [hornetq-core-2.2.7.Final.jar:]
at org.hornetq.core.server.impl.HornetQServerImpl.instantiateInstance(HornetQServerImpl.java:1868) [hornetq-core-2.2.7.Final.jar:]
at org.hornetq.core.server.impl.HornetQServerImpl.initialiseLogging(HornetQServerImpl.java:1301) [hornetq-core-2.2.7.Final.jar:]
at org.hornetq.core.server.impl.HornetQServerImpl.start(HornetQServerImpl.java:541) [hornetq-core-2.2.7.Final.jar:]
at org.hornetq.core.server.impl.HornetQServerImpl$SharedStoreBackupActivation$1FailbackChecker$1.run(HornetQServerImpl.java:430) [hornetq-core-2.2.7.Final.jar:]
at java.lang.Thread.run(Thread.java:679) [:1.6.0_22]
{code}
From the live server:
{code}
12:07:51,993 INFO [org.jboss.as.messaging] (MSC service thread 1-3) JBAS011601: Bound messaging object to jndi name java:/topic/test
12:07:52,007 INFO [org.jboss.as] (Controller Boot Thread) JBoss EAP 6.0.0.Alpha2 (AS 7.1.0.CR1-redhat-1) started in 31092ms - Started 155 of 263 services (103 services are passive or on-demand)
12:07:53,292 DEBUG [org.hornetq.core.client.impl.ClientSessionFactoryImpl] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Trying reconnection attempt 1
12:07:53,292 DEBUG [org.hornetq.core.remoting.impl.netty.NettyConnector] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Started Netty Connector version 3.2.3.Final-r${buildNumber}
12:07:53,292 DEBUG [org.hornetq.core.client.impl.ClientSessionFactoryImpl] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Trying to connect at the main server using connector :org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5446&host=192-168-10-4
12:07:53,293 DEBUG [org.hornetq.core.client.impl.ClientSessionFactoryImpl] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Main server is not up. Hopefully there's a backup configured now!
12:07:55,293 DEBUG [org.hornetq.core.client.impl.ClientSessionFactoryImpl] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Trying reconnection attempt 2
12:07:55,293 DEBUG [org.hornetq.core.remoting.impl.netty.NettyConnector] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Started Netty Connector version 3.2.3.Final-r${buildNumber}
12:07:55,293 DEBUG [org.hornetq.core.client.impl.ClientSessionFactoryImpl] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Trying to connect at the main server using connector :org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5446&host=192-168-10-4
12:07:55,294 DEBUG [org.hornetq.core.client.impl.ClientSessionFactoryImpl] (Thread-1 (group:HornetQ-client-global-threads-1954829789)) Main server is not up. Hopefully there's a backup
{code}
I'm not entirely sure, but it looks like the backup server did not manage to reach the "waiting for live to fail" state, and the live server is not able to detect the backup.
DEBUG-level logging is set for "org.hornetq" in the attached logs.
I've attached reproducer.zip - steps to use:
1. Download and unzip "reproducer.zip"
2. Prepare live and backup server - "sh prepare.sh"
3. Start live - "sh start-server1.sh server1_hostname"
4. Start backup - "sh start-server2.sh server2_hostname"
5. Kill live server using "kill -9 server1_process_id"
6. Start live server again - "sh start-server1.sh server1_hostname"
Note:
reproducer.zip contains the configuration files standalone-ha-A.xml and standalone-ha-B.xml (A for the live server, B for the backup).
[JBoss JIRA] (AS7-3368) jboss-admin tool doesn't return non zero code when command fails to execute
by Rostislav Svoboda (JIRA)
Rostislav Svoboda created AS7-3368:
--------------------------------------
Summary: jboss-admin tool doesn't return non zero code when command fails to execute
Key: AS7-3368
URL: https://issues.jboss.org/browse/AS7-3368
Project: Application Server 7
Issue Type: Feature Request
Components: Domain Management, Scripts
Affects Versions: 7.1.0.CR1b
Reporter: Rostislav Svoboda
Assignee: Brian Stansberry
Priority: Critical
Fix For: 7.1.0.Final
The jboss-admin tool doesn't return a non-zero exit code when a command fails to execute.
We need to know whether a command was executed properly or not; parsing the output is unacceptable.
{code}
[rsvoboda@rosta-ntb ~]$ TESTING/jboss-as7/bin/jboss-admin.sh --connect command=:shutdown
The controller is not available at localhost:9999
You are disconnected at the moment. Type 'connect' to connect to the server or 'help' for the list of supported commands.
[rsvoboda@rosta-ntb ~]$ echo $?
0
{code}
[JBoss JIRA] (AS7-3400) Description for replace-deployment shows deploy
by Erhard Siegl (JIRA)
Erhard Siegl created AS7-3400:
---------------------------------
Summary: Description for replace-deployment shows deploy
Key: AS7-3400
URL: https://issues.jboss.org/browse/AS7-3400
Project: Application Server 7
Issue Type: Bug
Components: ConfigAdmin
Affects Versions: 7.1.0.CR1b
Reporter: Erhard Siegl
Assignee: Thomas Diesler
In jboss-admin:
[domain@localhost:9999 server-group=other-server-group] :read-operation-description(name="replace-deployment")
{
"outcome" => "success",
"result" => {
"operation-name" => "deploy",
"description" => "Deploy the specified deployment content into the runtime, optionally replacing existing content.",
"reply-properties" => {},
"read-only" => false
}
}
This is the description for the "deploy" operation, not for "replace-deployment".