Tomas Hofman moved JBEAP-15344 to WFLY-10968:
---------------------------------------------
Project: WildFly (was: JBoss Enterprise Application Platform)
Key: WFLY-10968 (was: JBEAP-15344)
Workflow: GIT Pull Request workflow (was: CDW with loose statuses v1)
Component/s: JMS
(was: ActiveMQ)
Target Release: (was: 7.backlog.GA)
Affects Version/s: 14.0.0.Final
(was: 7.1.0.GA)
(was: 7.2.0.GA)
Backup doesn't activate after shared store is reconnected
---------------------------------------------------------
Key: WFLY-10968
URL:
https://issues.jboss.org/browse/WFLY-10968
Project: WildFly
Issue Type: Bug
Components: JMS
Affects Versions: 14.0.0.Final
Environment: NFS configuration
{noformat}
messaging-10.jbm.lab.bos.redhat.com:/hornetq on /mnt/hornetq/client type nfs4
(rw,nosuid,nodev,relatime,sync,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,nosharecache,proto=tcp,timeo=50,retrans=5,sec=sys,clientaddr=10.16.100.40,lookupcache=none,local_lock=none,addr=10.16.100.24)
{noformat}
Java version
{noformat}
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
{noformat}
Reporter: Tomas Hofman
Assignee: Tomas Hofman
Priority: Critical
*Scenario*
# Start a live/backup server pair in a dedicated topology with shared-store HA, with the journal located on NFS
# The NFS mount on the backup server fails
# Reconnect NFS on the backup server
# Try to shut down the live EAP server
# The backup doesn't activate
*What happens*
The backup waits for the live server to fail by checking its file lock. If the connection to
shared storage fails, the backup logs the following error.
{noformat}
05:50:57,896 ERROR [org.apache.activemq.artemis.core.server] (AMQ119000: Activation for
server ActiveMQServerImpl::serverUUID=836c9b1e-f067-11e7-8763-001b21862475) AMQ224000:
Failure in initialisation: java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_151]
at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) [rt.jar:1.8.0_151]
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) [rt.jar:1.8.0_151]
at
org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:299)
[artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
at
org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:316)
[artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
at
org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:127)
[artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
at
org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77)
[artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
at
org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2496)
[artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
{noformat}
The exception is caught in {{SharedStoreBackupActivation.run}} and terminates the
backup activation process.
If NFS is reconnected later, the backup server doesn't resume the activation
process and no longer waits for the live server to fail. If the live server then fails, the backup
doesn't activate, even though it has a connection to shared storage again.
*Expected behaviour*
The backup should keep retrying the live lock check even while the storage is unavailable. It should
log warning/error messages that the storage is unavailable, but it should not terminate the
activation process. This would allow the backup to resume its duties once the storage is
reconnected.
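The retry behaviour proposed above could look roughly like the following sketch. It is not the Artemis implementation and does not reflect how {{FileLockNodeManager}} is structured; the class {{RetryingLockWaiter}}, the {{LockAttempt}} interface and the {{awaitLock}} method are hypothetical names introduced only for illustration. The idea is that an {{IOException}} from a lock attempt is logged and treated as "try again later" rather than as a reason to abort.

```java
import java.io.IOException;

// Hypothetical helper, for illustration only; not part of ActiveMQ Artemis.
final class RetryingLockWaiter {

    /** Hypothetical abstraction over a single, non-blocking lock attempt. */
    @FunctionalInterface
    interface LockAttempt {
        /** Returns true if the lock was acquired, false if someone else still holds it. */
        boolean tryAcquire() throws IOException;
    }

    /**
     * Polls until the lock is acquired. An IOException (for example
     * "Input/output error" from a disconnected NFS mount) is logged and
     * followed by another attempt instead of ending the wait.
     *
     * @return the number of attempts that failed with an IOException
     */
    static int awaitLock(LockAttempt attempt, long retryDelayMillis)
            throws InterruptedException {
        int ioFailures = 0;
        while (true) {
            try {
                if (attempt.tryAcquire()) {
                    return ioFailures;
                }
            } catch (IOException e) {
                ioFailures++;
                // Report the problem, but keep the wait loop alive.
                System.err.println("WARN: shared store unavailable, retrying: "
                        + e.getMessage());
            }
            // Wait before the next attempt; an interrupt still stops the loop.
            Thread.sleep(retryDelayMillis);
        }
    }

    private RetryingLockWaiter() {
    }
}
```

With a real file lock, the attempt could be supplied as {{() -> channel.tryLock() != null}}, where {{channel}} is an open {{java.nio.channels.FileChannel}} on the lock file. Whether a channel opened before the NFS outage remains usable after the mount comes back is an open question for this sketch; a real fix might need to reopen the file as well.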