[
https://issues.jboss.org/browse/WFLY-6781?page=com.atlassian.jira.plugin....
]
Preeta Kuruvilla edited comment on WFLY-6781 at 6/29/16 10:18 AM:
------------------------------------------------------------------
The cluster we have is a domain-managed cluster, with domain.xml and host.xml
configured on Node1 and only host.xml configured on Node2. JGroups is a subsystem in
domain.xml under the "ha" profile.
Regarding our application, we have two components: RC.war and SL.war. JMS is configured
on SL. Only the RC component is clustered; SL is not.
In domain.xml we have:
<server-groups>
    <server-group name="main-server-group" profile="ha">
        <socket-binding-group ref="ha-sockets"/>
        <deployments>
            <deployment name="RC.war" runtime-name="RC.war"/>
        </deployments>
    </server-group>
    <server-group name="other-server-group" profile="ha">
        <socket-binding-group ref="standard-sockets"/>
        <paths>
            <path name="jboss.server.log.dir.SL" path="jboss.server.log.dir"/>
        </paths>
        <deployments>
            <deployment name="SL.war" runtime-name="SL.war"/>
        </deployments>
    </server-group>
</server-groups>
***********************
In host.xml we have:
Node 1 has two server instances: RC and SL.
<servers>
    <server name="server-host1-RC" group="main-server-group" auto-start="true">
        <jvm name="default">
            <jvm-options>
                ............
            </jvm-options>
        </jvm>
        <socket-bindings socket-binding-group="ha-sockets" port-offset="0"/>
    </server>
    <server name="server-host1-SL" group="other-server-group" auto-start="true">
        <jvm name="default">
            <jvm-options>
                ........
            </jvm-options>
        </jvm>
        <socket-bindings socket-binding-group="standard-sockets" port-offset="0"/>
    </server>
</servers>
Node 2 has only one server instance, which runs RC:
<servers>
    <server name="server-host2-RC" group="main-server-group" auto-start="true">
        <jvm name="default">
            <jvm-options>
                .............
            </jvm-options>
        </jvm>
        <socket-bindings socket-binding-group="ha-sockets" port-offset="0"/>
    </server>
</servers>
When I say it's not working as expected during failover testing, I mean that the
communication between RC and SL breaks. Once communication is broken, our current
workaround is to restart RC and SL on the cluster, which makes everything work as before.
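For the record, the restart workaround we apply can also be driven from the domain controller with jboss-cli, using the standard `restart-servers` operation on each server group (this is only a sketch of how we do it; the group names match the domain.xml above):

```
# Restart every server in each group from jboss-cli connected to the domain controller
/server-group=main-server-group:restart-servers
/server-group=other-server-group:restart-servers
```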
By the way, RC communicates with SL over HTTP on port 6080.
RC communicates with JMS on SL (JMS JNDI URL: http-remoting://<ip address of SL, which is
Node1>:6080/).
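To make the RC-to-SL JMS path concrete, below is a minimal sketch of how such a remote lookup over http-remoting could look on the RC side. The class name, the SL address, and the `jms/RemoteConnectionFactory` lookup name are illustrative assumptions (the latter is WildFly's conventional remote connection factory name), not code taken from our application:

```java
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class Rc2SlJndiSketch {

    // Builds the JNDI provider URL RC would use to reach JMS on SL (port 6080).
    static String providerUrl(String slHost) {
        return "http-remoting://" + slHost + ":6080";
    }

    public static void main(String[] args) {
        Properties env = new Properties();
        // WildFly 8.x remote naming client factory; requires the
        // jboss-client jar on the classpath at runtime.
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, providerUrl("10.0.0.1")); // assumed Node1/SL address
        try {
            Context ctx = new InitialContext(env);
            // Conventional WildFly remote connection factory lookup name.
            Object cf = ctx.lookup("jms/RemoteConnectionFactory");
            System.out.println("Looked up: " + cf);
            ctx.close();
        } catch (NamingException e) {
            // With SL unreachable (or the client jar missing), the lookup fails
            // here -- the same symptom we see after failover until RC/SL restart.
            System.out.println("JMS lookup failed: " + e.getMessage());
        }
    }
}
```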
Just a note: everything works properly in production as long as we don't exercise the
failover testing scenarios (disabling the network, powering off, etc.).
Let me know if you need any other information.
Thanks,
Preeta
WildFly cluster's failover functionality doesn't work as expected
-----------------------------------------------------------------
Key: WFLY-6781
URL:
https://issues.jboss.org/browse/WFLY-6781
Project: WildFly
Issue Type: Bug
Components: Clustering
Affects Versions: 8.2.0.Final
Reporter: Preeta Kuruvilla
Assignee: Paul Ferraro
Priority: Blocker
Following are the testing scenarios we ran and their outcomes:
1. Disabling the network on a VM to test failover – not working in either the Linux or
the Windows environment.
2. Powering off a VM via the VMware client to test failover – works in the Linux
environment but not in the Windows environment.
3. Ctrl+C to stop services on a node to test failover – works in both the Linux and
Windows environments.
4. Stopping the server running on a node/VM via the Admin Console to test failover –
works in both the Linux and Windows environments.
The JGroups subsystem configuration we have in domain.xml is below:
<subsystem xmlns="urn:jboss:domain:jgroups:2.0" default-stack="udp">
    <stack name="udp">
        <transport type="UDP" socket-binding="jgroups-udp"/>
        <protocol type="PING"/>
        <protocol type="MERGE3"/>
        <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
        <protocol type="FD_ALL"/>
        <protocol type="VERIFY_SUSPECT"/>
        <protocol type="pbcast.NAKACK2"/>
        <protocol type="UNICAST3"/>
        <protocol type="pbcast.STABLE"/>
        <protocol type="pbcast.GMS"/>
        <protocol type="UFC"/>
        <protocol type="MFC"/>
        <protocol type="FRAG2"/>
        <protocol type="RSVP"/>
    </stack>
    <stack name="tcp">
        <transport type="TCP" socket-binding="jgroups-tcp"/>
        <protocol type="MPING" socket-binding="jgroups-mping"/>
        <protocol type="MERGE2"/>
        <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
        <protocol type="FD"/>
        <protocol type="VERIFY_SUSPECT"/>
        <protocol type="pbcast.NAKACK2"/>
        <protocol type="UNICAST3"/>
        <protocol type="pbcast.STABLE"/>
        <protocol type="pbcast.GMS"/>
        <protocol type="MFC"/>
        <protocol type="FRAG2"/>
        <protocol type="RSVP"/>
    </stack>
</subsystem>
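Since the "disable network" scenario hinges on how fast JGroups declares a member dead, one relevant knob in the udp stack above is the FD_ALL heartbeat timing. A sketch of how those properties could be tuned in this subsystem schema is below; `timeout` and `interval` are standard JGroups FD_ALL properties, but the values shown are illustrative assumptions, not settings from our deployment:

```xml
<!-- Illustrative only: tighter FD_ALL heartbeat timing for the "udp" stack.
     Values are assumed examples, not the reporter's configuration. -->
<protocol type="FD_ALL">
    <property name="interval">3000</property>
    <property name="timeout">15000</property>
</protocol>
```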
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)