[jboss-jira] [JBoss JIRA] (WFLY-6762) Wildfly cluster failover test not working as expected on windows OS, when network is disabled on a VM(Node) or by shutting down the VM (Node).

Mon Jun 27 10:08:00 EDT 2016

    [ https://issues.jboss.org/browse/WFLY-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257442#comment-13257442 ] 

Preeta Kuruvilla commented on WFLY-6762:
----------------------------------------

Thanks Paul. We have configured UDP based multicast for Ehcache -Cache replication for Wildfly cluster. This uses the jgroups.
Below is the udp.xml. 

<!--
  Default stack using IP multicasting. It is similar to the "udp"
  stack in stacks.xml, but doesn't use streaming state transfer and flushing
  author: Bela Ban
-->

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
    <UDP
         mcast_port="${jgroups.udp.mcast_port:45588}"
         ip_ttl="4"
         tos="8"
         ucast_recv_buf_size="5M"
         ucast_send_buf_size="5M"
         mcast_recv_buf_size="5M"
         mcast_send_buf_size="5M"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_diagnostics="true"
         thread_naming_pattern="cl"

         timer_type="new3"
         timer.min_threads="2"
         timer.max_threads="4"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"

         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="8"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="true"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>

    <PING />
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD_ALL/>
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 xmit_interval="500"
                    xmit_table_num_rows="100"
                    xmit_table_msgs_per_row="2000"
                    xmit_table_max_compaction_time="30000"
                    max_msg_batch_size="500"
                    use_mcast_xmit="true"
                    discard_delivered_msgs="true"/>
    <UNICAST3 xmit_interval="500"
              xmit_table_num_rows="100"
              xmit_table_msgs_per_row="2000"
              xmit_table_max_compaction_time="60000"
              conn_expiry_timeout="0"
              max_msg_batch_size="500"/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="2000"
                view_bundling="true"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <RSVP resend_interval="2000" timeout="10000"/>
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

> Wildfly cluster failover test not working as expected on windows OS, when network is disabled on a VM(Node) or by shutting down the VM (Node).
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: WFLY-6762
>                 URL: https://issues.jboss.org/browse/WFLY-6762
>             Project: WildFly
>          Issue Type: Quality Risk
>          Components: Clustering
>    Affects Versions: 8.2.0.Final
>            Reporter: Preeta Kuruvilla
>            Assignee: Paul Ferraro
>            Priority: Blocker
>
> In your mail related to WFLY-6749 you has said the below :-
> **The default stack contains the following failure detection protocols:
> FD_SOCK 
> FD_ALL
> These protocols are described here:
> http://www.jgroups.org/manual/index.html#FailureDetection
> I suspect that your method of simulating a failure - by disabling the network of the host machine is not being detected by FD_SOCK. It will however, be detected by FD_ALL, but only after 1 minute. The heartbeat timeout used by FD_ALL can be manipulated via the timeout property.
> e.g.
> <protocol type="FD_ALL" ><property name="timeout">60000</property></protocol>
> **************************************************************************************************
> Thanks for the quick response on WFLY-6749.
> Based on your suggestion, I had a taken a look at the testing scenarios mentioned in "Table 29. Failure detection behavior" in the link that you provided- http://www.jgroups.org/manual/index.html#FailureDetection. No where its mentioned that disabling a network on a node, is a valid testing scenario in Wildfly cluster. 
> The Failover is working properly when the network on a node is disabled on a weblogic cluster for our application. However it doesn't work and it hampers the application functionality on Wildfly cluster when we try to disable the network on a node in Wildfly cluster.
> However as I said earlier, the failover on wildfly cluster works when we stop a node from admin console or give Ctrl + C to stop the services on a node.
> Would like to get a confirmation from you that disabling the network on a node is not the valid failover testing scenario for wildfly cluster.
> Also we tried to test the same failover scenario by Shutting down/power off a VM (node) in a wildfly cluster. It did not work for Windows Environment although it worked for linux environment. 
> Note: we are using Windows 2012 environment. Here is a link I found: http://stackoverflow.com/questions/31218710/unable-to-stop-wildfly-8-2-service-on-windows
> https://developer.jboss.org/thread/238135?tstart=0
> Thanks,
> Preeta

--
This message was sent by Atlassian JIRA
(v6.4.11#64026)