[jboss-jira] [JBoss JIRA] (WFLY-12214) JGRP000029: failed sending message: java.net.ConnectException: Connection refused

Tommasso Borgato (Jira) issues at jboss.org
Thu Jun 20 07:52:01 EDT 2019


     [ https://issues.jboss.org/browse/WFLY-12214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommasso Borgato updated WFLY-12214:
------------------------------------
    Description: 
The error is observed in fail-over clustering tests where the fail-over scenario is "shutdown" and the JGroups stack is TCP.

The error was not observed in wildfly-17.0.0.Final.zip.


Right after one node (wildfly1) is shut down and restarted, and the next node (wildfly2) is shut down, we see:

{noformat}
2019-06-20 10:55:07,021 INFO  [org.infinispan.CLUSTER] (thread-84,ejb,wildfly1) ISPN000094: Received new cluster view for channel ejb: [wildfly3|6] (3) [wildfly3, wildfly4, wildfly1]
2019-06-20 10:55:07,022 INFO  [org.infinispan.CLUSTER] (thread-84,ejb,wildfly1) ISPN100001: Node wildfly2 left the cluster
2019-06-20 10:55:07,109 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
2019-06-20 10:55:07,251 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
2019-06-20 10:55:07,342 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
2019-06-20 10:55:07,437 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
{noformat}

Complete logs [here|https://eap-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP7/view/EAP7-Clustering/view/EAP7-Clustering-Database/job/eap-7.x-clustering-db-session-shutdown-repl-mssql-2016/2/artifact/report/wildfly/wlf_20194320-104347-wildfly-service-1-server.log/*view*/].

The number of errors is about 1,000 per node; the overall fail rate is still close to 0%.
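For reference, the per-node error counts quoted above can be reproduced by searching the downloaded server logs for the JGRP000029 code. A minimal sketch, assuming the log artifact has been saved locally as server.log (a placeholder name):

```shell
# Count JGRP000029 send failures in a downloaded server log.
# 'server.log' is a placeholder; substitute the actual artifact path.
grep -c 'JGRP000029' server.log
```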


  was:
The error is observed in fail-over clustering tests where the fail-over scenario is "shutdown" and the JGroups subsystem is configured as follows:

{noformat}
        <subsystem xmlns="urn:jboss:domain:jgroups:7.0">
            <channels default="ee">
                <channel name="ee" stack="tcp" cluster="ejb"/>
            </channels>
            <stacks default="tcp">
                <stack name="udp">
                    <transport type="UDP" socket-binding="jgroups-udp">
                        <property name="ip_ttl">32</property>
                    </transport>
                    <protocol type="PING"/>
                    <protocol type="MERGE3"/>
                    <socket-protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
                    <protocol type="FD_ALL"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS"/>
                    <protocol type="UFC"/>
                    <protocol type="MFC"/>
                    <protocol type="FRAG3"/>
                </stack>
                <stack name="tcp">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <socket-protocol type="MPING" socket-binding="jgroups-mping"/>
                    <protocol type="MERGE3"/>
                    <socket-protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                    <protocol type="FD_ALL"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS"/>
                    <protocol type="MFC"/>
                    <protocol type="FRAG3"/>
                </stack>
            </stacks>
        </subsystem>
{noformat}
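For comparison, the effective stack on a running server can be read back with the JBoss CLI; a sketch, with the operation path assumed from the configuration above:

{noformat}
/subsystem=jgroups/stack=tcp:read-resource(recursive=true)
{noformat}

This shows whether FD_SOCK is registered as a socket-protocol with the jgroups-tcp-fd binding, which is the visible structural difference from the jgroups:6.0 configuration below.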

It wasn't observed with the previous version:

{noformat}
        <subsystem xmlns="urn:jboss:domain:jgroups:6.0">
            <channels default="ee">
                <channel name="ee" stack="tcp" cluster="ejb"/>
            </channels>
            <stacks default="tcp">
                <stack name="udp">
                    <transport type="UDP" socket-binding="jgroups-udp">
                        <property name="ip_ttl">32</property>
                    </transport>
                    <protocol type="PING"/>
                    <protocol type="MERGE3"/>
                    <protocol type="FD_SOCK"/>
                    <protocol type="FD_ALL"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS"/>
                    <protocol type="UFC"/>
                    <protocol type="MFC"/>
                    <protocol type="FRAG3"/>
                </stack>
                <stack name="tcp">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <socket-protocol type="MPING" socket-binding="jgroups-mping"/>
                    <protocol type="MERGE3"/>
                    <protocol type="FD_SOCK"/>
                    <protocol type="FD_ALL"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS"/>
                    <protocol type="MFC"/>
                    <protocol type="FRAG3"/>
                </stack>
            </stacks>
        </subsystem>
{noformat}

Right after one node (wildfly1) is shut down and restarted, and the next node (wildfly2) is shut down, we see:

{noformat}
2019-06-20 08:00:46,880 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 82) WFLYUT0021: Registered web context: '/clusterbench-granular' for server 'default-server'
2019-06-20 08:00:46,939 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 88) WFLYUT0021: Registered web context: '/clusterbench' for server 'default-server'
2019-06-20 08:00:47,024 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 85) WFLYUT0021: Registered web context: '/clusterbench-passivating' for server 'default-server'
2019-06-20 08:00:47,331 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "clusterbench-ee8.ear" (runtime-name : "clusterbench-ee8.ear")
2019-06-20 08:00:47,335 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 47) WFLYSRV0010: Deployed "postgresql-connector.jar" (runtime-name : "postgresql-connector.jar")
2019-06-20 08:00:47,560 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
2019-06-20 08:00:47,562 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://10.0.146.117:9990/management
2019-06-20 08:00:47,563 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://10.0.146.117:9990
2019-06-20 08:00:47,563 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly Full 18.0.0.Beta1-SNAPSHOT (WildFly Core 9.0.1.Final) started in 26227ms - Started 1065 of 1293 services (538 services are lazy, passive or on-demand)
2019-06-20 08:02:19,985 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-8,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (134 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:ejb, UNICAST3: DATA, seqno=12148, TP: [cluster=ejb]
2019-06-20 08:02:20,002 INFO  [org.infinispan.CLUSTER] (thread-216,ejb,wildfly1) ISPN000094: Received new cluster view for channel ejb: [wildfly3|6] (3) [wildfly3, wildfly4, wildfly1]
2019-06-20 08:02:20,006 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-8,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (60 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: UNICAST3: ACK, seqno=284, conn_id=4, ts=109, TP: [cluster=ejb]
2019-06-20 08:02:20,027 INFO  [org.infinispan.CLUSTER] (thread-216,ejb,wildfly1) ISPN100001: Node wildfly2 left the cluster
2019-06-20 08:02:20,031 INFO  [org.infinispan.CLUSTER] (thread-216,ejb,wildfly1) ISPN000094: Received new cluster view for channel ejb: [wildfly3|6] (3) [wildfly3, wildfly4, wildfly1]
{noformat}

Complete logs [here|https://eap-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP7/view/EAP7-Clustering/view/EAP7-Clustering-Database/job/eap-7.x-clustering-db-session-shutdown-repl-postgresql-10.1-offload-profile/11/].

The number of errors is about 3,000 per node; the overall fail rate is still low, about 0.55%, but it has increased noticeably compared to the previous version, where it was about 0.01%.




> JGRP000029: failed sending message: java.net.ConnectException: Connection refused
> ---------------------------------------------------------------------------------
>
>                 Key: WFLY-12214
>                 URL: https://issues.jboss.org/browse/WFLY-12214
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 18.0.0.Beta1
>            Reporter: Tommasso Borgato
>            Assignee: Paul Ferraro
>            Priority: Major
>
> The error is observed in fail-over clustering tests where fail-over is "shutdown" and the jgroups stack is TCP.
> The error was not observed in:
> wildfly-17.0.0.Final.zip
> Right after one node (wildfly1) is shut down and restarted and the next node (wildfly2) is shut down we see:
> {noformat}
> 2019-06-20 10:55:07,021 INFO  [org.infinispan.CLUSTER] (thread-84,ejb,wildfly1) ISPN000094: Received new cluster view for channel ejb: [wildfly3|6] (3) [wildfly3, wildfly4, wildfly1]
> 2019-06-20 10:55:07,022 INFO  [org.infinispan.CLUSTER] (thread-84,ejb,wildfly1) ISPN100001: Node wildfly2 left the cluster
> 2019-06-20 10:55:07,109 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
> 2019-06-20 10:55:07,251 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
> 2019-06-20 10:55:07,342 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
> 2019-06-20 10:55:07,437 ERROR [org.jgroups.protocols.TCP] (TQ-Bundler-7,ejb,wildfly1) JGRP000029: wildfly1: failed sending message to wildfly2 (133 bytes): java.net.ConnectException: Connection refused (Connection refused), headers: FORK: ejb:web, UNICAST3: DATA, seqno=44773, TP: [cluster=ejb]
> {noformat}
> Complete logs [here|https://eap-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP7/view/EAP7-Clustering/view/EAP7-Clustering-Database/job/eap-7.x-clustering-db-session-shutdown-repl-mssql-2016/2/artifact/report/wildfly/wlf_20194320-104347-wildfly-service-1-server.log/*view*/].
> The number of errors is about 1000 per node;
> Overall fail-rate is still close to 0%.



--
This message was sent by Atlassian Jira
(v7.12.1#712002)


More information about the jboss-jira mailing list