[jboss-jira] [JBoss JIRA] (WFLY-11682) Clustered SLSB membership anomalies when all cluster members removed

Richard Achmatowicz (Jira) issues at jboss.org
Wed Mar 13 11:51:00 EDT 2019


    [ https://issues.jboss.org/browse/WFLY-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707930#comment-13707930 ] 

Richard Achmatowicz commented on WFLY-11682:
--------------------------------------------

I have looked at the crash scenario. A discovery call consists of two parts: refreshing the discovered node registry (DNR), then carrying out the search in the DNR. The refresh attempts to create an authenticated connection, followed by a client channel, to each of the configured connections (those listed in the client context configuration file) as well as to any known URIs received as topology updates. While these connections and channels are being attempted, the discovery provider keeps track of exceptions encountered during both connection establishment and channel establishment. The exceptions of interest here are java.net.ConnectException instances raised during connection establishment. If we keep track of those during the refresh of the DNR, we could use them, before the search starts, to remove any cluster node that raised a connect exception and was the only remaining node in its cluster.
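
To make the idea concrete, here is a minimal, self-contained sketch of the bookkeeping described above. It is not the actual discovery provider code: the class DnrRefreshSketch, the Connector interface, and the methods addNode, refresh, pruneSingleNodeClusters and clusterNames are all hypothetical names chosen for illustration, and the registry is reduced to a plain map from cluster name to node URIs. Channel establishment and the configured (non-topology) connections are left out.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the discovered node registry (DNR).
class DnrRefreshSketch {

    // Hypothetical abstraction over "create an authenticated connection".
    interface Connector {
        void connect(URI uri) throws IOException;
    }

    // cluster name -> (node name -> node URI)
    private final Map<String, Map<String, URI>> clusters = new HashMap<>();

    void addNode(String cluster, String node, URI uri) {
        clusters.computeIfAbsent(cluster, c -> new HashMap<>()).put(node, uri);
    }

    Set<String> clusterNames() {
        return new HashSet<>(clusters.keySet());
    }

    // Refresh step: try every known URI and remember the ones that
    // failed with java.net.ConnectException during connection setup.
    Set<URI> refresh(Connector connector) {
        Set<URI> refused = new HashSet<>();
        for (Map<String, URI> nodes : clusters.values()) {
            for (URI uri : nodes.values()) {
                try {
                    connector.connect(uri);
                } catch (ConnectException e) {
                    refused.add(uri);
                } catch (IOException e) {
                    // Other failures are not treated as "node is gone" here.
                }
            }
        }
        return refused;
    }

    // Before the search: drop clusters whose only node refused the connection.
    void pruneSingleNodeClusters(Set<URI> refused) {
        clusters.values().removeIf(
                nodes -> nodes.size() == 1 && refused.containsAll(nodes.values()));
    }

    public static void main(String[] args) {
        DnrRefreshSketch dnr = new DnrRefreshSketch();
        dnr.addNode("ejb", "node3", URI.create("remote+http://node3:8080"));
        Set<URI> refused = dnr.refresh(uri -> {
            throw new ConnectException("Connection refused");
        });
        dnr.pruneSingleNodeClusters(refused);
        System.out.println("clusters left after pruning: " + dnr.clusterNames());
    }
}
```

Only clusters with exactly one remaining node are pruned in this sketch, on the assumption that nodes of larger clusters are still removed through the usual topology updates sent by the surviving members.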

  

> Clustered SLSB membership anomalies when all cluster members removed
> --------------------------------------------------------------------
>
>                 Key: WFLY-11682
>                 URL: https://issues.jboss.org/browse/WFLY-11682
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering, EJB
>    Affects Versions: 15.0.1.Final
>         Environment: WildFly running in an n-node cluster with an EJB client sending requests even during the time the cluster is down.
>            Reporter: Jörg Bäsner
>            Assignee: Richard Achmatowicz
>            Priority: Major
>         Attachments: node1.txt, node12.txt, node2.txt, node3.txt, playground.zip
>
>
> This description is based on a 3 node cluster. Cluster nodes 1 and 2 are configured in the {{PROVIDER_URL}}; node 3 is not.
> The client has a custom ClusterNodeSelector implementation that prints the {{connectedNodes}} and the {{availableNodes}} and does random balancing.
> As long as all nodes are up and running, the client calls EJBs in a balanced way.
> When node1 is shut down, the client gets the notification below:
> {code}
> ...
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
> ...
> {code}
> Then node2 is shut down. Again the client gets the information, see:
> {code}
> ...
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node2)
> ...
> {code}
> Finally node3 is shut down. Now the client only gets the following information:
> {code}
> ...
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> ...
> {code}
> This means the client is not informed that _node3_, the last node of the cluster, has been stopped.
> From this point on the client always gets {{Caused by: java.net.ConnectException: Connection refused}}
> Now node1 is started again, resulting in the following output for {{connectedNodes}} and the {{availableNodes}}:
> {code}
> ...
> INFO  (ThreadPoolTaskExecutor-1) [com.jboss.examples.ejb.CustomClusterNodeSelector] connectedNodes(1) '[node1]', availableNodes(2) '[node3, node1]'
> ...
> {code}



--
This message was sent by Atlassian Jira
(v7.12.1#712002)


