Richard Achmatowicz commented on WFLY-11682:
--------------------------------------------
On the server side, the old EJB client implementation classes EJBRemoteConnectorService
and VersionOneProtocolChannelReceiver have been replaced by a number of classes which
work together to do the same thing: EJBRemoteConnectorService, RemoteEJBService,
AssociationService, AssociationImpl and EJBServerChannel.
The VersionOneProtocolChannelReceiver held an instance of RegistryCollector (which holds
the client mapping entries for each cluster the server is a member of). This allowed the
server to be a member of multiple clusters at the same time and to be notified when a
cluster was added or removed (registryAdded()/registryRemoved()). A call to
registryRemoved() on the last node in the cluster would cause the server to send a message
to the client indicating that the cluster had been removed entirely, and was the way for
the last node to clear that cluster's information from the client.
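To make that flow concrete, here is a minimal sketch of the old callback shape. The
interfaces and method bodies below are hypothetical stand-ins modelled on the description
above, not the actual WildFly RegistryCollector/Registry API.
{code}
import java.util.Set;

// Hypothetical stand-ins for the old Registry/RegistryCollector callback shape;
// the names follow the description above, not the real WildFly API.
interface Registry {
    String getClusterName();
    Set<String> getClientMappings();   // client mapping entries for this cluster
}

interface RegistryCollectorListener {
    void registryAdded(Registry registry);    // this server joined a cluster
    void registryRemoved(Registry registry);  // cluster removed entirely (last member)
}

final class OldChannelReceiverSketch implements RegistryCollectorListener {

    @Override
    public void registryAdded(Registry registry) {
        // start pushing this cluster's topology / client mappings to connected clients
    }

    @Override
    public void registryRemoved(Registry registry) {
        // the last node's chance to tell clients to drop this cluster entirely
        sendClusterRemovedToClients(registry.getClusterName());
    }

    private void sendClusterRemovedToClients(String clusterName) {
        // in the old VersionOneProtocolChannelReceiver this wrote a cluster-removed
        // message on each open client channel
    }
}
{code}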
The new arrangement only supports server membership of a single cluster: AssociationImpl,
rather than holding a reference to a RegistryCollector, now holds only a single Registry
storing the client mappings for the one cluster this server is a member of. Consequently,
unlike before, we get no notification of the last node leaving via the RegistryCollector.
We need to recreate this somehow.
We want to send out a CLUSTER_REMOVED message to all connected clients from the last node
in the cluster when we are sure that the last node is in fact shutting down. Looking at
the dependencies between the various services, we have EJBRemoteConnector ->
AssociationService, so that the EJBRemoteConnector will shut down before the
AssociationService. Also the EJBRemoteConnector, in its stop() method, shuts down the
Endpoint for the Remoting connector, making all channel connections to clients
inaccessible thereafter. So it seems the only reasonable place to send out this message
is at the beginning of the stop() method of EJBRemoteConnector while the channel
connections are still open.
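Here is a minimal sketch of that ordering. ClientChannel and ClusterView are hypothetical
stand-ins for the real Remoting channels and the Registry held by AssociationImpl; the
sketch only illustrates where the CLUSTER_REMOVED send has to sit relative to the endpoint
shutdown and is not the actual EJBRemoteConnectorService code.
{code}
import java.util.List;

// Hypothetical stand-ins; the real code works with Remoting channels and the
// Registry held by AssociationImpl.
interface ClientChannel {
    void sendClusterRemoved(String clusterName);
    void close();
}

interface ClusterView {
    String getClusterName();
    List<String> getMembers();
}

final class ConnectorStopSketch {

    private final List<ClientChannel> openChannels;
    private final ClusterView cluster;

    ConnectorStopSketch(List<ClientChannel> openChannels, ClusterView cluster) {
        this.openChannels = openChannels;
        this.cluster = cluster;
    }

    // mirrors the intended ordering at the start of EJBRemoteConnectorService.stop()
    void stop() {
        // 1. While the client channels are still open, and only if this server is
        //    the last cluster member, tell every connected client the cluster is gone.
        if (cluster.getMembers().size() == 1) {
            for (ClientChannel channel : openChannels) {
                channel.sendClusterRemoved(cluster.getClusterName());
            }
        }
        // 2. Only afterwards close the connections (the real code shuts down the
        //    Remoting Endpoint here); once closed, no message can reach any client.
        for (ClientChannel channel : openChannels) {
            channel.close();
        }
    }
}
{code}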
I have implemented this and it works based on my limited testing. I'll push a PR for
now so that I can get some feedback.
As Paul noted, this does not solve the problem in the case where the last node crashes
(and so never gets a chance to shut down cleanly); the only reasonable place to deal with
that special case is on the client (somehow). For now, though, I'm making this change
available to handle the case of clean shutdown.
Also, I'll implement the missing suspend/resume behavior in another issue; it's unrelated
to this one. We should also probably revisit the single-cluster membership limitation
sometime soon.
Clustered SLSB membership anomalies when all cluster members removed
--------------------------------------------------------------------
Key: WFLY-11682
URL:
https://issues.jboss.org/browse/WFLY-11682
Project: WildFly
Issue Type: Bug
Components: Clustering, EJB
Affects Versions: 15.0.1.Final
Environment: WildFly running in an n-node cluster with an EJB client sending
requests even during the time the cluster is down.
Reporter: Jörg Bäsner
Assignee: Richard Achmatowicz
Priority: Major
Attachments: node1.txt, node12.txt, node2.txt, node3.txt, playground.zip
This description is based on a 3-node cluster. Cluster nodes 1 and 2 are configured
in the {{PROVIDER_URL}}; node 3 is not.
The client has a custom ClusterNodeSelector implementation that prints the
{{connectedNodes}} and the {{availableNodes}} and performs random balancing, roughly as
sketched below.
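For reference, a minimal sketch of what such a selector might look like, assuming the
org.jboss.ejb.client.ClusterNodeSelector interface with its
selectNode(String, String[], String[]) method; the real selector is in the attached
playground.zip and may differ, this just reproduces the logged output and the random pick.
{code}
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

import org.jboss.ejb.client.ClusterNodeSelector;

// Sketch of a selector producing output like the log lines quoted below.
public class CustomClusterNodeSelector implements ClusterNodeSelector {

    @Override
    public String selectNode(String clusterName, String[] connectedNodes, String[] availableNodes) {
        // Print the two views the EJB client passes in.
        System.out.printf("connectedNodes(%d) '%s', availableNodes(%d) '%s'%n",
                connectedNodes.length, Arrays.toString(connectedNodes),
                availableNodes.length, Arrays.toString(availableNodes));
        // Random balancing: prefer already-connected nodes when there are any.
        String[] pool = connectedNodes.length > 0 ? connectedNodes : availableNodes;
        return pool[ThreadLocalRandom.current().nextInt(pool.length)];
    }
}
{code}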
As long as all nodes are up and running, the client calls EJBs in a balanced way.
When node1 is shut down, the client gets the notifications below:
{code}
...
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received
CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received
CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
...
{code}
Then node2 is shut down. Again, the client gets the corresponding notifications:
{code}
...
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received
CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node2)
...
{code}
Finally, node3 is shut down. Now the client only gets the following information:
{code}
...
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
...
{code}
This means the client is not informed that _node3_, the last node of the cluster, has
been stopped.
From this point on, the client always gets {{Caused by: java.net.ConnectException:
Connection refused}}.
Now node1 is started again, resulting in the following output for {{connectedNodes}} and
{{availableNodes}}; note that node3 still appears among the {{availableNodes}} even though
it has been shut down:
{code}
...
INFO (ThreadPoolTaskExecutor-1) [com.jboss.examples.ejb.CustomClusterNodeSelector]
connectedNodes(1) '[node1]', availableNodes(2) '[node3, node1]'
...
{code}