[jboss-jira] [JBoss JIRA] (WFLY-11682) Clustered SLSB membership anomalies when all cluster members removed

Richard Achmatowicz (Jira) issues at jboss.org
Sat Mar 2 13:19:01 EST 2019


    [ https://issues.jboss.org/browse/WFLY-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702910#comment-13702910 ] 

Richard Achmatowicz commented on WFLY-11682:
--------------------------------------------

On the server side, the old EJB client implementation classes EJBRemoteConnectorService and VersionOneProtocolChannelReceiver have been replaced by a number of classes which work together to do the same thing: EJBRemoteConnectorService, RemoteEJBService, AssociationService, AssociationImpl and EJBServerChannel.

The VersionOneProtocolChannelReceiver held an instance of RegistryCollector (which holds the client mappings entries for each cluster the server is a member of). This allowed the server to be a member of multiple clusters at the same time, as well as notifying the server when a cluster was being removed (registryAdded()/registryRemoved()). A call to registryRemoved() by the last node in the cluster would cause the server to send a message to the client indicating that the cluster had been removed entirely; this was how the last node cleared that cluster's information from the client.
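
To make the old notification path concrete, here is a minimal, self-contained sketch of the listener pattern described above. It is not the WildFly code: the Collector and Receiver classes, the String-based signatures and the "added:"/"removed:" message strings are all simplified placeholders of my own. Only the registryAdded()/registryRemoved() callback names come from the real classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegistryCollectorSketch {

    /** Callback invoked when a cluster registry appears or goes away. */
    interface Listener {
        void registryAdded(String clusterName);
        void registryRemoved(String clusterName);
    }

    /** Simplified collector (hypothetical): one client-mappings registry per cluster. */
    static final class Collector {
        private final Map<String, Map<String, String>> registries = new LinkedHashMap<>();
        private final List<Listener> listeners = new ArrayList<>();

        void addListener(Listener listener) {
            listeners.add(listener);
        }

        void addRegistry(String clusterName, Map<String, String> clientMappings) {
            registries.put(clusterName, clientMappings);
            for (Listener l : listeners) {
                l.registryAdded(clusterName);
            }
        }

        void removeRegistry(String clusterName) {
            // Notify only if the cluster was actually known to this server.
            if (registries.remove(clusterName) != null) {
                for (Listener l : listeners) {
                    l.registryRemoved(clusterName);
                }
            }
        }
    }

    /** Stand-in for the channel receiver: turns registry events into client messages. */
    static final class Receiver implements Listener {
        // Placeholder for messages that would be written to the client channel.
        final List<String> sentToClient = new ArrayList<>();

        @Override
        public void registryAdded(String clusterName) {
            sentToClient.add("added:" + clusterName);
        }

        @Override
        public void registryRemoved(String clusterName) {
            sentToClient.add("removed:" + clusterName);
        }
    }

    public static void main(String[] args) {
        Collector collector = new Collector();
        Receiver receiver = new Receiver();
        collector.addListener(receiver);

        collector.addRegistry("ejb", new LinkedHashMap<>());
        collector.removeRegistry("ejb");

        System.out.println(receiver.sentToClient);
    }
}
```

The point of the sketch is that the removal message is driven by a callback from the collector, which is exactly the hook that is missing once only a single Registry is held.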

The new arrangement supports server membership of only one cluster: AssociationImpl, rather than holding a reference to RegistryCollector, now holds a single Registry that stores the client mappings for the one cluster this server is a member of. Consequently, unlike before, we get no notification via the RegistryCollector when the last node leaves. We need to recreate this somehow.

We want to send out a CLUSTER_REMOVED message to all connected clients from the last node in the cluster when we are sure that the last node is in fact shutting down. Looking at the dependencies between the various services, we have EJBRemoteConnector -> AssociationService, so the EJBRemoteConnector will shut down before the AssociationService. Also, in its stop() method the EJBRemoteConnector shuts down the Endpoint for the Remoting connector, making all channel connections to clients inaccessible thereafter. So it seems the only reasonable place to send out this message is at the beginning of the stop() method of EJBRemoteConnector, while the channel connections are still open.
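
The ordering constraint can be sketched roughly as follows. This is an illustration only, not the change in the PR: ClientChannel, Connector, the lastNodeInCluster flag and the "CLUSTER_REMOVED:" string encoding are hypothetical stand-ins, and how the server decides that it is the last node is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShutdownOrderSketch {

    /** Hypothetical stand-in for an open channel to one connected client. */
    static final class ClientChannel {
        private boolean open = true;
        final List<String> received = new ArrayList<>();

        /** Delivers the message only while the channel is open. */
        boolean send(String message) {
            if (!open) {
                return false;
            }
            received.add(message);
            return true;
        }

        void close() {
            open = false;
        }
    }

    /** Hypothetical stand-in for the connector service that owns the endpoint. */
    static final class Connector {
        private final String clusterName;
        private final boolean lastNodeInCluster; // assumed to be known at stop time
        private final List<ClientChannel> channels;

        Connector(String clusterName, boolean lastNodeInCluster, List<ClientChannel> channels) {
            this.clusterName = clusterName;
            this.lastNodeInCluster = lastNodeInCluster;
            this.channels = channels;
        }

        void stop() {
            // Step 1: notify clients first, while their channels are still open.
            if (lastNodeInCluster) {
                for (ClientChannel channel : channels) {
                    channel.send("CLUSTER_REMOVED:" + clusterName);
                }
            }
            // Step 2: only then shut down the endpoint, closing every channel.
            for (ClientChannel channel : channels) {
                channel.close();
            }
        }
    }

    public static void main(String[] args) {
        ClientChannel channel = new ClientChannel();
        new Connector("ejb", true, Collections.singletonList(channel)).stop();
        System.out.println(channel.received);
    }
}
```

Swapping the two steps in stop() makes send() return false for every client, which is the failure mode the placement at the beginning of stop() is meant to avoid.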

I have implemented this and it works based on my limited testing. I'll push a PR for now so that I can get some feedback.

As Paul noted, this does not solve the problem in the case where the last node crashes (and does not get a chance to shut down cleanly); the only reasonable place to deal with this special case is on the client (somehow). For now though, I'm making this change available to handle the case of clean shutdown.

Also, I'll implement the missing suspend/resume behavior in another issue, as it's unrelated to this one. We should also probably revisit the limitation of single cluster membership sometime soon.


> Clustered SLSB membership anomalies when all cluster members removed
> --------------------------------------------------------------------
>
>                 Key: WFLY-11682
>                 URL: https://issues.jboss.org/browse/WFLY-11682
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering, EJB
>    Affects Versions: 15.0.1.Final
>         Environment: WildFly running in an n-node cluster with an EJB client sending requests even during the time the cluster is down.
>            Reporter: Jörg Bäsner
>            Assignee: Richard Achmatowicz
>            Priority: Major
>         Attachments: node1.txt, node12.txt, node2.txt, node3.txt, playground.zip
>
>
> This description is based on a 3-node cluster. Cluster nodes 1 and 2 are configured in the {{PROVIDER_URL}}; node 3 is not.
> The client has a custom ClusterNodeSelector implementation that prints the {{connectedNodes}} and the {{availableNodes}} and does random balancing.
> As long as all nodes are up and running, the client calls EJBs in a balanced way.
> When node1 is shut down, the client gets the notification below:
> {code}
> ...
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
> ...
> {code}
> Then node2 is shut down. Again the client gets the information, see:
> {code}
> ...
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node2)
> ...
> {code}
> Finally node3 is shut down. Now the client only gets the following information:
> {code}
> ...
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9) message for module /playground
> ...
> {code}
> This means the client is not informed that _node3_, the last node of the cluster, has been stopped.
> From this point on, the client always gets {{Caused by: java.net.ConnectException: Connection refused}}
> Now node1 is started again, resulting in the following output for {{connectedNodes}} and {{availableNodes}}:
> {code}
> ...
> INFO  (ThreadPoolTaskExecutor-1) [com.jboss.examples.ejb.CustomClusterNodeSelector] connectedNodes(1) '[node1]', availableNodes(2) '[node3, node1]'
> ...
> {code}



--
This message was sent by Atlassian Jira
(v7.12.1#712002)


