Richard Achmatowicz commented on WFLY-11682:
--------------------------------------------
I have had a look at the crash scenario. A discovery call consists of two parts: refreshing
the discovered node registry (DNR), then carrying out the search in the DNR. The refresh
attempts to create an authenticated connection, followed by a client channel, to each of the
configured connections (those listed in the client context configuration file) as well as to
any known URIs received as topology updates. While these connections and channels are being
attempted, the discovery provider keeps track of exceptions encountered during connection
establishment and during channel establishment. The ones of interest here are
java.net.ConnectException instances raised during connection establishment. If we keep track
of those during the refresh of the DNR, they could be used, before the search starts, to
remove any cluster node which is the only node in its cluster and which raised a connect
exception.
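A minimal sketch of this idea, under stated assumptions: the names Connector and DiscoveredNodeRegistrySketch are hypothetical and are not part of the EJB client library, and the registry is reduced to a plain map from cluster name to node URIs. It only illustrates recording java.net.ConnectException during the refresh and pruning sole-member clusters afterwards.

```java
import java.net.ConnectException;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for connection plus channel establishment. */
interface Connector {
    void connect(URI uri) throws Exception;
}

/** Hypothetical, simplified registry; not the real discovery provider. */
class DiscoveredNodeRegistrySketch {
    // cluster name -> (node name -> URI of that node)
    private final Map<String, Map<String, URI>> clusters = new HashMap<>();

    void addNode(String cluster, String node, URI uri) {
        clusters.computeIfAbsent(cluster, c -> new HashMap<>()).put(node, uri);
    }

    Set<String> nodesOf(String cluster) {
        Map<String, URI> nodes = clusters.get(cluster);
        return nodes == null ? Collections.emptySet() : new HashSet<>(nodes.keySet());
    }

    /** Attempts every URI, records the ones that refused, then prunes. */
    Set<URI> refresh(List<URI> uris, Connector connector) {
        Set<URI> refused = new HashSet<>();
        for (URI uri : uris) {
            try {
                connector.connect(uri);
            } catch (ConnectException e) {
                // Only connect failures are of interest for pruning.
                refused.add(uri);
            } catch (Exception e) {
                // Other failures are deliberately ignored in this sketch.
            }
        }
        // Drop a cluster whose single remaining node refused the connection.
        clusters.values().removeIf(nodes ->
                nodes.size() == 1 && refused.containsAll(nodes.values()));
        return refused;
    }
}
```

In this sketch a cluster with two or more registered nodes is left untouched even if one of them refuses the connection, which matches the "only node in the cluster" condition above.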
Clustered SLSB membership anomalies when all cluster members removed
--------------------------------------------------------------------
Key: WFLY-11682
URL: https://issues.jboss.org/browse/WFLY-11682
Project: WildFly
Issue Type: Bug
Components: Clustering, EJB
Affects Versions: 15.0.1.Final
Environment: WildFly running in an n-node cluster with an EJB client sending
requests even during the time the cluster is down.
Reporter: Jörg Bäsner
Assignee: Richard Achmatowicz
Priority: Major
Attachments: node1.txt, node12.txt, node2.txt, node3.txt, playground.zip
This description will be based on a 3 node cluster. Cluster node 1 and 2 are configured
in the {{PROVIDER_URL}}, node 3 is not.
The client has a custom ClusterNodeSelector implementation that prints the
{{connectedNodes}} and the {{availableNodes}} and balances randomly between nodes.
As long as all nodes are up and running the client is calling EJBs in a balanced way.
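The actual selector is in the attached playground.zip. For readers without the attachment, the following is only a sketch of what such a selector might look like. NodeSelector is a local stand-in that mirrors the shape of org.jboss.ejb.client.ClusterNodeSelector so the example compiles on its own; the class name LoggingRandomNodeSelector and the choice to pick from the available nodes are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

/** Local stand-in mirroring the selectNode method of ClusterNodeSelector. */
interface NodeSelector {
    String selectNode(String clusterName, String[] connectedNodes,
                      String[] totalAvailableNodes);
}

/** Hypothetical selector: logs both node lists, then picks at random. */
class LoggingRandomNodeSelector implements NodeSelector {
    private final Random random = new Random();

    @Override
    public String selectNode(String clusterName, String[] connectedNodes,
                             String[] totalAvailableNodes) {
        System.out.printf("connectedNodes(%d) '%s', availableNodes(%d) '%s'%n",
                connectedNodes.length, Arrays.toString(connectedNodes),
                totalAvailableNodes.length, Arrays.toString(totalAvailableNodes));
        // Assumption: at least one available node is passed in.
        return totalAvailableNodes[random.nextInt(totalAvailableNodes.length)];
    }
}
```

With a selector of this kind, a stale entry in the available nodes can be picked even though the node is down, which is why the contents of that list matter in the scenario that follows.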
When node1 is shut down, the client gets the notification below:
{code}
...
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-4) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received
CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received
CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node1)
...
{code}
Then node2 is shut down. Again the client gets the notification, see:
{code}
...
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received
CLUSTER_TOPOLOGY_NODE_REMOVAL(18) message for (cluster, node) = (ejb, node2)
...
{code}
Finally node3 is shut down. Now the client only gets the following information:
{code}
...
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
DEBUG (XNIO-1 task-1) [org.jboss.ejb.client.invocation] Received MODULE_UNAVAILABLE(9)
message for module /playground
...
{code}
This means the client is not informed that _node3_, the last node of the cluster, has been
stopped: no CLUSTER_TOPOLOGY_NODE_REMOVAL message is received for it.
From this point on, the client always gets {{Caused by: java.net.ConnectException:
Connection refused}}.
Now node1 is started again, resulting in the following output for {{connectedNodes}} and
the {{availableNodes}}:
{code}
...
INFO (ThreadPoolTaskExecutor-1) [com.jboss.examples.ejb.CustomClusterNodeSelector]
connectedNodes(1) '[node1]', availableNodes(2) '[node3, node1]'
...
{code}