Paul Ferraro created WFLY-12023:
-----------------------------------
Summary: Concurrent singleton service installation can cause service to run
simultaneously on 2 members.
Key: WFLY-12023
URL: https://issues.jboss.org/browse/WFLY-12023
Project: WildFly
Issue Type: Bug
Components: Clustering
Affects Versions: 16.0.0.Final
Reporter: Paul Ferraro
Assignee: Paul Ferraro
There exists a race condition between concurrent elections triggered by different nodes.
In general, only 1 node actually runs the election for a given set of singleton
candidates. During a deployment replace, there is a rapid series of changes to the
candidates as the deployment is stopped and restarted. While each node processes these
changes 1 at a time, this processing isn't synchronized across members. This is the root
of the problem, as a new election can be triggered on one node while another node is still
in the process of completing its election. Here's the scenario where I observed the race
condition:
Before deployment replace, server-one is the primary provider of the singleton service.
Each node undeploys its application, and restarts. As each node redeploys, and the
singleton service is reinstalled, each node registers itself as providing the singleton
service. The redeploy happens concurrently, but the registration order appears the same
on all nodes.
In this case, the registration order was server-three, server-two, server-one.
# server-three registers first, it elects itself and starts its service
# server-two registers next
## server-three defers election to server-two
## server-two runs the election:
### Elects itself
### Sends a synchronous service stop message to server-three
### Starts its service
# server-one registers next, while server-two is in the process of stopping the service on
server-three
## server-three defers election to server-one
## server-two is still in the election process, but will defer election to server-one once
complete
## server-one runs the election:
### Elects itself
### Sends service stop message to server-two and server-three
#### server-three is no longer running its service
#### server-two hasn't yet started its service, but it will soon (This is the
problem)
### server-one starts its service
### Meanwhile, server-two just received the response to its service stop request (2.B.II.)
and commences its own service start (2.B.III.)
Now server-one and server-two are both running the service.
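
The interleaving above can be replayed deterministically. The following is only a sketch of the ordering described in this issue, not WildFly code: the Node class and the replay() method are hypothetical names invented for illustration, and each remote message is modeled as a plain method call executed in the order it is observed in the scenario.

```java
import java.util.Set;
import java.util.TreeSet;

final class SingletonElectionRace {

    /** Hypothetical stand-in for a cluster member; not a WildFly class. */
    static final class Node {
        final String name;
        boolean running;

        Node(String name) {
            this.name = name;
        }

        void start() {
            running = true;
        }

        // Stopping a service that has not started yet changes nothing.
        void stop() {
            running = false;
        }
    }

    /** Replays the steps from the scenario and returns the nodes left running. */
    static Set<String> replay() {
        Node one = new Node("server-one");
        Node two = new Node("server-two");
        Node three = new Node("server-three");

        // Step 1: server-three registers first, elects itself, starts its service.
        three.start();

        // Step 2: server-two elects itself and sends a synchronous stop to server-three.
        // The stop is applied, but server-two has not yet seen the response,
        // so it has not started its own service.
        three.stop();

        // Step 3: server-one registers, elects itself, and stops the other two.
        three.stop(); // already stopped
        two.stop();   // has no effect: server-two has not started yet
        one.start();

        // Meanwhile, server-two receives the response from step 2 and starts anyway.
        two.start();

        Set<String> running = new TreeSet<>();
        for (Node node : new Node[] {one, two, three}) {
            if (node.running) {
                running.add(node.name);
            }
        }
        return running;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints [server-one, server-two]
    }
}
```

The sketch shows that the stop message from server-one arrives before server-two's pending start, so it has no effect, and nothing prevents server-two from completing the start it had already committed to.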