On 4/12/16 4:43 AM, Ryan Emerson wrote:
Overall looks good to me; however, I have a question about the automatic
failover use case: how do you intend to handle split-brain scenarios?
My basic thought was to require a quorum and if no quorum is available,
provide a degraded level of service.
A degraded level of service probably means no master. A domain can
function with no master. I can brainstorm about possible slight
enhancements to service beyond that, but I question whether they are
worth the effort, at least initially.
Example scenarios: You have a network of {Master HC, Slave1, Slave2,
Slave3, Slave4, Slave5} and the network splits into two partitions of {Master HC, Slave1,
Slave2} and {Slave3, Slave4, Slave5}. Or even three distinct partitions
of two nodes each.
I think we need a 3rd conceptual type -- a Potential Master. Not just
any HC can become master. It has to:
1) Be the latest version.
2) Be configured such that it's keeping a complete set of the domain
config and any domain managed content.
3) Be configured to use the group communication service used for leader
election.
4) Most likely it would also have specific config saying it can be a
master. I doubt this is something users will want to leave to chance.
So, electing a leader requires a quorum of Potential Masters.
If no additional provisions were added, how detrimental would it be
if two Master HCs were elected in distinct partitions and the partitions
then merged back into one network (leaving two Master HCs in the same domain)?
Two masters means two potentially inconsistent domain configurations
(i.e. domain.xml and content repo) are possible. We don't want that,
hence the quorum requirement.
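To make the quorum rule concrete, here is a minimal sketch of how a
partition could decide whether it is allowed to elect a master. This is
only an illustration of the idea; nothing like it exists in the POC, and
the class and method names are made up.

    import java.util.List;
    import java.util.Set;

    // Hypothetical illustration of the quorum rule; names are made up.
    final class MasterQuorum {

        // visible          = host controllers reachable in this partition
        // potentialMasters = all HCs configured as Potential Masters in the domain
        static boolean canElectMaster(Set<String> visible, List<String> potentialMasters) {
            long present = potentialMasters.stream().filter(visible::contains).count();
            // Strict majority: at most one partition can satisfy this, so two
            // partitions can never both elect a master at the same time.
            return present > potentialMasters.size() / 2;
        }
    }

In the six-node example above, if (say) three of the HCs were configured
as Potential Masters, a partition that can see only one of them fails the
check and runs degraded (masterless) until the partitions heal.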
A question is what slave HCs should do in the absence of a master. They
are isolated from control by a master, but don't know if there is still
a functioning set of DC+slaves out there, meaning the slaves may be
missing relevant config changes. Should they shut down, or keep going?
We already have this issue though, and we've elected to have the slaves
keep going, updating their config if they can reconnect to a master. We
chose to keep the appservers running, and not to have them be vulnerable
to problems with master-slave connectivity. Having autopromotion of a
new master makes it slightly more valid to just shut down, since going
masterless is less likely, but I still think it's not a good idea.
----- Original Message -----
From: "Brian Stansberry" <brian.stansberry(a)redhat.com>
To: wildfly-dev(a)lists.jboss.org
Sent: Monday, 11 April, 2016 5:57:59 PM
Subject: [wildfly-dev] Inter Host Controller group communication mesh
Just an FYI: I spent a couple days and worked up a POC[1] of creating a
JGroups-based reliable group communication mesh over the sockets our
Host Controllers use for intra-domain management communications.
Currently those sockets are used to form a tree of connections; master
HC to slave HCs and then HCs to their servers. Slave HCs don't talk to
each other. That kind of topology works fine for our current use cases,
but not for other use cases, where a full communication mesh is more
appropriate.
2 use cases led me to explore this:
1) A longstanding request to have automatic failover of the master HC to
a backup. There are different ways to do this, but group-communication-based
leader election is a possible solution. My preference, really. (A rough
sketch of that idea follows after this list.)
2) https://issues.jboss.org/browse/WFLY-1066, which has led to various
design alternatives, one of which is a distributed cache of topology
information, available via each HC. See [2] for some of that discussion.
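To illustrate what group-communication-based leader election (use case 1)
could look like, here is a rough sketch using the simplest JGroups
convention: the oldest member of the current view acts as the master.
This is not part of the POC; the class name and the promotion hook are
made up, and a real implementation would also need quorum handling to
guard against split brain.

    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    // Hypothetical sketch: the oldest member of the JGroups view acts as the
    // master HC. When the view changes (e.g. the current master dies), the
    // next-oldest member takes over.
    public class MasterElector extends ReceiverAdapter {

        private final JChannel channel;
        private volatile boolean master;

        public MasterElector(JChannel channel) {
            this.channel = channel;
            channel.setReceiver(this);
        }

        @Override
        public void viewAccepted(View view) {
            // JGroups orders view members by age; the first member is the coordinator.
            boolean nowMaster = view.getMembers().get(0).equals(channel.getAddress());
            if (nowMaster && !master) {
                promoteToMaster(); // made-up hook: take over domain controller duties
            }
            master = nowMaster;
        }

        private void promoteToMaster() {
            // Details omitted.
        }
    }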
I don't know if this kind of communication is a good idea, or if it's
the right solution to either of these use cases. Lots of things need
careful thought!! But I figured it was worth some time to experiment.
And it worked in at least a basic POC way, hence this FYI.
If you're interested in details, here are some Q&A:
Q: Why JGroups?
A: Because 1) I know it well 2) I trust it and 3) it's already used for
this kind of group communications in full WildFly.
Q: Why the management sockets? Why not other sockets?
A: Slave HCs already need configuration for how to discover the master.
Using the same sockets lets us reuse that discovery configuration for
the JGroups communications as well. If we're going to use this kind of
communication in a serious way, the configuration needs to be as easy
as possible.
Q: How does it work?
A: JGroups is based on a stack of "protocols" each of which handles one
aspect of reliable group communications. The POC creates and uses a
standard protocol stack, except it replaces two standard protocols with
custom ones:
a) JGroups has various "Discovery" protocols which are used to find
possible peers. I implemented one that integrates with the HC's domain
controller discovery logic. It's basically a copy of the oft-used
TCPPING protocol with about 10-15 lines of code changed.
b) JGroups has various "Transport" protocols which are responsible for
actually sending/receiving over the network. I created a new one of
those that knows how to use the WF management comms stuff built on JBoss
Remoting. JGroups provides a number of base classes to use in this
transport area, so I was able to rely on a lot of existing functionality
and could just focus on the details specific to this case.
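As a rough illustration of how such a stack might be assembled
programmatically, here is a sketch. It is not the POC's actual code;
DomainRemotingTransport and DomainDiscovery are stand-ins for the custom
transport and discovery protocols described above, and the rest is a
fairly standard JGroups protocol list.

    import org.jgroups.JChannel;
    import org.jgroups.protocols.FD_ALL;
    import org.jgroups.protocols.FD_SOCK;
    import org.jgroups.protocols.MERGE3;
    import org.jgroups.protocols.UNICAST3;
    import org.jgroups.protocols.VERIFY_SUSPECT;
    import org.jgroups.protocols.pbcast.GMS;
    import org.jgroups.protocols.pbcast.NAKACK2;
    import org.jgroups.protocols.pbcast.STABLE;

    public class DomainMeshChannel {
        public static void main(String[] args) throws Exception {
            // Protocols are listed bottom-up: transport first, membership last.
            JChannel channel = new JChannel(
                    new DomainRemotingTransport(), // b) sends/receives over the HC management sockets
                    new DomainDiscovery(),         // a) finds peers via the DC discovery config
                    new MERGE3(),                  // merges partitions back together
                    new FD_SOCK(),                 // failure detection
                    new FD_ALL(),
                    new VERIFY_SUSPECT(),
                    new NAKACK2(),                 // reliable, ordered multicast
                    new UNICAST3(),                // reliable unicast
                    new STABLE(),                  // message stability / garbage collection
                    new GMS());                    // group membership and view installation
            channel.connect("wildfly-domain");     // made-up cluster name
        }
    }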
Q: What have you done using the POC?
A: I created a master HC and a slave on my laptop and saw them form a
cluster and exchange messages. Typical stuff like starting and stopping
the HCs worked. I see no reason why having multiple slaves wouldn't have
worked too; I just didn't do it.
Q: What's next?
A: Nothing really. We have a couple concrete use cases we're looking to
solve. We need to figure out the best solution for those use cases. If
this kind of thing is useful in that, great. If not, it was a fun POC.
[1]
https://github.com/wildfly/wildfly-core/compare/master...bstansberry:jgro...
. See the commit message on the single commit to learn a bit more.
[2]
https://developer.jboss.org/wiki/ADomainManagedServiceRegistry
--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat