On 4/18/16 10:18 AM, Bela Ban wrote:
Hey Brian,
On 18/04/16 17:04, Brian Stansberry wrote:
> As an FYI, copying Bela Ban, who I stupidly forgot to copy on the first
> post. Sebastian kindly copied him on the other main branch of the thread.
Yes, I read that thread.
> Bela, tl;dr on this branch is it mostly discusses concerns about N^2 TCP
> connections in a possibly very large cluster. Whether the JGroups
> cluster would need to get very large depends on what use cases we used
> it to solve.
When I tested on a 2000+ node cluster running over TCP in Google Compute
Engine, TCP wasn't that much of an issue.
Good!
The main drawbacks were that
every node needed to have ~2000 connections open, which means 1 reader
thread running per connection. However, connections are closed after a
configurable idle time.
TCP_NIO2 is much better in that respect, as it gets rid of the reader
threads even while the connection is open.
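
(For concreteness, a minimal sketch of what switching the transport to
TCP_NIO2 looks like when building a stack programmatically. This assumes
the JGroups 3.x programmatic API; the protocol list and attribute values
are illustrative only.)

    import org.jgroups.JChannel;
    import org.jgroups.protocols.*;
    import org.jgroups.protocols.pbcast.*;
    import org.jgroups.stack.ProtocolStack;

    public class NioStackSketch {
        // Builds a channel on TCP_NIO2 instead of TCP, so idle connections
        // don't each pin a dedicated reader thread.
        static JChannel createChannel() throws Exception {
            JChannel ch = new JChannel(false);      // don't build the default stack
            ProtocolStack stack = new ProtocolStack();
            ch.setProtocolStack(stack);
            stack.addProtocol(new TCP_NIO2()
                     .setValue("bind_port", 7800)
                     .setValue("conn_expire_time", 120000L)) // close idle conns; value is illustrative
                 .addProtocol(new TCPPING())        // discovery; initial host list omitted here
                 .addProtocol(new MERGE3())
                 .addProtocol(new FD_ALL())
                 .addProtocol(new VERIFY_SUSPECT())
                 .addProtocol(new NAKACK2())
                 .addProtocol(new UNICAST3())
                 .addProtocol(new STABLE())
                 .addProtocol(new GMS());
            stack.init();
            return ch;
        }
    }
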
In the POC I did, the transport uses the NIO-based JBoss Remoting
infrastructure we already use for intra-domain communications. All
connections are created using a single Remoting Endpoint instance, which
in turn uses an XNIO worker. That worker is configured with two I/O
threads plus a task-handling pool of 5 to 10 threads. Those pool sizes
are just the settings that were already in the code for the existing
uses of the endpoint (CLI requests, intra-domain comms, etc.); when I
did the POC I spent zero time thinking about whether those settings are
still appropriate if we also add JGroups traffic to the mix.
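
(As a rough sketch, the worker setup described above amounts to something
like the following, using plain XNIO 3 APIs. The real code builds the
worker and Endpoint through the existing WildFly services rather than
directly like this.)

    import org.xnio.OptionMap;
    import org.xnio.Options;
    import org.xnio.Xnio;
    import org.xnio.XnioWorker;

    public class WorkerSketch {
        // An XNIO worker shaped like the one the POC inherits: two I/O threads
        // plus a small task pool. The Remoting Endpoint is then created on top
        // of a worker configured roughly like this.
        static XnioWorker createWorker() throws Exception {
            return Xnio.getInstance().createWorker(OptionMap.builder()
                    .set(Options.WORKER_IO_THREADS, 2)
                    .set(Options.WORKER_TASK_CORE_THREADS, 5)
                    .set(Options.WORKER_TASK_MAX_THREADS, 10)
                    .getMap());
        }
    }
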
For the existing management uses of the endpoint, most of the work
actually gets shunted off to threads from a separate pool. A management
request comes in, the XNIO worker threads handle the initial work of
reading it off the wire and writing the response, but the bulk of the
actual management work in between is done on another thread. I'd need
to refamiliarize myself with TP and the thread pools JGroups uses to see
if we get a similar effect with the JGroups communications. I have some
vague memories of up pools and down pools and OOB pools and .... ;) All
surely out of date.
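
(To illustrate the hand-off pattern I mean, with entirely made-up names:
the I/O thread does as little as possible and the real management work
runs on a separate pool.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class DispatchSketch {
        // Hypothetical sketch: the I/O thread only receives the request bytes,
        // then pushes the actual work onto a separate pool so it can get back
        // to reading/writing on the wire.
        private final ExecutorService managementPool = Executors.newFixedThreadPool(8);

        // Called on an XNIO I/O thread.
        void onRequest(final byte[] rawRequest) {
            managementPool.execute(new Runnable() {
                public void run() {
                    handle(rawRequest);   // the management operation runs here
                }
            });
        }

        void handle(byte[] rawRequest) {
            // parse the operation, execute it, write the response back
        }
    }
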
The other option is to use UDP without multicasting, i.e. ip_mcast=false.
This would not create N-1 connections (and possibly N-1 reader threads
and sockets), but only 2 sockets (a constant) and no reader threads. A
message would still need to be sent N-1 times though, creating increased
traffic.
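
(In stack-building terms that is just the transport's ip_mcast flag; a
minimal sketch under the same JGroups 3.x assumptions as above:)

    import org.jgroups.protocols.UDP;
    import org.jgroups.stack.Protocol;

    public class NoMcastSketch {
        // UDP transport with IP multicasting turned off: still a constant two
        // sockets per node, but each "multicast" goes out as N-1 unicast
        // datagrams instead of one multicast packet.
        static Protocol udpWithoutMulticast() {
            return new UDP().setValue("ip_mcast", false);
        }
    }
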
I don't think we could do that, at least not with this approach using
the existing JBoss Remoting server sockets. That's all TCP based.
A potential solution for going from N-1 to a constant number of
connections/threads would be daisy chaining, where you only connect to
your neighbor and a multicast is basically one round across the logical
overlay; see [1] for details. I'd have to revisit this protocol though
if you wanted to use it, so let me know ASAP so I can include it in the
roadmap.
Ok, good to know. Will do.
Cheers,
[1]
http://belaban.blogspot.ch/2010/08/daisychaining-in-clouds.html
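
(For orientation, the daisy chaining protocol described in [1] sits
directly above the transport in the stack; a rough sketch, assuming the
experimental DAISYCHAIN protocol is still available:)

    import org.jgroups.protocols.DAISYCHAIN;
    import org.jgroups.protocols.TCP;
    import org.jgroups.stack.ProtocolStack;

    public class DaisyChainSketch {
        // DAISYCHAIN goes right above the transport, so a multicast travels one
        // round across the logical ring instead of being sent N-1 times by the
        // original sender.
        static ProtocolStack transportWithDaisyChain() {
            return new ProtocolStack()
                    .addProtocol(new TCP())
                    .addProtocol(new DAISYCHAIN());
            // ...then discovery, NAKACK2, UNICAST3, GMS, etc. as usual
        }
    }
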
> On 4/11/16 4:20 PM, Brian Stansberry wrote:
>> On 4/11/16 3:43 PM, Ken Wills wrote:
>>>
>>>
>>> On Mon, Apr 11, 2016 at 11:57 AM, Brian Stansberry
>>> <brian.stansberry@redhat.com> wrote:
>>>
>>> Just an FYI: I spent a couple days and worked up a POC[1] of creating a
>>> JGroups-based reliable group communication mesh over the sockets our
>>> Host Controllers use for intra-domain management communications.
>>>
>>>
>>> Nice! I've been thinking about the mechanics of this a bit recently, but
>>> I hadn't gotten to any sort of transport details; this looks interesting.
>>>
>>> Currently those sockets are used to form a tree of connections; master
>>> HC to slave HCs and then HCs to their servers. Slave HCs don't talk to
>>> each other. That kind of topology works fine for our current use cases,
>>> but not for other use cases, where a full communication mesh is more
>>> appropriate.
>>>
>>> 2 use cases led me to explore this:
>>>
>>> 1) A longstanding request to have automatic failover of the master HC to
>>> a backup. There are different ways to do this, but group communication
>>> based leader election is a possible solution. My preference, really.
>>>
>>>
>>> I'd come to the same conclusion of it being an election. A deterministic
>>> election algorithm, perhaps allowing the configuration to supply some
>>> sort of weighted value to influence the election on each node, maybe
>>> analogous to how the SMB master browser election works (version + weight
>>> + etc.).
>>
>> Yep.
>>
>> For sure the master must be running the latest version.
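
(A sketch of the kind of deterministic election being discussed; the
class and fields are entirely hypothetical. The point is just that if
every member sees the same candidate list and applies the same ordering,
they all independently pick the same DC without extra messaging.)

    import java.util.Comparator;
    import java.util.List;

    public class ElectionSketch {
        static class Candidate {
            final String name;
            final int managementVersion;   // newest version wins first
            final int weight;              // admin-configured tie-breaker
            Candidate(String name, int managementVersion, int weight) {
                this.name = name;
                this.managementVersion = managementVersion;
                this.weight = weight;
            }
        }

        // Every member runs this over the same candidate list (e.g. derived from
        // the current group view) and therefore elects the same master.
        static Candidate electMaster(List<Candidate> candidates) {
            return candidates.stream()
                    .max(Comparator.comparingInt((Candidate c) -> c.managementVersion)
                            .thenComparingInt(c -> c.weight)
                            .thenComparing(c -> c.name))     // stable final tie-breaker
                    .orElseThrow(IllegalStateException::new);
        }
    }
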
>>
>>>
>>>
>>> 2) https://issues.jboss.org/browse/WFLY-1066, which has led to various
>>> design alternatives, one of which is a distributed cache of topology
>>> information, available via each HC. See [2] for some of that discussion.
>>>
>>> I don't know if this kind of communication is a good idea, or if it's
>>> the right solution to either of these use cases. Lots of things need
>>> careful thought!! But I figured it was worth some time to experiment.
>>> And it worked in at least a basic POC way, hence this FYI.
>>>
>>>
>>> Not knowing a lot about JGroups... for very large domains, is the mesh
>>> NxN in size?
>>
>> Yes.
>>
>>> For thousands of nodes would this become a problem,
>>
>> It's one concern I have, yes. There are large JGroups clusters, but they
>> may be based on the UDP multicast transport JGroups offers.
>>
>>> or would a mechanism to segment into local groups perhaps work, with
>>> only certain nodes participating in the mesh and being eligible for
>>> election?
>>
>>
>> For sure we'd have something in the host.xml that controls whether a
>> particular HC joins the group.
>>
>> I don't think this is a big problem for the DC election use case, as you
>> don't need a large number of HCs in the group. You'd have a few
>> "potential" DCs that could join the group, and the remaining slaves
>> don't need to.
>>
>> For use cases where you want slave HCs to be in the cluster though, it's
>> a concern. The distributed topology cache thing may or may not need
>> that. It needs a few HCs to provide HA, but those could be the same ones
>> that are "potential" HCs. But if only a few are in the group, the
>> servers need to be told how to reach those HCs. Chicken and egg, as the
>> point of the topology cache is to provide that kind of data to servers!
>> If a server's own HC is required to be a part of the group though, that
>> helps cut through the chicken/egg problem.
>>
>>
>>> Ken
>>>
>>
>>
>
>
--
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat