[mod_cluster-dev] Handling crashed/hung AS nodes
Paul Ferraro
paul.ferraro at redhat.com
Mon Mar 30 11:29:37 EDT 2009
On Fri, 2009-03-27 at 16:23 -0500, Brian Stansberry wrote:
> Paul Ferraro wrote:
> > On Fri, 2009-03-27 at 09:15 -0500, Brian Stansberry wrote:
> >> jean-frederic clere wrote:
> >>> Paul Ferraro wrote:
> >>>> Currently, the HAModClusterService (where httpd communication is
> >>>> coordinated by an HA singleton) does not react to crashed/hung members.
> >>>> Specifically, when the HA singleton gets a callback that the group
> >>>> membership changes, it does not send any REMOVE-APP messages to httpd on
> >>>> behalf of the member that just left. Currently, httpd will detect the
> >>>> failure (via a disconnected socket) on its own and set its internal
> >>>> state accordingly, e.g. a STATUS message will return NOTOK.
> >>>>
> >>>> The non-handling of dropped members is actually a good thing in the
> >>>> event of a network partition, where communication between nodes is lost,
> >>>> but communication between httpd and the nodes is unaffected. If we were
> >>>> handling dropped members, we would have to handle the ugly scenario
> >>>> described here:
> >>>> https://jira.jboss.org/jira/browse/MODCLUSTER-66
> >>>>
> >>>> Jean-Frederic: a few questions...
> >>>> 1. Is it up to the AS to drive the recovery of a NOTOK node when it
> >>>> becomes functional again?
> >>> Yes.
> >>>
> >>>> In the case of a crashed member, fresh
> >>>> CONFIG/ENABLE-APP messages will be sent upon node restart. In the case
> >>>> of a re-merged network partition, no additional messages are sent. Is
> >>>> the subsequent STATUS message (with a non-zero lbfactor) enough to
> >>>> trigger the recovery of this node?
> >>> Yes.
> >
> > Good to know.
> >
> >>>> 2. Can httpd detect hung nodes? A hung node will not affect the
> >>>> connected state of the AJP/HTTP/S connector - it could only detect this
> >>>> by sending data to the connector and timing out on the response.
> >>> The hung node will be detected and marked as broken, but the
> >>> corresponding request(s) may be delayed or lost due to the timeout.
> >>>
> >> How long does this take, say in a typical case where the hung node was
> >> up and running with a pool of AJP connections open? Is it the 10 secs,
> >> the default value of the "ping" property listed at
> >> https://www.jboss.org/mod_cluster/java/properties.html#proxy ?
> >
> > I think he's talking about "nodeTimeout".
> >
>
> There's also the mod_proxy ProxyTimeout directive, which might also
> apply since mod_proxy is what is being used. From the mod_proxy docs:
>
> "This directive allows a user to specify a timeout on proxy requests.
> This is useful when you have a slow/buggy appserver which hangs, and you
> would rather just return a timeout and fail gracefully instead of
> waiting however long it takes the server to return."
>
> Default there is 300 seconds!!
>
> It would be good to beef up the docs of this quite a bit. From the
> mod_jk docs at
> http://tomcat.apache.org/connectors-doc/reference/workers.html I can get
> a pretty good idea of all the details of how mod_jk works in this area.
> Not so much from
> https://www.jboss.org/mod_cluster/java/properties.html#proxy
> particularly when I factor in that it's mod_proxy/mod_proxy_ajp that's
> actually handling requests.
Agreed.
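For what it's worth, the httpd-side timeout can be pinned down explicitly rather than relying on the 300-second default; a minimal sketch (the value is illustrative, not a recommendation):

```apache
# mod_proxy: fail a request to a hung backend after 30s instead of
# waiting out the 300s ProxyTimeout default
ProxyTimeout 30
```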
> >> Also, if a request is being handled by a hung node and the
> >> HAModClusterService tells httpd to stop that node, the request will
> >> fail, yes? It shouldn't just fail over, as it may have already caused
> >> the transfer of my $1,000,000 to my secret account at UBS. Failing over
> >> would cause transfer of a second $1,000,000 and sadly I don't have that
> >> much.
> >
> > Not unlike those damn double-clickers...
> >
>
> Ah, that's what happened to my second $1,000,000! Thanks!
>
> >>>> And some questions for open discussion:
> >>>> What does HAModClusterService really buy us over the normal
> >>>> ModClusterService? Do the benefits outweigh the complexity?
> >>>> * Maintains a uniform view of proxy status across each AS node
> >>>> * Can detect and send STOP-APP/REMOVE-APP messages on behalf of
> >>>> hung/crashed nodes (if httpd cannot already do this) (not yet
> >>>> implemented)
> >>>> + Requires special handling of network partitions
> >>>> * Potentially improve scalability by minimizing network traffic for
> >>>> very large clusters.
> >> Assume a near-term goal is to run a 150 node cluster with say 10 httpd
> >> servers. Assume the background thread runs every 10 seconds. That comes
> >> to 150 connections per second across the cluster being opened/closed to
> >> handle STATUS. Each httpd server handles 15 connections per second.
> >>
> >> With HAModClusterService the way it is now, you get the same, because
> >> besides STATUS each node also checks its ability to communicate w/ each
> >> httpd in order to validate its ability to become master. But let's
> >> assume we add some complexity to allow that health check to become much
> >> more infrequent. So ignore those ping checks. So, w/ HAModClusterService
> >> you get 1 connection/sec being opened/closed across the cluster for
> >> status, 0.1 connection/sec per httpd. But the STATUS request sent
> >> across each connection has a much bigger payload.
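To make the arithmetic above concrete, a quick back-of-the-envelope check (topology assumed from the discussion: 150 nodes, 10 httpd proxies, 10-second status interval):

```python
# STATUS connection rates for the assumed topology.
NODES = 150
PROXIES = 10
INTERVAL_S = 10

# Non-HA: every node opens a connection to every proxy each interval.
total_rate = NODES * PROXIES / INTERVAL_S  # connections/sec across the cluster
per_proxy = total_rate / PROXIES           # connections/sec per httpd

# HA singleton (ignoring the non-master ping checks): only the master
# talks to the proxies.
ha_total_rate = 1 * PROXIES / INTERVAL_S
ha_per_proxy = ha_total_rate / PROXIES

print(total_rate, per_proxy)        # 150.0 15.0
print(ha_total_rate, ha_per_proxy)  # 1.0 0.1
```

Of course, if the connections are kept open across proxy messages, the open/close churn disappears and only the message rate (and payload size) remains.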
> >
> > True, the STATUS request body is larger than the INFO request (no body),
> > but the resulting STATUS-RSP is significantly smaller than the
> > corresponding INFO-RSP.
> >
>
> Ah, we're still using INFO for the connectivity checks? That's not good;
> for sure we'd want to make that happen less often.
Yes - we need the INFO-RSP data to determine whether or not the proxy
requires resetting.
> I wasn't clear; what I was driving at was the STATUS request from
> HAModClusterService covering 150 nodes has a bigger payload than any
> single node's STATUS; i.e. there's no magic reduction in data. The
> efficiency would be in getting rid of opening/closing lots of connections.
These connections stay open across proxy messages - so there's no cost
there.
> >> How significant is the cost of opening/closing all those connections?
> >>
> >>>> e.g. non-masters ping httpd less often
> >>>> * Anything else?
> >> 1) Management?? You don't want to have to interact with every node to do
> >> management tasks (e.g. disable app X on domain A to drain sessions so we
> >> can shut down the domain.) A node having a complete view might allow
> >> more sophisticated management operations. This takes more thought,
> >> though.
> >
> > Good point.
> >
> >> 2) The more sophisticated case discussed on
> >> https://jira.jboss.org/jira/browse/MODCLUSTER-66, where a primary
> >> partition approach is appropriate rather than letting minority
> >> subpartitions continue to live. But TBH mod_cluster might not be the
> >> right place to handle this. Probably more appropriate is to have
> >> something associated with the webapp itself determine it is in a
> >> minority partition and undeploy the webapp if so. Whether being in a
> >> minority partition is inappropriate for a particular webapp is beyond
> >> scope for mod_cluster.
> >>
> >> I'd originally thought the HA version would add benefit to the load
> >> balance factor calculation, but that was wrong-headed.
> >
> > I would argue that there is some (albeit small) value to the load being
> > requested on each node at the same time. I would expect this to result
> > in slightly less load swinging than if individual nodes calculated their
> > load at different times, scattered across the status interval.
> >
> >>> Well, I prefer the Java code deciding if a node is broken rather
> >>> than httpd. I really want to keep the complexity in httpd to a
> >>> minimum, and the talks I have had so far at ApacheCon seem to show
> >>> that is probably the best way to go.
> >>>
> >> Agreed that httpd should be as simple as possible. But to handle the
> >> non-HAModClusterService case it will need to at least detect broken
> >> connections and basic response timeouts, right? So depending on how long
> >> it takes to detect hung nodes, httpd might be detecting them before
> >> HAModClusterService. I'm thinking of 3 scenarios:
> >>
> >> 1) Node completely crashes. HAModClusterService will detect this almost
> >> immediately; I'd think httpd would as well unless it just happened to
> >> not have connections open.
> >>
> >> 2) Some condition that causes the channel used by HAModClusterService to
> >> not process messages. This will lead to the node being suspected after
> >> 31.5 seconds with the default channel config. But httpd might detect a
> >> timeout faster than that?
> >>
> >> 3) Some condition that causes all the JBoss Web threads to block but
> >> doesn't impact the HAModClusterService channel. (QE's Radoslav Husar,
> >> Bela and I are trying to diagnose such a case right now.) Only httpd
> >> will detect this; JGroups will not. We could add some logic in JBoss Web
> >> that would allow it to detect such a situation and then let
> >> (HA)ModClusterService disable the node. But non-HA ModClusterService
> >> could do that just as well as the HA version.
> >
> > I imagine this is not uncommon, e.g. overloaded/deadlocked database
> > causing all application threads to wait.
> >
> >> Out of these 3 cases, HAModClusterService does a better job than httpd
> >> itself only in the #2 case, and there only if it takes httpd > 31.5 secs
> >> to detect a hung response.
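As an aside on the 31.5 seconds in case #2: that figure presumably falls out of the default failure-detection stack. A sketch of the arithmetic, assuming the usual AS channel defaults of FD timeout=6000ms with max_tries=5, followed by a VERIFY_SUSPECT timeout of 1500ms (these values are assumptions; check the actual jgroups config):

```python
# Assumed JGroups failure-detection settings; verify against your stack.
fd_timeout_ms = 6000       # FD: per-attempt heartbeat timeout
fd_max_tries = 5           # FD: missed heartbeats before SUSPECT
verify_suspect_ms = 1500   # VERIFY_SUSPECT: double-check window

suspicion_s = (fd_timeout_ms * fd_max_tries + verify_suspect_ms) / 1000
print(suspicion_s)  # 31.5
```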
> >
> > Although, for case #2, using the plain non-HA ModClusterService avoids
> > the problem entirely.
> >
>
> I wasn't clear again. :) For #2 I meant some general problem w/ the node
> (e.g. OOM) that affects the JGroups channel and the connectors. Then
> HAModClusterService has a benefit only if it detects that faster than
> httpd. If *only* the JGroups channel is disrupted, yeah, that's the
> same as the network partition problem you raised.
>
> >>> Cheers
> >>>
> >>> Jean-Frederic
> >>>
> >>>> Paul
> >>>>
> >>>> _______________________________________________
> >>>> mod_cluster-dev mailing list
> >>>> mod_cluster-dev at lists.jboss.org
> >>>> https://lists.jboss.org/mailman/listinfo/mod_cluster-dev
> >>>>
> >>
> >
>
>