[mod_cluster-issues] [JBoss JIRA] Commented: (MODCLUSTER-66) HAModClusterService needs to handle cluster splits

Wed Aug 5 10:50:29 EDT 2009

    [ https://jira.jboss.org/jira/browse/MODCLUSTER-66?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12479051#action_12479051 ] 

Bela Ban commented on MODCLUSTER-66:
------------------------------------

I wanted to reiterate the importance of [1] for the next release of mod-cluster.

I'm constantly running into this when I deploy httpd/mod-cluster and JBoss 5.1.0 on Amazon's EC2 cloud. The easy way to stop an instance (OS + JBoss) on EC2 is to 'terminate' it via the AWS Console, which shuts down the OS ("shutdown -h now").

Unfortunately, our AMIs only had an S98jboss in /etc/rc4.d for starting JBoss, but no corresponding K98jboss for stopping it *gracefully* on shutdown. Therefore the process was always killed via -9.

This caused very long timeouts and 5XX HTTP responses, until httpd/mod-cluster finally figured out that the worker had crashed and failed over to a different worker.

As a workaround, I created a K98jboss link so now JBoss is shut down gracefully when the host is terminated.

However, I figure we can get into this situation in many different ways, e.g.

    * Not providing a K98jboss script on EC2
    * Killing JBoss with -9 via a script (I've seen this more than once)
    * Pulling a blade out of the rack. A crude way of shutting down an instance, but that's normal in large clusters!

[1] https://jira.jboss.org/jira/browse/MODCLUSTER-66

> HAModClusterService needs to handle cluster splits
> --------------------------------------------------
>
>                 Key: MODCLUSTER-66
>                 URL: https://jira.jboss.org/jira/browse/MODCLUSTER-66
>             Project: mod_cluster
>          Issue Type: Task
>    Affects Versions: 1.0.0.Beta4
>            Reporter: Brian Stansberry
>            Assignee: Paul Ferraro
>
> The case where a split of the JGroups group occurs but nodes are still able to contact the httpd servers needs to be handled. There is a brief discussion of this on https://www.jboss.org/community/docs/DOC-11431 under "Split-Brain Syndrome".  Problem is split-brain will result in nodes removing each other from httpd, resulting in no nodes active.
> The wiki page describes a simple approach. A more complex approach would be to take a "primary partition" approach, whereby say an initial cluster of size n==6 {A, B, C, D, E, F} splits into two clusters {A, B, C, D} and {E, F}. To continue to handle requests a partition would need to have at least Math.floor((float) n / 2 + 1) members.
> What kind of approach is appropriate would probably depend on the deployed webapps and how they interact with the cluster. If there is no clustered state that can become inconsistent across the cluster split, the simple approach described on the wiki can work fine (an HAModClusterService master doesn't disable a node if httpd reports it is still available).  If there is shared state that needs to remain consistent (e.g. a clustered Hibernate Second Level Cache) then primary partition works better.
> Most likely this overall problem will be resolved in stages, e.g. the simple approach from the wiki first.
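
For illustration only, the "primary partition" rule from the issue description can be sketched as below. This is not mod_cluster's implementation; the class and method names (PrimaryPartition, quorum, isPrimary) are hypothetical, and only the Math.floor expression is taken from the description.

```java
public class PrimaryPartition {

    // Minimum number of members a partition needs in order to keep
    // handling requests, using the formula from the issue description.
    static int quorum(int initialClusterSize) {
        return (int) Math.floor((float) initialClusterSize / 2 + 1);
    }

    // Hypothetical helper: a partition is "primary" if it still holds
    // at least a quorum of the initial cluster.
    static boolean isPrimary(int partitionSize, int initialClusterSize) {
        return partitionSize >= quorum(initialClusterSize);
    }

    public static void main(String[] args) {
        int n = 6; // initial cluster {A, B, C, D, E, F}
        System.out.println("quorum = " + quorum(n));
        System.out.println("{A, B, C, D} primary: " + isPrimary(4, n));
        System.out.println("{E, F} primary: " + isPrimary(2, n));
    }
}
```

With n == 6 the quorum is 4, so {A, B, C, D} keeps serving and {E, F} does not. Note that an even 3/3 split leaves no partition with a quorum, so under this rule neither side would handle requests.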

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira