[jboss-dev-forums] [Design of Messaging on JBoss (Messaging/JBoss)] - Re: split-brain between live and backup node

jmesnil do-not-reply at jboss.com
Thu Feb 12 10:20:16 EST 2009


moving forward...

The test SplitBrainTest.testDemonstrateSplitBrain shows how to reach a split-brain state where the same messages are consumed by 2 different consumers.

To prevent this split-brain, where the live node remains active once the backup node has been activated, the strategy would be:


when the live node loses its replicating connection
  - this can be because the backup node has been activated or has crashed, or because the network is cut b/w the live and backup nodes
  - to check whether it is isolated, the live node sends a message to the other nodes
    - if it reaches the quorum, it stays alive
    - else, it has been cut off from both the backup and the other cluster nodes, so it kills itself => the backup is the only active node
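
The check above could be sketched in Java roughly as follows. This is only an illustration, not JBoss Messaging code: the class IsolationCheck, the Decision enum, the ping predicate and the requiredAcks parameter are all hypothetical names, and how many replies make up the quorum is deliberately left as a parameter.

```java
import java.util.List;
import java.util.function.Predicate;

public class IsolationCheck {

    // Hypothetical outcome of the check run by the live node.
    enum Decision { STAY_ALIVE, KILL_SELF }

    /**
     * Called when the live node loses its replicating connection.
     * 'ping' is an assumed callback that returns true if the given
     * node answered the message; 'requiredAcks' is the number of
     * replies needed to consider the quorum reached.
     */
    static <N> Decision onReplicatingConnectionLost(List<N> otherNodes,
                                                    Predicate<N> ping,
                                                    int requiredAcks) {
        int acks = 0;
        for (N node : otherNodes) {
            if (ping.test(node)) {
                acks++;
            }
        }
        // Quorum reached: stay alive. Otherwise the live node is
        // considered isolated and must kill itself.
        return acks >= requiredAcks ? Decision.STAY_ALIVE : Decision.KILL_SELF;
    }
}
```

In a real server the ping would be asynchronous with a timeout; the synchronous loop is only there to keep the sketch short.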


However, this won't solve the split-brain which may occur when the network is cut between the live & backup nodes but the live node remains connected to other cluster nodes.
In that case, the live node will reach the quorum and remain active while the backup node has also been activated.

What is the required quorum?

The simplest solution is to have a majority of members; the members being:
  - the live node
  - the backup node
  - the other live nodes of the cluster
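
As a worked example of the majority rule, here is a small Java sketch. The class and method names are made up for illustration, and it assumes that the live node counts itself as a reachable member, which the strategy above does not specify.

```java
public class MajorityQuorum {

    // Members are: the live node, its backup, and the other live nodes.
    static int clusterSize(int otherLiveNodes) {
        return 2 + otherLiveNodes;
    }

    // Strict majority: more than half of the members.
    static int majority(int members) {
        return members / 2 + 1;
    }

    /**
     * 'responders' is the number of members (other than the live node
     * itself) that answered. Assumption: the live node counts itself.
     */
    static boolean liveNodeHasQuorum(int otherLiveNodes, int responders) {
        int reachable = 1 + responders;
        return reachable >= majority(clusterSize(otherLiveNodes));
    }
}
```

With 3 other live nodes there are 5 members and the majority is 3, so the live node needs 2 replies. With no other live nodes the members are just the live and backup nodes, the majority is 2, and a live node that cannot reach its backup can never get there on its own.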

Given the special relation between the backup and the live node, the live node should pay special attention to a response from the backup:

 - if the backup does not reply => the network is still cut between the live and backup nodes, or the backup node has crashed
 - if it replies => the network failure was transient. In that case, the backup response should include an "active" boolean
   - if the backup node is active, the live node should kill itself
   - else, the live node can continue to live (and perhaps it can also reopen its replicating connection to the backup)
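
The handling of the backup's response could look like the Java sketch below. The names are hypothetical, and the reply is modelled as an Optional<Boolean> carrying the "active" flag (empty when the backup did not answer). What the live node should do when the backup stays silent is not decided above, so the sketch only reports that case.

```java
import java.util.Optional;

public class BackupReplyHandler {

    // Hypothetical actions for the live node.
    enum Action { KILL_SELF, CONTINUE_AND_MAYBE_RECONNECT, BACKUP_UNREACHABLE }

    /**
     * 'backupActive' is empty if the backup did not reply; otherwise it
     * holds the "active" boolean from the backup's response.
     */
    static Action onBackupReply(Optional<Boolean> backupActive) {
        if (!backupActive.isPresent()) {
            // Network still cut, or backup crashed.
            return Action.BACKUP_UNREACHABLE;
        }
        if (backupActive.get()) {
            // The backup has taken over: the live node must stop.
            return Action.KILL_SELF;
        }
        // Transient failure: keep running, possibly reopening the
        // replicating connection to the backup.
        return Action.CONTINUE_AND_MAYBE_RECONNECT;
    }
}
```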


If there are no other nodes in the cluster, we can't apply this strategy.

Another thing worth mentioning: the live & backup should be on the same LAN while the other cluster nodes may be on a WAN.

To sum up, I need to think about it more...


View the original post : http://www.jboss.org/index.html?module=bb&op=viewtopic&p=4209575#4209575