I think the primary partition approach is best. Having caches outside the primary partition
purge their in-memory state is probably the wrong path though: as a generic solution, not
all installations will be backed by shared databases, so purged state could not always be reloaded.
Caches shutting down would be my preferred option. Perhaps block for a short period,
hoping the network would heal, and then throw an exception after a timeout. Perhaps a
specific exception - SplitBrainException or something - so that cache users such as HTTP
Replication can react by forcing an HTTP response like 410 (don't know if this is
possible - Brian?) such that the load balancer will treat the node as unavailable. Once
the partition heals, the cache performs a state transfer to catch up with the primary
partition and is then made available to requests again.
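
To make the block-then-fail idea concrete, here is a rough sketch of the behaviour described above. Everything in it is an assumption for illustration: PartitionGate, partitionLost(), partitionHealed() and awaitAvailable() are made-up names, and SplitBrainException is only the name floated in this post, not an existing JBoss Cache class. How the node learns that it has left the primary partition is left out entirely.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical exception; the name is only the one suggested in this post. */
class SplitBrainException extends RuntimeException {
    SplitBrainException(String message) { super(message); }
}

/** Hypothetical gate that cache operations would pass through before touching state. */
class PartitionGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition healed = lock.newCondition();
    private final long timeoutMillis;
    private boolean inPrimaryPartition = true;

    PartitionGate(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Assumed to be called by membership logic when this node leaves the primary partition. */
    void partitionLost() {
        lock.lock();
        try { inPrimaryPartition = false; } finally { lock.unlock(); }
    }

    /** Runs the state transfer first, then lets waiting and new callers through. */
    void partitionHealed(Runnable stateTransfer) {
        lock.lock();
        try {
            stateTransfer.run();
            inPrimaryPartition = true;
            healed.signalAll();
        } finally { lock.unlock(); }
    }

    /** Blocks for up to timeoutMillis hoping the network heals, then throws. */
    void awaitAvailable() {
        lock.lock();
        try {
            long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (!inPrimaryPartition) {
                if (remainingNanos <= 0) {
                    throw new SplitBrainException(
                        "not in primary partition after " + timeoutMillis + " ms");
                }
                // awaitNanos returns the time left, so spurious wakeups do not extend the wait.
                remainingNanos = healed.awaitNanos(remainingNanos);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new SplitBrainException("interrupted while waiting for partition to heal");
        } finally {
            lock.unlock();
        }
    }
}
```

A cache user such as HTTP replication would call awaitAvailable() before each operation and turn a SplitBrainException into whatever "node unavailable" signal the load balancer understands. The sketch holds the lock during the state transfer, which keeps it short but means new callers simply queue until the transfer finishes.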
Even the impact of incorrectly identifying a primary partition is low, since in the worst
case the larger partition is unresponsive while the smaller one keeps serving requests. I
guess the real problem is more than one partition thinking it is primary. :-)
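
On the "more than one primary" worry: one common rule (not necessarily what is being proposed in this thread) is to call a partition primary only if it holds a strict majority of the last known cluster membership. Two disjoint partitions cannot both hold a strict majority, so at most one side considers itself primary. A minimal sketch, with PrimaryPartitionRule and its parameters being made-up names:

```java
/** Hypothetical helper; assumes every node knows the last agreed cluster size. */
final class PrimaryPartitionRule {
    private PrimaryPartitionRule() {}

    /** True only when this side can reach strictly more than half of the cluster. */
    static boolean isPrimary(int reachableMembers, int lastKnownClusterSize) {
        return 2L * reachableMembers > lastKnownClusterSize;
    }
}
```

The trade-off is that an even split (say 2 and 2 out of 4) leaves no primary at all, so both halves would go unavailable until the network heals.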