A tip for those who come this way after us: we found that a large part of the problem was
that the cluster nodes rely on being in constant communication with each other.
If one of them is under high load (say, running some reports), its CPU usage
may be so high that it does not respond to the cluster ping quickly enough (within 3 seconds).
The cluster then treats it as dead and removes it from the cluster, even though it is not
dead, just busy.
We increased the org.jgroups.protocols.pbcast.GMS timeout and it helped a great deal.
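
To illustrate why a longer timeout helps, here is a rough sketch of the check described above. It is not JGroups code and does not use the JGroups API; the class name, node names, and reply delays are all made up for the example. The only figure taken from our setup is the 3-second limit, and the 10-second value is just an arbitrary larger timeout for comparison.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy model of a ping-timeout check, not JGroups itself.
public class HeartbeatCheck {

    // A node is suspected (treated as dead) when its reply to the
    // cluster ping takes longer than the configured timeout.
    static boolean isSuspected(long replyDelayMs, long timeoutMs) {
        return replyDelayMs > timeoutMs;
    }

    public static void main(String[] args) {
        // Hypothetical reply delays: node-b is alive but busy running reports.
        Map<String, Long> replyDelaysMs = new LinkedHashMap<>();
        replyDelaysMs.put("node-a", 50L);
        replyDelaysMs.put("node-b", 4500L);

        // Compare the 3-second limit with an arbitrary, larger timeout.
        for (long timeoutMs : new long[] {3000L, 10000L}) {
            for (Map.Entry<String, Long> entry : replyDelaysMs.entrySet()) {
                boolean suspected = isSuspected(entry.getValue(), timeoutMs);
                System.out.println("timeout=" + timeoutMs + "ms " + entry.getKey()
                        + (suspected ? " -> removed from cluster" : " -> kept"));
            }
        }
    }
}
```

With the 3-second limit the busy node is dropped even though it would have answered; with the larger timeout it stays in the cluster. The trade-off is that a node that really has died also takes longer to be noticed.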