The failure detection protocol running on another node decides that the node has failed [1],
and the VERIFY_SUSPECT protocol on the coordinator confirms the suspicion [2]. The
coordinator then publishes a new view that excludes the suspected node. But the node hasn't
actually failed: it starts working again and tries to send messages to the group, of which
it is no longer a member.
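To make the sequence concrete, here is a toy model. It is not JGroups code, and every name in it (`View`, `verifySuspect`, `excludeSuspect`, `acceptFrom`) is hypothetical. The suspicion is confirmed only if the coordinator also cannot reach the suspect. The coordinator (assumed here to be the first member of the view) then publishes a view without it. In this simplified model, receivers reject messages from senders outside the current view. What JGroups actually does with such messages depends on the stack and version and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of suspect -> verify -> new view; not JGroups code. */
public class ViewExclusionSketch {

    // A view: an id plus the ordered member list (first member assumed to be coordinator).
    record View(long id, List<String> members) {
        String coordinator() { return members.get(0); }
        boolean contains(String member) { return members.contains(member); }
    }

    // Stand-in for VERIFY_SUSPECT: the coordinator tries the suspect itself and
    // only confirms the suspicion if that attempt also gets no answer.
    static boolean verifySuspect(boolean suspectAnswersCoordinator) {
        return !suspectAnswersCoordinator;
    }

    // The coordinator publishes a new view without the confirmed suspect.
    static View excludeSuspect(View current, String suspect, boolean suspectAnswersCoordinator) {
        if (!current.contains(suspect) || !verifySuspect(suspectAnswersCoordinator)) {
            return current; // suspicion not confirmed: keep the current view
        }
        List<String> remaining = new ArrayList<>(current.members());
        remaining.remove(suspect);
        return new View(current.id() + 1, remaining);
    }

    // In this toy model, members simply refuse traffic from senders outside their view.
    static boolean acceptFrom(View view, String sender) {
        return view.contains(sender);
    }

    public static void main(String[] args) {
        View v1 = new View(1, List.of("A", "B", "C"));
        View v2 = excludeSuspect(v1, "C", false); // C did not answer the coordinator either
        System.out.println(v2.members());          // [A, B]
        // C recovers and sends to the group, but it is no longer a member.
        System.out.println(acceptFrom(v2, "C"));   // false
    }
}
```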
Typically, something happens on the node that prevents it from responding to FD heartbeat
requests for a while: perhaps 100% CPU usage, or something blocking all the threads that
pass messages up from the transport protocol. (In JGroups 2.4 there is only one such
thread, so if it gets blocked in the application for a while, the node can be suspected.)
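The timing effect of a blocked delivery thread can be sketched with a simple heartbeat model. This is an assumption-laden illustration, not FD's implementation. The parameters `timeoutMs` and `maxTries` are only loosely modelled on FD-style settings, so see the FD wiki page in [1] for the real semantics. In the model, the monitor sends a heartbeat every `timeoutMs` and expects the ack within `timeoutMs`. The node answers only when its single delivery thread is free, and it is suspected after `maxTries` consecutive missed acks.

```java
/** Hypothetical heartbeat model with a single delivery thread; not the actual FD code. */
public class HeartbeatSuspicionSketch {

    /**
     * Returns true if the monitor would suspect the node within observeMs.
     * The node's only delivery thread is blocked during [blockStartMs, blockStartMs + blockDurationMs).
     */
    static boolean isSuspected(long timeoutMs, int maxTries,
                               long blockStartMs, long blockDurationMs, long observeMs) {
        long blockEndMs = blockStartMs + blockDurationMs;
        int consecutiveMisses = 0;
        for (long sentAt = 0; sentAt < observeMs; sentAt += timeoutMs) {
            boolean threadBlocked = sentAt >= blockStartMs && sentAt < blockEndMs;
            // A heartbeat arriving while the thread is blocked is only answered once it is released.
            long answeredAt = threadBlocked ? blockEndMs : sentAt;
            if (answeredAt <= sentAt + timeoutMs) {
                consecutiveMisses = 0;              // ack arrived in time
            } else if (++consecutiveMisses >= maxTries) {
                return true;                        // too many missed acks in a row
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 20 s stall, 3 s timeout, 3 tries: heartbeats at 3, 6 and 9 s all miss, so suspected.
        System.out.println(isSuspected(3000, 3, 1000, 20000, 30000)); // true
        // 4 s stall: the delayed ack for the 3 s heartbeat arrives at 5 s, within its timeout.
        System.out.println(isSuspected(3000, 3, 1000, 4000, 30000));  // false
    }
}
```

In this model, a stall longer than roughly `maxTries × timeoutMs` is enough to get a perfectly healthy node suspected. That is why a single blocked delivery thread matters.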
[1] Details on the most common failure detection protocols:
http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsFD,
http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsFD_SOCK, and
http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK
[2] VERIFY_SUSPECT:
http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsVERIFY_SUSPECT