Yeah, that bug was fixed. In this latest case, the root cause was log4j logging
over NFS, which was having issues on some cluster nodes. Logging over NFS is never a
good idea, but I needed to be 100% sure that this was the cause of the slow deployments and
not another JGroups bug ;).
I was lucky enough that, when digging through the case, I was able to match the nodes for
which the RPC calls failed to the logs of two nodes that showed log4j issues: "stale
NFS handle".
anonymous wrote : In some other case where a remote node "isn't responding"
all you could do would be to send a message to "commit suicide" -- there's
no mechanism to evict a node from the group outside of JGroups' own failure detection.
But if the node isn't responding to RPCs, it likely wouldn't respond to the
"commit suicide" either.
If it wasn't responding to RPCs, FD/FD_SOCK would eventually discover that the node was
not responding. In this case, though, the failure detection layer was OK, so the cluster was not
dismantled, but something else was disrupting an otherwise healthy cluster. The customer was
concerned about such a scenario.
anonymous wrote : Logically, I could see some benefit in some sort of self-healing
approach where cluster members detect faults and restart themselves or send commands to
others telling them to restart. But this will take a lot of thought.
I'll file a JIRA tomorrow to track this.
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4120248#...
Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&a...