[jboss-dev-forums] [Design of Clustering on JBoss (Clusters/JBoss)] - Re: Handling of 'deployments taking ~1 minute' scenario

galder.zamarreno@jboss.com do-not-reply at jboss.com
Tue Jan 15 17:26:05 EST 2008


Yeah, that bug was fixed. In this latest case, the root cause was log4j logging over NFS, which was having issues on some cluster nodes. Logging over NFS is never a good idea, but I needed to be 100% sure that this was the cause of the slow deployments and not another JGroups bug ;).

I was lucky: while digging through the case, I was able to match the nodes on which the RPC calls failed to the logs of two nodes that showed log4j issues ("stale NFS handle").
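For anyone wanting to repeat that kind of cross-check, here is a rough sketch of the idea in Java. It is only an illustration: the class `StaleNfsScan`, its methods, the node names and the sample log lines are all made up for this example and are not part of JBoss, JGroups or log4j. The marker text is the phrase quoted above; the exact wording in real logs may differ by platform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not a JBoss/JGroups API.
class StaleNfsScan {
    // Phrase taken from the post, matched case-insensitively (assumption:
    // real log wording may vary between operating systems).
    static final String MARKER = "stale nfs";

    // Returns, in sorted order, the nodes whose log lines contain the marker.
    static List<String> nodesWithStaleNfs(Map<String, List<String>> logLinesByNode) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : new TreeMap<>(logLinesByNode).entrySet()) {
            for (String line : e.getValue()) {
                if (line.toLowerCase(Locale.ROOT).contains(MARKER)) {
                    hits.add(e.getKey());
                    break; // one hit per node is enough
                }
            }
        }
        return hits;
    }

    // Returns the nodes that both failed the RPC and logged the marker.
    static List<String> correlate(List<String> failedRpcNodes,
                                  Map<String, List<String>> logLinesByNode) {
        List<String> stale = nodesWithStaleNfs(logLinesByNode);
        stale.retainAll(failedRpcNodes);
        return stale;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for per-node server logs.
        Map<String, List<String>> logs = Map.of(
            "node1", List.of("INFO  deployment started", "ERROR log4j: stale NFS handle"),
            "node2", List.of("INFO  deployment started"),
            "node3", List.of("ERROR log4j: Stale NFS handle"));
        List<String> failedRpc = List.of("node1", "node3");
        System.out.println(correlate(failedRpc, logs));
    }
}
```

If every node with a failed RPC also shows up in the result, that points at the logging setup rather than at the group communication layer, which is the conclusion reached in this case.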

anonymous wrote : In some other case where a remote node "isn't responding" all you could do would be to send a message to "commit suicide" -- there's no mechanism to evict a node from the group outside of JGroups' own failure detection. But if the node isn't responding to RPCs, it likely wouldn't respond to the "commit suicide" either.

If it weren't responding to RPCs, FD/FD_SOCK would eventually discover that the node is not responding. In this case, though, the failure detection layer was OK, so the cluster was not dismantled, yet something was wrong that was disrupting an otherwise healthy cluster. The customer was concerned about such a scenario.

anonymous wrote : Logically, I could see some benefit in some sort of self-healing approach where cluster members detect faults and restart themselves or send commands to others telling them to restart. But this will take a lot of thought.

I'll file a JIRA tomorrow to track this.

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4120248#4120248
