Hi Ron,
No, I did not resolve the problem completely. However, I now have a better understanding
of its cause.
As I said, there are two kinds of servers: coordinators and slaves. The slaves run with a
very high concurrency factor, i.e. the load average is around 6-7, so sometimes they are
unable to answer ping requests quickly enough.
To make things work, I patched the JBoss Remoting source to increase the connection checker
timeout and the number of retries. This brought the failure rate down to an acceptable
level. We now see a failure event once or twice per hour.
However, it would be good to improve JBoss Remoting by making the server failure detection
parameters configurable. Moreover, I think ConnectionValidator should pause between
validity checks. Currently it fires all validations in a row, so an overloaded slave server
may ignore all of them.
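
To illustrate the idea of pausing between validity checks, here is a minimal sketch. It is
not JBoss Remoting code: the class PacedValidityCheck, the method isAlive and its
parameters are hypothetical names, and the ping itself is passed in as a plain callback.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of JBoss Remoting.
final class PacedValidityCheck {

    /**
     * Calls the ping up to maxAttempts times and sleeps pauseMillis after each
     * failed attempt (except the last), so a briefly overloaded server gets
     * time to recover before the next try.
     */
    static boolean isAlive(BooleanSupplier ping, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (ping.getAsBoolean()) {
                return true; // server answered, no further checks needed
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false; // every attempt failed
    }
}
```

With something like this, a server is only reported dead if it stays unresponsive for
roughly (maxAttempts - 1) * pauseMillis plus the ping timeouts, rather than for the
duration of one burst of back-to-back pings.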
It would also be a good idea to optimize the JNDI detector somehow. Currently every server
checks the liveness of all the others. With a large number of nodes, if just one server
fails to connect to a node for whatever reason, the global record in JNDI is updated, even
though the other servers may still be able to see that node. As the number of cluster
nodes grows, the problem will appear more frequently.
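
A rough way to see why this gets worse with cluster size: if each of n servers checks all
the others, there are n * (n - 1) checks per round. The sketch below estimates the chance
that at least one of them fails spuriously. The class DetectorLoad and its methods are
hypothetical, and the model assumes each check fails independently with the same
probability, which is a simplification.

```java
// Hypothetical back-of-the-envelope model, not part of JBoss Remoting.
final class DetectorLoad {

    /** Number of liveness checks per round when every node checks every other node. */
    static long checksPerRound(int nodes) {
        return (long) nodes * (nodes - 1);
    }

    /**
     * Probability that at least one check in a round fails spuriously,
     * assuming independent checks that each fail with probability p.
     */
    static double falseAlarmProbability(int nodes, double p) {
        return 1.0 - Math.pow(1.0 - p, checksPerRound(nodes));
    }
}
```

Because the number of checks grows quadratically with n, even a small per-check failure
probability adds up quickly under this model.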
- Alexey