For monitoring, anything you may already have would be welcome; going through logs at this scale is tedious. I created a monitor web page that shows the following for each node:

Node name, coordinator, transport cluster name, first heard, last heard, entries, and evictions.

This will tell me right away how many nodes are up and if they are all using the same coordinator. I am running Infinispan in an embedded fashion, so it’s my code that interrogates each cache and reports the information to the monitor server. It’s actually been pretty handy because we can see the nodes that either have not started or have not formed one cluster.
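For anyone who wants to build something similar, here is a minimal sketch of the server-side bookkeeping, not my actual code. The class name `ClusterMonitor`, the `report` method, and its parameters are all hypothetical; how each node collects its values from Infinispan and sends them to the monitor server is left out. The sketch only shows how per-node reports can be reduced to "how many nodes are up" and "do they all agree on one coordinator".

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical monitor-side store: one row per node, keyed by node name. */
public class ClusterMonitor {

    /** One row of the monitor page; fields mirror the columns listed above. */
    static final class NodeStatus {
        final String nodeName;
        final String coordinator;
        final String clusterName;
        final long firstHeardMillis;
        final long lastHeardMillis;
        final long entries;
        final long evictions;

        NodeStatus(String nodeName, String coordinator, String clusterName,
                   long firstHeardMillis, long lastHeardMillis,
                   long entries, long evictions) {
            this.nodeName = nodeName;
            this.coordinator = coordinator;
            this.clusterName = clusterName;
            this.firstHeardMillis = firstHeardMillis;
            this.lastHeardMillis = lastHeardMillis;
            this.entries = entries;
            this.evictions = evictions;
        }
    }

    private final Map<String, NodeStatus> nodes = new LinkedHashMap<>();

    /** Records a report from a node, preserving the time it was first heard. */
    public synchronized void report(String nodeName, String coordinator,
                                    String clusterName, long entries,
                                    long evictions, long nowMillis) {
        NodeStatus previous = nodes.get(nodeName);
        long firstHeard = (previous == null) ? nowMillis : previous.firstHeardMillis;
        nodes.put(nodeName, new NodeStatus(nodeName, coordinator, clusterName,
                firstHeard, nowMillis, entries, evictions));
    }

    /** Number of distinct nodes that have reported at least once. */
    public synchronized int nodeCount() {
        return nodes.size();
    }

    /** Distinct coordinators named in the latest report from each node. */
    public synchronized Set<String> coordinators() {
        Set<String> result = new HashSet<>();
        for (NodeStatus status : nodes.values()) {
            result.add(status.coordinator);
        }
        return result;
    }

    /** True when every reporting node names the same coordinator. */
    public synchronized boolean isSingleCluster() {
        return coordinators().size() == 1;
    }

    /** Time the node was first heard from, or -1 if it never reported. */
    public synchronized long firstHeard(String nodeName) {
        NodeStatus status = nodes.get(nodeName);
        return (status == null) ? -1L : status.firstHeardMillis;
    }
}
```

More than one entry in `coordinators()` means the nodes have split into several clusters, which is the condition the page makes obvious at a glance. A node that never shows up in the map has not started or cannot reach the monitor.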

On 18 Mar 2011, at 21:35, Dave wrote:

I won’t be able to get CR4 uploaded; policy dictates that I wait until the final release. However, I was able to get 431 nodes up and running as a replicated cluster and 115 nodes up as a distributed cluster. The 430 node cache started with no problems about 50% of the time. When the nodes formed multiple clusters, they merged together only some of the time. It really does appear to be a startup issue at this point. We have not pushed it hard enough yet to see what happens at this scale under load.

Any idea when CR4 will be FINAL?

Hopefully some time next week.

I have documented the system property on the JIRA.

Are there any tools to help diagnose problems / performance at this scale (I ended up writing my own monitor program)?

Specifically what tools are you after?

Cheers

Manik