For monitoring, use whatever tooling you may already have, because going through logs at this scale is tedious. I created a monitor web page that shows the following for each node:
Node name, coordinator, transport cluster name, first heard, last heard, entries, and evictions.
This tells me right away how many nodes are up and whether they are all using the same coordinator. I am running Infinispan in embedded mode, so it’s my code that interrogates each cache and reports the information to the monitor server. It’s actually been pretty handy because we can see the nodes that either have not started or have not joined the same cluster as the rest.
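As a rough illustration of the monitor side, here is a minimal sketch of how the reported rows could be checked. It is not my actual code: the `ClusterMonitor` and `NodeReport` names, the field layout, and the staleness threshold are all assumptions made up for the example. On each node, the values would come from the embedded cache manager (for example `getCoordinator()` and `getClusterName()` on `EmbeddedCacheManager`) and from the cache statistics; that part is left out so the sketch has no Infinispan dependency.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical monitor-side helper: checks the rows reported by each node. */
public final class ClusterMonitor {

    /** One row of the monitor page, as reported by a single node (assumed shape). */
    public static final class NodeReport {
        public final String nodeName;
        public final String coordinator;   // assumed non-null
        public final String clusterName;
        public final long firstHeardMillis;
        public final long lastHeardMillis;
        public final long entries;
        public final long evictions;

        public NodeReport(String nodeName, String coordinator, String clusterName,
                          long firstHeardMillis, long lastHeardMillis,
                          long entries, long evictions) {
            this.nodeName = nodeName;
            this.coordinator = coordinator;
            this.clusterName = clusterName;
            this.firstHeardMillis = firstHeardMillis;
            this.lastHeardMillis = lastHeardMillis;
            this.entries = entries;
            this.evictions = evictions;
        }
    }

    /** Groups node names by the coordinator each node reports. */
    public static Map<String, List<String>> byCoordinator(Collection<NodeReport> reports) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (NodeReport r : reports) {
            groups.computeIfAbsent(r.coordinator, k -> new ArrayList<>()).add(r.nodeName);
        }
        return groups;
    }

    /** True when at least one node reported and all nodes name the same coordinator. */
    public static boolean singleCluster(Collection<NodeReport> reports) {
        return !reports.isEmpty() && byCoordinator(reports).size() == 1;
    }

    /** Expected nodes that never reported, or were last heard longer ago than staleAfterMillis. */
    public static List<String> missingOrStale(Collection<String> expected,
                                              Collection<NodeReport> reports,
                                              long nowMillis, long staleAfterMillis) {
        Set<String> live = new HashSet<>();
        for (NodeReport r : reports) {
            if (nowMillis - r.lastHeardMillis <= staleAfterMillis) {
                live.add(r.nodeName);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!live.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

If `byCoordinator` returns more than one key, the nodes have split into separate clusters, and each key's list shows which nodes ended up together. `missingOrStale` covers the other case, nodes that never started or stopped reporting.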