Update:
It seems that the problem persists even after setting NewRatio=3 with the throughput GC.
We therefore tried a different approach altogether: using the CMS collector (to minimize pause times) with the following settings on one of the servers:
JAVA_OPTS="-server -Xrs -Xms2048m -Xmx2048m -XX:MaxPermSize=512m -XX:NewSize=300m -XX:ThreadStackSize=160 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSCompactAtFullCollection -Djava.awt.headless=true -Djava.library.path=$JBOSS_NATIVE_DIR -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000"
I had to set NewSize explicitly because NewRatio appears to be ignored by the CMS collector (I read a post suggesting this is some sort of JVM bug).
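For comparison, here is a rough sketch of the young generation size that NewRatio=3 would imply for this 2048m heap, going by the documented meaning of NewRatio (old:young = NewRatio:1). The class and method names are made up for illustration, and this is only the arithmetic, not what the JVM actually does under CMS.

```java
// Hypothetical helper: young generation size implied by a given NewRatio.
final class YoungGenSizing {
    // NewRatio=N means old:young = N:1, so young = heap / (N + 1).
    static long youngSizeMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }

    public static void main(String[] args) {
        long implied = youngSizeMb(2048, 3);
        System.out.println("Young gen implied by NewRatio=3: " + implied + "m");
        System.out.println("Young gen set via NewSize:       300m");
    }
}
```

So the explicit NewSize=300m starts the young generation noticeably smaller than the 512m that NewRatio=3 would have given.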
We noticed that while the server with the throughput GC was having a "bad" run, the one with the CMS collector stayed healthy, although both were under significant load. The server with the CMS collector recovered fully and its GC was freeing a lot of memory, while the one with the throughput GC did not recover and had to be restarted.
CMSInitiatingOccupancyFraction was set to 60, which is somewhat conservative; we will probably need to experiment with that value.
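To help pick a value, one option is to watch how full the heap pools actually get under load. Below is a minimal sketch using the standard java.lang.management API; the class name and helper method are my own, and it would need to run inside the server JVM (or be adapted to a remote JMX connection) to be useful. Pool names differ by collector and JVM version, so it simply prints every heap pool instead of filtering by name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints the occupancy of each heap memory pool.
final class HeapPoolOccupancy {
    // Percentage of the pool in use; -1 when the pool reports no maximum.
    static double occupancyPercent(long used, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * used / max;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the permanent generation
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            System.out.printf("%s: %.1f%% used%n", pool.getName(),
                    occupancyPercent(usage.getUsed(), usage.getMax()));
        }
    }
}
```

If the old generation routinely sits well below 60% after a collection, a higher fraction might be worth trying; if it climbs quickly past it, the current setting is probably reasonable.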