Hi Mircea,
I think we need a third option, in addition to the retry interval and
the number of attempts, for deciding when to take a site offline: a
min-time (or whatever we want to call it).
Say we have retryInterval=1000 (ms) and maxRetries=5. This means that if
we get a SITE-UNREACHABLE 5 times for a given site, we declare that site
offline and cease sending requests to it.
However, if we have 5 different threads sending requests to the site,
then each of them will increment the same counter, and thus we take the
site offline after about 1 second, rather than the roughly 5 seconds a
single sender would need!
That's where min-time comes in: we should wait at least min-time before
we take any site offline, even if maxRetries has been exceeded.
Example: min-time=60000 (ms), maxRetries=10, retryInterval=1000 (ms)
If we have 20 threads sending requests to site SFO (which is down), then
we might have numRetries=20 after 10 seconds, and perhaps numRetries=60
after 50 seconds. But only once 60 seconds have elapsed do we take SFO
offline.
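
To make the rule concrete, here is a rough sketch of the check I have in
mind. The class and method names are made up for illustration and are
not existing JGroups code; it also assumes that min-time is measured
from the first failure, and that a successful request resets the
counter, neither of which we have decided yet.

```java
// Hypothetical sketch only: SiteOfflinePolicy is not an existing JGroups class.
final class SiteOfflinePolicy {
    private final int maxRetries;   // failures needed before a site may go offline
    private final long minTimeMs;   // minimum time since the first failure
    private int numRetries;         // shared by all sender threads
    private long firstFailureMs = -1; // -1 means no failure recorded yet

    SiteOfflinePolicy(int maxRetries, long minTimeMs) {
        this.maxRetries = maxRetries;
        this.minTimeMs = minTimeMs;
    }

    /**
     * Records one SITE-UNREACHABLE seen at nowMs. Returns true only if
     * maxRetries has been reached AND at least minTimeMs has elapsed
     * since the first failure.
     */
    synchronized boolean onSiteUnreachable(long nowMs) {
        if (firstFailureMs < 0)
            firstFailureMs = nowMs;
        numRetries++;
        return numRetries >= maxRetries && nowMs - firstFailureMs >= minTimeMs;
    }

    /** Assumption: a successful request clears the failure state. */
    synchronized void onSuccess() {
        numRetries = 0;
        firstFailureMs = -1;
    }

    synchronized int numRetries() {
        return numRetries;
    }
}
```

With maxRetries=10 and min-time=60000, 20 failures in the first 10
seconds would not take SFO offline; the first failure reported at or
after the 60 second mark would.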
The main reason for min-time would be to prevent taking a site offline
during the short period of time when the site master changes and
multiple threads increment numRetries in short order.
--
Bela Ban, JGroups lead (http://www.jgroups.org)