Parallel FD
-----------
Key: JGRP-395
URL: http://jira.jboss.com/jira/browse/JGRP-395
Project: JGroups
Issue Type: Feature Request
Affects Versions: 2.4
Reporter: Bela Ban
Assigned To: Bela Ban
Fix For: 2.5
With FD, when we have N nodes in a cluster and the switch crashes, every node takes
roughly (N-1) * TIMEOUT ms to become a singleton cluster. This is because regular FD
only pings the next-in-line member, so failures are detected one after the other, e.g.
- Cluster is A, B, C, D
- The plug is pulled
- Example B:
  - After TIMEOUT ms, B decides that C is dead and excludes C from the pingable members
  - B then keeps emitting SUSPECT(C) until it gets a new view which excludes C
  - B switches to pinging D
  - After another TIMEOUT ms, it excludes D and switches to A
  - When all of C, D and A have been excluded, B decides to become a singleton cluster (and
    coordinator in it)
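The walk-through above can be modelled in a few lines of Java. This is only an illustrative sketch, not JGroups code: the class SequentialFdModel, its method names and the 3000 ms timeout used below are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of regular FD after a switch failure; not part of JGroups. */
final class SequentialFdModel {

    /**
     * Returns the peers in the order in which 'self' would exclude them,
     * assuming 'members' is the view in ring order and nobody answers pings.
     */
    static List<String> exclusionOrder(List<String> members, String self) {
        int start = members.indexOf(self);
        if (start < 0) {
            throw new IllegalArgumentException("not a member: " + self);
        }
        List<String> order = new ArrayList<>();
        // Walk the ring once, starting at the next-in-line member.
        for (int i = 1; i < members.size(); i++) {
            order.add(members.get((start + i) % members.size()));
        }
        return order;
    }

    /** One TIMEOUT per excluded peer, i.e. (N-1) * TIMEOUT. */
    static long timeToSingletonMs(int clusterSize, long timeoutMs) {
        return (long) (clusterSize - 1) * timeoutMs;
    }
}
```

For the view A, B, C, D, exclusionOrder(..., "B") yields C, D, A, and with an assumed TIMEOUT of 3000 ms timeToSingletonMs(4, 3000) gives 9000 ms.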
SOLUTION:
- Nodes don't actively ping other nodes. Instead, each node periodically multicasts a
HEARTBEAT to the cluster
- The HEARTBEAT is suppressed when a node sends data, because data counts as a heartbeat
as well
- Every node maintains a table of nodes and the last time it received either a message or
a HEARTBEAT from each node
- That timestamp is set to the current time whenever a message or HEARTBEAT arrives from the node
- Periodically, each node checks whether any node has not sent it data or a heartbeat for more
than TIMEOUT ms. If so, that node is suspected
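A minimal sketch of the bookkeeping described above, assuming node addresses are plain strings and the caller passes in the current time (which keeps the example testable). HeartbeatTable and all of its method names are hypothetical and do not reflect the eventual JGroups implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-node state for the heartbeat-based failure detector. */
final class HeartbeatTable {
    // Node -> last time (ms) a message or HEARTBEAT was received from it.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMs;
    // Last time this node sent data or a HEARTBEAT itself.
    private volatile long lastSentMs;

    HeartbeatTable(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastSentMs = nowMs;
    }

    /** Call for every data message or HEARTBEAT received from 'node'. */
    void received(String node, long nowMs) {
        lastSeen.put(node, nowMs);
    }

    /** Call whenever this node sends data (or a HEARTBEAT). */
    void sent(long nowMs) {
        lastSentMs = nowMs;
    }

    /** A HEARTBEAT is only due if nothing was sent during the last interval. */
    boolean heartbeatDue(long nowMs, long intervalMs) {
        return nowMs - lastSentMs >= intervalMs;
    }

    /** Periodic check: nodes silent for more than TIMEOUT ms, sorted by name. */
    List<String> suspects(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

Because every peer is tracked at the same time, a node that loses the switch would see all peers cross the timeout in the same check, so under these assumptions it becomes a singleton after roughly one TIMEOUT rather than (N-1) * TIMEOUT.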