[
https://issues.jboss.org/browse/JGRP-1514?page=com.atlassian.jira.plugin....
]
Bela Ban commented on JGRP-1514:
--------------------------------
(Comments from JGRP-1401)
When we have sites {A,B,C} and {X,Y,Z} (with site masters A and X), then between X
leaving (or crashing) and Y taking over, none of the messages sent by the first site are
relayed to the second site.
Because the sites are autonomous, there won't be any retransmission of the dropped
messages.
This can have an adverse effect, e.g. in Infinispan:
- Say key K is stored on A, B and Z
- Now we're updating K on A and B, but before the change is relayed to the other
site, X crashes
- If there is no rebalancing, e.g. because K is still to be stored on A, B and Z, then Z
has a stale value, because the update to Z was dropped!
SOLUTION 1:
- Have a backup coordinator B cache the last N messages in memory (with overflow to disk)
- A numbers relayed messages
- As soon as A has relayed message #50, it sends this info to B. Or, alternatively, this
could be done periodically, or based on the number of relayed messages (e.g. every 10
messages)
- B can then purge those messages
- When A crashes, B runs a reconciliation protocol with X to determine whether to relay
some backed up messages
- C now starts acting as backup relay to B
This solution is probably the simplest to implement, and doesn't require any code
changes in Infinispan. However, there is still a chance of message loss if both the relay
*and* the backup relay crash at the same time.
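The bookkeeping on the backup relay B could look roughly like the sketch below. It is only an illustration of the numbering, purging and reconciliation steps listed above: BackupRelayBuffer and its methods are hypothetical names, not JGroups API, and the overflow to disk is replaced by simply evicting the oldest entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper (not part of JGroups): the buffer kept by backup relay B.
final class BackupRelayBuffer {
    // Messages keyed by the sequence number that site master A assigned to them
    private final TreeMap<Long, byte[]> pending = new TreeMap<>();
    private final int capacity;

    BackupRelayBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Called for every message that A numbers and is about to relay
    synchronized void store(long seqno, byte[] payload) {
        pending.put(seqno, payload);
        // Assumption: the proposal overflows to disk; this sketch drops the oldest entry instead
        while (pending.size() > capacity)
            pending.pollFirstEntry();
    }

    // A reports "relayed everything up to and including seqno": purge those copies
    synchronized void relayedUpTo(long seqno) {
        pending.headMap(seqno, true).clear();
    }

    // Reconciliation after A crashed: the remote site master reports the highest
    // seqno it received, and B returns the backed-up messages that still need relaying
    synchronized List<byte[]> toRelayAfter(long highestReceivedByRemote) {
        return new ArrayList<>(pending.tailMap(highestReceivedByRemote, false).values());
    }

    synchronized int size() {
        return pending.size();
    }
}
```

With A reporting after message #50, relayedUpTo(50) leaves only the messages numbered above 50 in the buffer; if X then reports 55 as the highest number it received, toRelayAfter(55) returns whatever B still holds from #56 onwards.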
SOLUTION 2:
- After a crash (not a graceful leave !) of a relay coordinator, there has to be a full
rebalancing of all keys
- This is wasteful though
- May not be needed; perhaps Infinispan could check whether a full rebalancing is required?
RELAY2: store-and-forward inter-site messages to prevent message loss when site master crashes
----------------------------------------------------------------------------------------------
Key: JGRP-1514
URL: https://issues.jboss.org/browse/JGRP-1514
Project: JGroups
Issue Type: Feature Request
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.2
JGRP-1401 deals with crashes of the site master *within* the current site; FORWARD_TO_COORD
was developed there to prevent message loss while the site is temporarily without a site
master (until a new coord takes over).
*This* JIRA is about preventing message loss caused by the site master of a *remote site*
crashing. The general idea is to store-and-forward a message for a certain time and/or a
certain number of attempts to forward.
The most frequent use case is probably that the site master of a remote site left (or
crashed) and the new site master hasn't yet opened the bridge, so the message would be
lost. Store-and-forward should help here.
If we still cannot forward the message for a (configurable) time and/or number of tries,
we'll send a SITE_UNREACHABLE message back to the original sender of the message,
which will then have to deal with it.
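A minimal sketch of that store-and-forward loop follows. It is not the RELAY2 implementation: StoreAndForwardQueue, its fields and the forward callback are hypothetical names, and the sketch assumes that a message is given up on as soon as either the time limit or the attempt limit is hit, whichever comes first. The returned list stands for the senders that would be sent SITE_UNREACHABLE.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch (not RELAY2 API): holds messages that could not be forwarded yet.
final class StoreAndForwardQueue {
    private static final class Entry {
        final String sender;        // original sender, to be notified on failure
        final String payload;
        final long deadlineMillis;  // give up once this time is reached
        int attempts;

        Entry(String sender, String payload, long deadlineMillis) {
            this.sender = sender;
            this.payload = payload;
            this.deadlineMillis = deadlineMillis;
        }
    }

    private final Deque<Entry> stored = new ArrayDeque<>();
    private final long maxAgeMillis;  // assumed configurable time limit
    private final int maxAttempts;    // assumed configurable attempt limit

    StoreAndForwardQueue(long maxAgeMillis, int maxAttempts) {
        this.maxAgeMillis = maxAgeMillis;
        this.maxAttempts = maxAttempts;
    }

    // Store a message whose remote site master is currently not reachable
    void store(String sender, String payload, long nowMillis) {
        stored.add(new Entry(sender, payload, nowMillis + maxAgeMillis));
    }

    // Try to forward every stored message once. 'forward' returns true on success.
    // Returns the senders that should now receive SITE_UNREACHABLE.
    List<String> retry(Predicate<String> forward, long nowMillis) {
        List<String> unreachable = new ArrayList<>();
        for (Iterator<Entry> it = stored.iterator(); it.hasNext();) {
            Entry e = it.next();
            e.attempts++;
            if (forward.test(e.payload)) {
                it.remove();  // delivered, nothing more to do
            } else if (e.attempts >= maxAttempts || nowMillis >= e.deadlineMillis) {
                it.remove();  // out of tries or out of time
                unreachable.add(e.sender);
            }
        }
        return unreachable;
    }

    int size() {
        return stored.size();
    }
}
```

The clock is passed in as a parameter so that retry() can be driven by a timer task in production and by fixed timestamps in a test.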