[jboss-jira] [JBoss JIRA] (JGRP-1401) RELAY2: messages lost when relay coordinator crashes

Wednesday, 12 September 2012

     [ https://issues.jboss.org/browse/JGRP-1401?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1401:
---------------------------

    Description: 
When we have sites LON={A,B,C} and SFO={X,Y,Z}, if C wants to send a unicast message to
the site master of SFO (X), but the *local site master (A)* leaves or crashes, and B
hasn't taken over yet, the message will be lost.

The idea to solve this is to forward the message to the next coordinator if the current
coordinator leaves or dies.

A FORWARD_TO_COORD protocol was developed, which handles this task. RELAY2 checks at
startup if FORWARD_TO_COORD is present and uses a FORWARD event to tell that protocol to
forward a message to the current coordinator. If the protocol is not present, a simple
unicast will be sent (unreliably).

FORWARD_TO_COORD sends a message M to the current coord and removes M when an ack has been
received. If there is a view change, indicating the old coord left, it resends all pending
messages, and so on. The extreme case would be that everyone but the sender dies and then
M would be sent to the sender itself.
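
The ack-and-resend behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual FORWARD_TO_COORD code: the class ForwardToCoordSketch, its methods and the string-based transport callback are hypothetical, and the sketch assumes the coordinator is the first member of the view.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical sketch of forward-to-coordinator with acks; not the JGroups implementation. */
class ForwardToCoordSketch {
    // Assumed transport callback: (destination, wire message). Stands in for a real unicast.
    private final BiConsumer<String, String> transport;
    // Messages sent but not yet acked, kept in send order.
    private final Map<Long, String> pending = new LinkedHashMap<>();
    private String coord;
    private long nextId = 1;

    ForwardToCoordSketch(List<String> initialView, BiConsumer<String, String> transport) {
        this.coord = initialView.get(0); // assumption: first member of the view is the coordinator
        this.transport = transport;
    }

    /** Sends payload to the current coordinator and remembers it until it is acked. */
    synchronized long forward(String payload) {
        long id = nextId++;
        pending.put(id, payload);
        transport.accept(coord, id + ":" + payload);
        return id;
    }

    /** The coordinator acked message id, so it no longer needs to be kept. */
    synchronized void ack(long id) {
        pending.remove(id);
    }

    /** On a view change with a new coordinator, resends every message still pending. */
    synchronized void viewChange(List<String> newView) {
        String newCoord = newView.get(0);
        if (newCoord.equals(coord))
            return; // same coordinator: nothing to resend
        coord = newCoord;
        for (Map.Entry<Long, String> e : pending.entrySet())
            transport.accept(coord, e.getKey() + ":" + e.getValue());
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

In the extreme case mentioned above, the new view would contain only the sender, so viewChange() would hand the pending messages to the sender itself.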

  was:
When we have sites {A,B,C} and {X,Y,Z} (with site masters A and X), then between X
leaving (or crashing) and Y taking over, no messages sent by the first site are relayed
to the second site.
Because the sites are autonomous, there won't be any retransmission of the dropped
messages.
This can have an adverse effect, e.g. in Infinispan:
- Say key K is stored on A, B and Z
- Now we're updating K on A and B, but before the change is relayed to the other
site, X crashes
- If there is no rebalancing, e.g. because K is still to be stored on A, B and Z, then
Z has a stale value, since the update on Z was dropped!

SOLUTION 1:
- Have a backup coordinator B cache the last N messages in memory (with overflow to disk)
- The relay coordinator A numbers the messages it relays
- As soon as A has relayed message #50, it sends this info to B. Or, alternatively, this
could be done periodically, or based on the number of relayed messages (e.g. every 10
messages)
- B can then purge those messages
- When A crashes, B runs a reconciliation protocol with X to determine whether to relay
some backed up messages
- C now starts acting as backup relay to B
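
The bookkeeping on the backup coordinator B could look roughly like the sketch below. It is an illustration only: the class BackupRelayCacheSketch and its methods are hypothetical, the cache is unbounded and purely in-memory (no limit of N messages, no overflow to disk), and the reconciliation step assumes that X simply reports the highest message number it has received.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of the backup cache on B; not part of JGroups. */
class BackupRelayCacheSketch {
    // Backed-up messages, keyed by the number A assigned to them.
    private final TreeMap<Long, String> cache = new TreeMap<>();

    /** B stores a copy of every numbered message that A is going to relay. */
    synchronized void store(long seqno, String payload) {
        cache.put(seqno, payload);
    }

    /** A reports that it has relayed everything up to and including seqno: purge those. */
    synchronized void relayedUpTo(long seqno) {
        cache.headMap(seqno, true).clear();
    }

    /**
     * After A crashed: given the highest number the remote site master reports
     * having received (an assumed reconciliation input), returns the backed-up
     * messages that still have to be relayed, in order.
     */
    synchronized List<String> reconcile(long highestReceivedByRemote) {
        return new ArrayList<>(cache.tailMap(highestReceivedByRemote, false).values());
    }

    synchronized int size() {
        return cache.size();
    }
}
```

For example, if B holds messages 48 to 53, A reports message 50 as relayed and then crashes, and X reports 51 as the highest message it received, reconcile(51) returns messages 52 and 53.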

This solution is probably the simplest to implement, and doesn't require any code
changes in Infinispan. However, there is still a chance of message loss if both the relay
*and* the backup relay crash at the same time.

SOLUTION 2:
- After a crash (not a graceful leave!) of a relay coordinator, there has to be a full
rebalancing of all keys
- This is wasteful though
- This may not be needed; perhaps Infinispan could check whether a full rebalancing is
required?

...
 RELAY2: messages lost when relay coordinator crashes
 ----------------------------------------------------

                 Key: JGRP-1401
                 URL: https://issues.jboss.org/browse/JGRP-1401
             Project: JGroups
          Issue Type: Feature Request
            Reporter: Bela Ban
            Assignee: Bela Ban
             Fix For: 3.2

 When we have sites LON={A,B,C} and SFO={X,Y,Z}, if C wants to send a unicast message to
the site master of SFO (X), but the *local site master (A)* leaves or crashes, and B
hasn't taken over yet, the message will be lost.
 The idea to solve this is to forward the message to the next coordinator if the current
coordinator leaves or dies.
 A FORWARD_TO_COORD protocol was developed, which handles this task. RELAY2 checks at
startup if FORWARD_TO_COORD is present and uses a FORWARD event to tell that protocol to
forward a message to the current coordinator. If the protocol is not present, a simple
unicast will be sent (unreliably).
 FORWARD_TO_COORD sends a message M to the current coord and removes M when an ack has
been received. If there is a view change, indicating the old coord left, it resends all
pending messages, and so on. The extreme case would be that everyone but the sender dies
and then M would be sent to the sender itself. 
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
