[jboss-jira] [JBoss JIRA] (JGRP-1401) RELAY2: messages lost when relay coordinator crashes

Tue Aug 28 09:17:15 EDT 2012

     [ https://issues.jboss.org/browse/JGRP-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban updated JGRP-1401:
---------------------------

        Summary: RELAY2: messages lost when relay coordinator crashes  (was: RELAY: messages lost when relay coordinator crashes)
    Description: 
When we have sites {A,B,C} and {X,Y,Z} (with site masters A and X), then between X leaving (or crashing) and Y taking over, none of the messages sent by the first site are relayed to the second site.
Because the sites are autonomous, there won't be any retransmission of the dropped messages.
This can have an adverse effect, e.g. in Infinispan:
- Say key K is stored on A, B and Z
- Now we update K on A and B, but before the change is relayed to the other site, X crashes
- If there is no rebalancing (e.g. because K is still to be stored on A, B and Z), Z is left with a stale value, since the update to Z was dropped!
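
The scenario above can be reproduced with a toy model. This is only an illustrative sketch and uses no JGroups or Infinispan code: the names StaleValueDemo, Node and Bridge are hypothetical, and the bridge is assumed to silently drop anything sent while the remote site has no site master.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the stale-value scenario; not JGroups or Infinispan code. */
class StaleValueDemo {
    /** Hypothetical stand-in for a cache node holding its own copy of the data. */
    static final class Node {
        final String name;
        final Map<String, String> store = new HashMap<>();
        Node(String name) { this.name = name; }
    }

    /** Hypothetical bridge between the sites; forwards only while the remote site master is up. */
    static final class Bridge {
        boolean remoteSiteMasterUp = true;
        int dropped = 0;

        void relay(Node target, String key, String value) {
            if (!remoteSiteMasterUp) {
                dropped++; // autonomous sites: nothing retransmits this later
                return;
            }
            target.store.put(key, value);
        }
    }

    /** Runs the scenario and returns the value Z ends up holding for K. */
    static String run() {
        Node a = new Node("A"), b = new Node("B"), z = new Node("Z");
        Bridge bridge = new Bridge();

        // K is stored on A, B and Z.
        a.store.put("K", "v1");
        b.store.put("K", "v1");
        bridge.relay(z, "K", "v1");

        // X crashes; Y has not taken over yet.
        bridge.remoteSiteMasterUp = false;

        // K is updated on A and B; the relayed update for Z is dropped.
        a.store.put("K", "v2");
        b.store.put("K", "v2");
        bridge.relay(z, "K", "v2");

        // Y takes over, but nothing replays the dropped update.
        bridge.remoteSiteMasterUp = true;
        return z.store.get("K");
    }

    public static void main(String[] args) {
        System.out.println("Z holds K=" + run()); // prints "Z holds K=v1"
    }
}
```

A and B hold v2 at the end while Z still holds v1, which is the stale read described in the last bullet.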

SOLUTION 1:
- Have a backup coordinator B cache the last N messages in memory (with overflow to disk)
- A numbers relayed messages
- As soon as A has relayed message #50, it sends this info to B. Or, alternatively, this could be done periodically, or based on the number of relayed messages (e.g. every 10 messages)
- B can then purge those messages
- When A crashes, B runs a reconciliation protocol with X to determine whether any backed-up messages still need to be relayed
- C now starts acting as backup relay to B

This solution is probably the simplest to implement, and doesn't require any code changes in Infinispan. However, there is still a chance of message loss if both the relay *and* the backup relay crash at the same time.
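
The bookkeeping on the backup coordinator in SOLUTION 1 could look roughly like the sketch below. It is not JGroups code: BackupRelayBuffer and its methods are hypothetical names, payloads are plain strings, the buffer evicts the oldest entry instead of overflowing to disk, and the reconciliation step is assumed to yield the highest sequence number the remote site master has seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical backup-side buffer for SOLUTION 1; not part of JGroups. */
class BackupRelayBuffer {
    private final int capacity; // the "last N messages"
    private final TreeMap<Long, String> pending = new TreeMap<>();

    BackupRelayBuffer(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be > 0");
        this.capacity = capacity;
    }

    /** Caches a message that the site master numbered with seqno. */
    synchronized void cache(long seqno, String payload) {
        pending.put(seqno, payload);
        // The proposal overflows to disk; this sketch simply evicts the oldest entry.
        while (pending.size() > capacity) pending.pollFirstEntry();
    }

    /** The site master reports that everything up to and including seqno was relayed. */
    synchronized void relayedUpTo(long seqno) {
        pending.headMap(seqno, true).clear();
    }

    /** After the site master crashed: messages the remote site has not seen yet. */
    synchronized List<String> toReplay(long highestSeenByRemote) {
        return new ArrayList<>(pending.tailMap(highestSeenByRemote, false).values());
    }

    synchronized int size() { return pending.size(); }

    public static void main(String[] args) {
        BackupRelayBuffer b = new BackupRelayBuffer(100);
        for (long i = 48; i <= 53; i++) b.cache(i, "msg-" + i);
        b.relayedUpTo(50);                  // A relayed #50, so B purges 48..50
        // A crashes; reconciliation with X reports that X last received #51.
        System.out.println(b.toReplay(51)); // prints "[msg-52, msg-53]"
    }
}
```

Keying the buffer by sequence number keeps both the purge and the replay as simple range operations. As noted above, anything buffered only in memory on B is still lost if A and B crash at the same time.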

SOLUTION 2:
- After a crash (not a graceful leave!) of a relay coordinator, there has to be a full rebalancing of all keys
- This is wasteful though
- It may not always be needed; perhaps Infinispan could check whether a full rebalancing is required?

  was:
When we have sites {A,B,C} and {X,Y,Z} (with relay coords A and X), then between X leaving (or crashing) and Y taking over, none of the messages sent by the first site are relayed to the second site.
Because the sites are autonomous, there won't be any retransmission of the dropped messages.
This can have an adverse effect, e.g. in Infinispan:
- Say key K is stored on A, B and Z
- Now we update K on A and B, but before the change is relayed to the other site, X crashes
- If there is no rebalancing (e.g. because K is still to be stored on A, B and Z), Z is left with a stale value, since the update to Z was dropped!

SOLUTION 1:
- Have a backup coordinator B cache the last N messages in memory (with overflow to disk)
- A numbers relayed messages
- As soon as A has relayed message #50, it sends this info to B. Or, alternatively, this could be done periodically, or based on the number of relayed messages (e.g. every 10 messages)
- B can then purge those messages
- When A crashes, B runs a reconciliation protocol with X to determine whether any backed-up messages still need to be relayed
- C now starts acting as backup relay to B

This solution is probably the simplest to implement, and doesn't require any code changes in Infinispan. However, there is still a chance of message loss if both the relay *and* the backup relay crash at the same time.

SOLUTION 2:
- After a crash (not a graceful leave!) of a relay coordinator, there has to be a full rebalancing of all keys
- This is wasteful though
- It may not always be needed; perhaps Infinispan could check whether a full rebalancing is required?

Changing to RELAY2, also changed comments

> RELAY2: messages lost when relay coordinator crashes
> ----------------------------------------------------
>
>                 Key: JGRP-1401
>                 URL: https://issues.jboss.org/browse/JGRP-1401
>             Project: JGroups
>          Issue Type: Feature Request
>            Reporter: Bela Ban
>            Assignee: Bela Ban
>             Fix For: 3.2
