[jboss-jira] [JBoss JIRA] Commented: (JBMESSAGING-1842) Live node dropping out of cluster can cause duplicate message delivery

Mon Jan 17 00:50:49 EST 2011

    [ https://issues.jboss.org/browse/JBMESSAGING-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575832#comment-12575832 ] 

Yong Hao Gao commented on JBMESSAGING-1842:
-------------------------------------------

Solution proposed by Justin:

  -Create a new table (e.g. CREATE TABLE JBM_CLUSTER (NODE_ID INTEGER, PING_TIMESTAMP BIGINT, PRIMARY KEY(NODE_ID))) that would keep track of which nodes in the cluster are still alive.
  -Each node in the cluster would update its row in the table every 30 seconds to indicate that it is still alive.
  -When a node is notified by JGroups that its "buddy" has failed, it would check this table to see whether the buddy is actually dead.
  -If the data in the table indicates that the buddy is, in fact, dead, fail-over would continue as normal.  During fail-over the "dead" node's row would be removed from the table.
  -If the data in the table does not indicate that the buddy is dead, fail-over would not occur.
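
The check described in the list above could look roughly like the sketch below. It is only an illustration, not the actual patch: an in-memory map stands in for the JBM_CLUSTER table so the example runs without a database, and the class name, the method names, and the staleness threshold (twice the 30-second ping interval) are assumptions. The proposal does not say how old a PING_TIMESTAMP must be before a node counts as dead, nor what to do when a node has no row; the sketch treats a missing row as dead.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the proposed JBM_CLUSTER table (NODE_ID -> PING_TIMESTAMP).
// A real implementation would read and write the table through JDBC instead.
class HeartbeatTableSketch {
    // From the proposal: each node updates its row every 30 seconds.
    static final long PING_INTERVAL_MS = 30_000L;
    // Hypothetical threshold; the proposal does not specify when a timestamp is stale.
    static final long DEAD_AFTER_MS = 2 * PING_INTERVAL_MS;

    private final Map<Integer, Long> pingTimestamps = new HashMap<>();

    // Equivalent of a node refreshing its own PING_TIMESTAMP row.
    void ping(int nodeId, long nowMs) {
        pingTimestamps.put(nodeId, nowMs);
    }

    // Assumption: a node counts as dead if it has no row or its last ping is too old.
    boolean isDead(int nodeId, long nowMs) {
        Long lastPing = pingTimestamps.get(nodeId);
        return lastPing == null || nowMs - lastPing > DEAD_AFTER_MS;
    }

    // Called when JGroups reports that the buddy node failed.
    // Returns true if fail-over should proceed; the dead node's row is then removed.
    boolean confirmFailureAndClaim(int buddyNodeId, long nowMs) {
        if (!isDead(buddyNodeId, nowMs)) {
            return false; // buddy is still pinging, so skip fail-over
        }
        pingTimestamps.remove(buddyNodeId);
        return true;
    }

    // Helper for inspection: does the node currently have a row?
    boolean hasRow(int nodeId) {
        return pingTimestamps.containsKey(nodeId);
    }
}
```

With these assumed values, a buddy that last pinged 45 seconds ago would not be failed over, while one that last pinged 61 seconds ago would be, and its row would be removed.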

> Live node dropping out of cluster can cause duplicate message delivery
> ----------------------------------------------------------------------
>
>                 Key: JBMESSAGING-1842
>                 URL: https://issues.jboss.org/browse/JBMESSAGING-1842
>             Project: JBoss Messaging
>          Issue Type: Bug
>          Components: JMS Clustering
>    Affects Versions: 1.4.0.SP3.CP10
>            Reporter: Justin Bertram
>            Assignee: Yong Hao Gao
>
> When a live node is kicked out of the cluster (for whatever reason), its JBoss Messaging ServerPeer remains active, which means the node is still available to send messages to clients.  However, when the node is kicked out of the cluster, another node in the cluster performs fail-over for that node and takes ownership of that node's messages in the database.  The "dead" node may know nothing about this, may believe it still owns those messages, and will therefore deliver them to clients.  After delivery it tries to remove the message from the database and cannot, because it no longer owns that message.  When this happens the "dead" node issues a WARN like this:
>   WARN  [JDBCPersistenceManager] Failed to remove row for: Reference[23318958991900672]:RELIABLE
> Of course, the node that performed the fail-over, and now actually owns the message, may also deliver it to a client, so the same message can be delivered twice.
