[infinispan-issues] [JBoss JIRA] (ISPN-1806) Potential race condition results in StateTransferInProgressException on view change

Tue Feb 7 10:12:49 EST 2012

    [ https://issues.jboss.org/browse/ISPN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664792#comment-12664792 ] 

Dan Berindei commented on ISPN-1806:
------------------------------------

I missed something in the log: FD_SOCK did in fact suspect `node-udp-1`, and it was kicked out of the cluster. However, the node appears to rejoin the cluster about 50ms after it was removed:

{noformat}
20:22:36,602 INFO  [org.jboss.as.clustering.CoreGroupCommunicationService.lifecycle.cluster] (Incoming-9,null) JBAS010247: New cluster view for partition cluster (id: 2, delta: -1, merge: false) : [node-udp-0/cluster]
20:22:36,603 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-9,null) ISPN000094: Received new cluster view: [node-udp-0/cluster|2] [node-udp-0/cluster]

20:22:36,659 INFO  [org.jboss.as.clustering.CoreGroupCommunicationService.lifecycle.cluster] (Incoming-11,null) JBAS010247: New cluster view for partition cluster (id: 3, delta: 1, merge: false) : [node-udp-0/cluster, node-udp-1/cluster]
20:22:36,659 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-11,null) ISPN000094: Received new cluster view: [node-udp-0/cluster|3] [node-udp-0/cluster, node-udp-1/cluster]
{noformat}

About 13 seconds later `node-udp-1` is restarted, and this time it joins the cluster properly. The "ghost" of the previous `node-udp-1` instance is still in the view, so the name appears twice:

{noformat}
20:22:49,479 INFO  [stdout] (pool-5-thread-1) GMS: address=node-udp-1/cluster, cluster=cluster, physical address=127.0.0.1:55300
20:22:49,549 INFO  [org.jboss.as.clustering.CoreGroupCommunicationService.lifecycle.cluster] (Incoming-13,null) JBAS010247: New cluster view for partition cluster (id: 4, delta: 1, merge: false) : [node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster]
20:22:49,552 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-13,null) ISPN000094: Received new cluster view: [node-udp-0/cluster|4] [node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster
{noformat}
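
A minimal sketch for spotting this kind of "ghost" entry in a server log: it parses an ISPN000094 line and reports any logical name that appears more than once in the view. The regular expression is an assumption based only on the message format visible above, and the helper names (`parse_view`, `duplicate_members`) are made up for illustration. Presumably the two `node-udp-1/cluster` entries are distinct JGroups addresses that share a logical name, which this check cannot confirm from the log text alone.

```python
import re
from collections import Counter

# Assumed pattern, derived from the ISPN000094 lines quoted above. The closing
# bracket of the member list is optional because the pasted view-4 line lacks it.
VIEW_RE = re.compile(
    r"Received new cluster view: "
    r"\[(?P<coord>[^|\]]+)\|(?P<id>\d+)\] \[(?P<members>[^\]]*)\]?"
)

def parse_view(line):
    """Return (view_id, members) for an ISPN000094 line, or None if no match."""
    m = VIEW_RE.search(line)
    if m is None:
        return None
    members = [n.strip() for n in m.group("members").split(",") if n.strip()]
    return int(m.group("id")), members

def duplicate_members(members):
    """Logical names that occur more than once in a single view."""
    return sorted(n for n, count in Counter(members).items() if count > 1)

# The view-4 line from the log above, copied as pasted.
line = (
    "20:22:49,552 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] "
    "(Incoming-13,null) ISPN000094: Received new cluster view: "
    "[node-udp-0/cluster|4] [node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster"
)
view_id, members = parse_view(line)
print(view_id, duplicate_members(members))
```

Running the same check over every ISPN000094 line in a log would flag each view in which a stale instance is still listed.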

About 30 seconds after the restarted `node-udp-1` joined (roughly 44 seconds after the first view change), FD kicks the killed instance out of the cluster again:

{noformat}
20:23:20,224 INFO  [org.jboss.as.clustering.CoreGroupCommunicationService.lifecycle.cluster] (Incoming-15,null) JBAS010247: New cluster view for partition cluster (id: 5, delta: -1, merge: false) : [node-udp-0/cluster, node-udp-1/cluster]
{noformat}
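
To make the timeline easier to check, here is a small sketch that computes the gap between consecutive view installations. The timestamps are copied from the JBAS010247 lines quoted above; the `VIEWS` list and the helper names are illustrative only and do not come from any tool.

```python
from datetime import datetime

# (view id, timestamp) pairs copied from the JBAS010247 lines in this comment.
VIEWS = [
    (2, "20:22:36,602"),
    (3, "20:22:36,659"),
    (4, "20:22:49,549"),
    (5, "20:23:20,224"),
]

def parse_ts(ts):
    """Parse the HH:MM:SS,mmm timestamp format used in the server log."""
    return datetime.strptime(ts, "%H:%M:%S,%f")

def view_gaps(views):
    """Seconds elapsed between consecutive view installations."""
    gaps = []
    for (prev_id, prev_ts), (cur_id, cur_ts) in zip(views, views[1:]):
        delta = parse_ts(cur_ts) - parse_ts(prev_ts)
        gaps.append((prev_id, cur_id, delta.total_seconds()))
    return gaps

for prev_id, cur_id, seconds in view_gaps(VIEWS):
    print(f"view {prev_id} -> {cur_id}: {seconds:.3f} s")
```

The sketch assumes all timestamps fall on the same day, which holds for the excerpt here.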

This looks like it could be a JGroups issue, but I don't think it could have caused the hang; the cause is still ISPN-1814.

> Potential race condition results in StateTransferInProgressException on view change
> -----------------------------------------------------------------------------------
>
>                 Key: ISPN-1806
>                 URL: https://issues.jboss.org/browse/ISPN-1806
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.1.0.FINAL
>            Reporter: Paul Ferraro
>            Assignee: Dan Berindei
>            Priority: Critical
>             Fix For: 5.2.0.FINAL
>
>         Attachments: org.jboss.as.test.clustering.unmanaged.singleton.SingletonTestCase-output.txt
>
>
> I'm not sure yet if this is an Infinispan or AS bug.  In summary, I'm performing cache operations from a @ViewChanged event.  Occasionally this results in an endless loop of "Failed to prepare view CacheView" error messages and upon timeout, a StateTransferInProgressException.  I've attached the server log containing the eventual thread dump.
