[infinispan-issues] [JBoss JIRA] (ISPN-1806) Potential race condition results in StateTransferInProgressException on view change

Mon Jan 30 19:33:48 EST 2012

    [ https://issues.jboss.org/browse/ISPN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662912#comment-12662912 ] 

Dan Berindei commented on ISPN-1806:
------------------------------------

According to the attached log, there is no problem with the commit/StateTransferInProgressException itself - the transaction thread is blocked because CacheViewsManagerImpl is unable to install a new cache view. I created a separate issue to describe that problem: ISPN-1814.

The log also shows another problem: JGroups apparently didn't exclude the killed cluster member from the view, even after that member failed to ACK the new view:

{noformat}
20:22:49,552 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-13,null) ISPN000094: Received new cluster view: [node-udp-0/cluster|4] [node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster]
20:22:52,497 WARNING [org.jgroups.protocols.pbcast.GMS] (pool-5-thread-1) JOIN(node-udp-1/cluster) sent to node-udp-0/cluster timed out (after 3000 ms), retrying
20:22:54,549 WARNING [org.jgroups.protocols.pbcast.GMS] (ViewHandler,cluster,node-udp-0/cluster) node-udp-0/cluster: failed to collect all ACKs (expected=2) for view [node-udp-0/cluster|4] after 5000ms, missing ACKs from [node-udp-1/cluster]
20:22:54,835 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (pool-9-thread-1) ISPN000094: Received new cluster view: [node-udp-0/cluster|4] [node-udp-0/cluster, 1903f28a-f7d1-1488-62a3-03a2df0f5b62, node-udp-1/cluster]
...
20:23:20,174 INFO  [org.jboss.as.clustering.CoreGroupCommunicationService.cluster] (VERIFY_SUSPECT.TimerThread,cluster,node-udp-1/cluster) JBAS010254: Suspected member: 1903f28a-f7d1-1488-62a3-03a2df0f5b62
{noformat}

I'm not sure why this happens - is FD/FD_ALL enabled in the JGroups configuration?

And finally, there are some issues with the log itself:
* The log messages from Incoming threads have a 'null' instead of the node name. Not sure if it's related to the (caught) exception in {{CoreGroupCommunicationService$MembershipListenerImpl.viewAccepted}}, since the OOB threads look fine.
* The restarted node should start with a different name each time; it's hard to understand what's happening with two {{node-udp-1/cluster}} members in the same cluster.
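
To illustrate the last point, here is a minimal sketch of one way a test harness could give each restart a distinct logical name. The class {{UniqueNodeName}} and the helper {{uniqueName}} are hypothetical, not part of AS or Infinispan, and how the name is actually handed to the node (for example via {{JChannel.setName}} or a node-name system property) is an assumption that depends on the test setup.

```java
import java.util.UUID;

public class UniqueNodeName {

    // Hypothetical helper: appends a short random suffix to the base name,
    // so a restarted node does not reuse the logical name of the killed one.
    static String uniqueName(String baseName) {
        String suffix = UUID.randomUUID().toString().substring(0, 8);
        return baseName + "-" + suffix;
    }

    public static void main(String[] args) {
        // Two "starts" of the same node yield two different logical names.
        String firstStart = uniqueName("node-udp-1");
        String secondStart = uniqueName("node-udp-1");
        System.out.println(firstStart);
        System.out.println(secondStart);
        // Assumption: the harness would then pass the name to the node,
        // e.g. with JChannel.setName(name) before connecting the channel.
    }
}
```

With names like these, a view that still contains the old member and the restarted one would list two distinguishable entries instead of two {{node-udp-1/cluster}} members.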

> Potential race condition results in StateTransferInProgressException on view change
> -----------------------------------------------------------------------------------
>
>                 Key: ISPN-1806
>                 URL: https://issues.jboss.org/browse/ISPN-1806
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State transfer
>    Affects Versions: 5.1.0.FINAL
>            Reporter: Paul Ferraro
>            Assignee: Dan Berindei
>            Priority: Critical
>         Attachments: org.jboss.as.test.clustering.unmanaged.singleton.SingletonTestCase-output.txt
>
>
> I'm not sure yet if this is an Infinispan or AS bug.  In summary, I'm performing cache operations from a @ViewChanged event.  Occasionally this results in an endless loop of "Failed to prepare view CacheView" error messages and upon timeout, a StateTransferInProgressException.  I've attached the server log containing the eventual thread dump.
