[infinispan-issues] [JBoss JIRA] (ISPN-6925) Race condition in staggered gets

Tue Aug 16 10:31:00 EDT 2016

    [ https://issues.jboss.org/browse/ISPN-6925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279736#comment-13279736 ] 

Paul Ferraro edited comment on ISPN-6925 at 8/16/16 10:30 AM:
--------------------------------------------------------------

Attached log files from 3 WF nodes.  The request is initiated from node2 after the view is stable.  ClusteredGetCommand is sent to node1 at 2016-08-16 10:25:55,216 and to node3 at 2016-08-16 10:25:55,225.  As indicated in the node1 & node3 logs, the command is received, executed, and a response is sent, as expected.  However, node2 eventually logs this:
{noformat}
2016-08-16 10:26:10,216 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (timeout-thread--p12-t1) Responses: [sender=node3, received=false, suspected=false]
[sender=node1, received=false, suspected=false]
{noformat}

was (Author: pferraro):
Attached log files from 3 WF nodes.  The request is initiated from node2 after the view is stable.  ClusteredGetCommand is sent to node1 at 2016-08-16 09:51:39,885 and to node3 at 2016-08-16 09:51:39,894.  As indicated in the node1 & node3 logs, the command is received, executed, and a response is sent, as expected.  However, node2 eventually logs this:
{noformat}
2016-08-16 09:51:54,885 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (timeout-thread--p11-t1) Responses: [sender=node3, received=false, suspected=false]
[sender=node1, received=false, suspected=false]
{noformat}

> Race condition in staggered gets
> --------------------------------
>
>                 Key: ISPN-6925
>                 URL: https://issues.jboss.org/browse/ISPN-6925
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 9.0.0.Alpha3, 8.2.3.Final
>            Reporter: Radim Vansa
>            Assignee: Radim Vansa
>            Priority: Critical
>         Attachments: server.log.node1, server.log.node2, server.log.node3
>
>
> There's a race condition in {{CommandAwareRpcDispatcher}} when we do staggered gets. The {{RspList}} is prepared first, and each {{Rsp}} is then filled in by {{processCallsStaggered$lambda}}. Two threads can each mark their own response as received and then see that the other response has not been received yet, because there's no memory barrier between {{setValue}}/{{setException}} and the {{wasReceived}} check.
> The race above happens when two responses arrive but neither is accepted by the filter. There's a second race in {{JGroupsTransport}}, when the first response is accepted and another one arrives afterwards. In {{JGroupsTransport.invokeRemotelyAsync}}, the lambda passed to {{rspListFuture.thenApply}} may see another thread concurrently modifying the rsps; e.g. in {{checkRsp}} you find that the concurrently written response was received and, according to its flags, is not an exception, but its value is still null, so you return null even though the other {{Rsp}} may hold a valid response.
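
One way to avoid both races could be sketched as follows. This is only an illustration under stated assumptions, not the actual fix: {{ResponseCollector}}, {{onResponse}} and {{future}} are hypothetical names that do not exist in Infinispan or JGroups. The idea is to replace the unsynchronized "mark mine as received, then check the others" pattern with a single atomic countdown, so that exactly one thread observes that all responses have arrived, and to publish the accepted value through one atomic completion instead of re-reading an {{Rsp}} that another thread may still be writing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical collector for staggered responses; not an Infinispan or JGroups class. */
final class ResponseCollector {
    // Number of responses still outstanding.
    private final AtomicInteger pending;
    // Completed exactly once: with the first accepted value, or with null
    // once every target has answered and the filter accepted none of them.
    private final CompletableFuture<Object> result = new CompletableFuture<>();

    ResponseCollector(int expectedResponses) {
        this.pending = new AtomicInteger(expectedResponses);
    }

    /**
     * Called by whichever thread delivers a response.
     * 'accepted' stands in for the response filter's decision (an assumption of this sketch).
     */
    void onResponse(Object value, boolean accepted) {
        if (accepted) {
            // complete() is atomic: only the first caller wins, and readers
            // of the future never see a half-written response.
            result.complete(value);
        }
        // decrementAndGet() is an atomic read-modify-write with volatile
        // semantics, so exactly one thread sees the counter reach zero,
        // regardless of how the deliveries interleave.
        if (pending.decrementAndGet() == 0) {
            // No effect if an accepted value already completed the future.
            result.complete(null);
        }
    }

    CompletableFuture<Object> future() {
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        // Two responses arrive concurrently and neither passes the filter.
        ResponseCollector collector = new ResponseCollector(2);
        Thread t1 = new Thread(() -> collector.onResponse(null, false));
        Thread t2 = new Thread(() -> collector.onResponse(null, false));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("completed: " + collector.future().isDone());
    }
}
```

With the countdown, the case from the first paragraph (two rejected responses) always completes the future instead of waiting for the timeout seen in the node2 log. In the case from the second paragraph, a late response cannot overwrite or mask the value that was already accepted.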
