[infinispan-issues] [JBoss JIRA] (ISPN-6925) Race condition in staggered gets
Radim Vansa (JIRA)
issues at jboss.org
Tue Aug 16 12:21:00 EDT 2016
[ https://issues.jboss.org/browse/ISPN-6925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279870#comment-13279870 ]
Radim Vansa edited comment on ISPN-6925 at 8/16/16 12:20 PM:
-------------------------------------------------------------
Hi [~pferraro], I can confirm that the messages are properly sent and received (by JGroups) but not processed by Infinispan. Regrettably, there's no further logging that would pinpoint the issue, but given that the responses are successful (so there's no issue with copying the content), it can only be this race condition (which I doubt, since the timestamps suggest the events are not interleaved) or an unknown bug. I guess you can't reproduce this in a unit test, can you? I would also be interested in whether you can reproduce it with the fix from the PR.
{noformat}
node2:
2016-08-16 10:25:55,217 TRACE [org.jgroups.protocols.UDP] (default task-1) node2: sending msg to node1, src=node2, headers are RequestCorrelator: corr_id=200, type=REQ, req_id=42, rsp_expected=true, FORK: ee:web, UNICAST3: DATA, seqno=65, TP: [cluster_name=ee]
node1:
2016-08-16 10:25:55,219 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (thread-15,ee,node1) About to send back response SuccessfulResponse{responseValue=ImmortalCacheValue {value=org.wildfly.clustering.web.infinispan.session.SessionCreationMetaDataEntry at 669d8e2a}} for command ClusteredGetCommand{key=c-TToH8iNOx-6YCwJnHNKzxhNKEc0DenxW12ZMgo, flags=[FORCE_WRITE_LOCK]}
node2:
2016-08-16 10:25:55,220 TRACE [org.jgroups.protocols.UDP] (thread-17,ee,node2) node2: received [dst: node2, src: node1 (4 headers), size=22 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER], headers are RequestCorrelator: corr_id=200, type=RSP, req_id=42, rsp_expected=true, FORK: ee:web, UNICAST3: DATA, seqno=65, conn_id=1, TP: [cluster_name=ee]
2016-08-16 10:25:55,225 TRACE [org.jgroups.protocols.UDP] (timeout-thread--p12-t1) node2: sending msg to node3, src=node2, headers are RequestCorrelator: corr_id=200, type=REQ, req_id=43, rsp_expected=true, FORK: ee:web, UNICAST3: DATA, seqno=22, conn_id=1, TP: [cluster_name=ee]
node3:
2016-08-16 10:25:55,228 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (thread-18,ee,node3) About to send back response SuccessfulResponse{responseValue=ImmortalCacheValue {value=org.wildfly.clustering.web.infinispan.session.SessionCreationMetaDataEntry at 2d3471f8}} for command ClusteredGetCommand{key=c-TToH8iNOx-6YCwJnHNKzxhNKEc0DenxW12ZMgo, flags=[FORCE_WRITE_LOCK]}
2016-08-16 10:25:55,232 TRACE [org.jgroups.protocols.UDP] (thread-17,ee,node2) node2: received [dst: node2, src: node3 (4 headers), size=22 bytes, flags=OOB|DONT_BUNDLE|NO_TOTAL_ORDER], headers are RequestCorrelator: corr_id=200, type=RSP, req_id=43, rsp_expected=true, FORK: ee:web, UNICAST3: DATA, seqno=20, conn_id=1, TP: [cluster_name=ee]
{noformat}
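For context, the staggering is visible in the log above: node2 asks node1 at 10:25:55,217 and only asks node3 at 10:25:55,225, after a timeout task fires. The snippet below is a minimal sketch of that pattern, not the actual {{CommandAwareRpcDispatcher}} code; all class, method and constant names in it are illustrative only.
{code:java}
// Illustrative sketch of a staggered get: send the request to the first
// target, and only fall back to the next target if no response has arrived
// within the stagger delay. The first successful response wins.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StaggeredGetSketch {
   private static final long STAGGER_DELAY_MS = 5;
   private final ScheduledExecutorService timeoutExecutor =
         Executors.newSingleThreadScheduledExecutor();

   public CompletableFuture<Object> staggeredGet(List<String> targets) {
      CompletableFuture<Object> result = new CompletableFuture<>();
      sendNext(targets, 0, result);
      return result;
   }

   private void sendNext(List<String> targets, int index, CompletableFuture<Object> result) {
      if (index >= targets.size() || result.isDone()) {
         return;
      }
      // Fire the request to the current target; the first response completes the future.
      sendRequest(targets.get(index)).thenAccept(result::complete);
      // Stagger: if nothing came back in time, also try the next target.
      timeoutExecutor.schedule(() -> sendNext(targets, index + 1, result),
            STAGGER_DELAY_MS, TimeUnit.MILLISECONDS);
   }

   // Placeholder for the actual JGroups RPC; here it just completes asynchronously.
   private CompletableFuture<Object> sendRequest(String target) {
      return CompletableFuture.supplyAsync(() -> "response from " + target);
   }
}
{code}
The point of staggering is to avoid sending the same read to every owner up front while bounding the extra latency to the stagger delay.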
> Race condition in staggered gets
> --------------------------------
>
> Key: ISPN-6925
> URL: https://issues.jboss.org/browse/ISPN-6925
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.0.0.Alpha3, 8.2.3.Final
> Reporter: Radim Vansa
> Assignee: Radim Vansa
> Priority: Critical
> Attachments: server.log.node1, server.log.node2, server.log.node3
>
>
> There's a race condition in {{CommandAwareRpcDispatcher}} when we do staggered gets. The {{RspList}} is prepared, and then in {{processCallsStaggered$lambda}} each {{Rsp}} is filled in: two threads can both set their response as received but later see that the other response was not received yet, because there's no memory barrier between the {{setValue}}/{{setException}} and the {{wasReceived}} check.
> The race above happens when two responses arrive but neither of them is accepted by the filter; there's a second race in {{JGroupsTransport}} when the first response is accepted and then another one arrives. In {{JGroupsTransport.invokeRemotelyAsync}}, in the lambda handling {{rspListFuture.thenApply}}, we may see another thread concurrently modifying the rsps; e.g. in {{checkRsp}} you find out that the concurrently written response was received and is not an exception according to the flags, but its value is still null, so you return null even though there can be a valid response in the other {{Rsp}} (see the sketch below).
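To make the first race concrete, here is a hypothetical model of the pattern described in the report, assuming the received flag and value are written and read without any synchronization, as the missing memory barrier implies; the class and method names are made up and this is not the actual Infinispan or JGroups code.
{code:java}
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the first race. Each responder thread records its own
// response and then checks whether the other response has already arrived;
// whichever thread sees both completes the future. With plain fields there is
// no happens-before edge between the write in one thread and the read in the
// other, so both threads may see the other response as "not received".
public class StaggeredResponseRace {

   static class RspModel {
      boolean received;   // plain field: no memory barrier on write or read
      Object value;

      void setValue(Object v) {
         this.value = v;
         this.received = true;
      }

      boolean wasReceived() {
         return received;
      }
   }

   final RspModel rsp1 = new RspModel();
   final RspModel rsp2 = new RspModel();
   final CompletableFuture<Void> allReceived = new CompletableFuture<>();

   // Invoked from two different JGroups threads, one per incoming response.
   void onResponse(RspModel mine, RspModel other, Object value) {
      mine.setValue(value);
      // Racy check: the other thread's setValue() may not be visible here yet,
      // even though it already happened in wall-clock time. If both threads
      // take the "not received" branch, nobody ever completes the future.
      if (other.wasReceived()) {
         allReceived.complete(null);
      }
   }
}
{code}
Because neither thread's write is guaranteed to be visible to the other's read, both can conclude that the other response is still outstanding and the get never completes.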
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)