[infinispan-issues] [JBoss JIRA] (ISPN-6322) Infinispan can miss incoming commands with JGroupsChannelLookup

Sat Apr 30 01:26:02 EDT 2016

     [ https://issues.jboss.org/browse/ISPN-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tristan Tarrant updated ISPN-6322:
----------------------------------
    Fix Version/s: 8.1.4.Final


> Infinispan can miss incoming commands with JGroupsChannelLookup
> ---------------------------------------------------------------
>
>                 Key: ISPN-6322
>                 URL: https://issues.jboss.org/browse/ISPN-6322
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 8.2.0.CR1, 8.1.2.Final
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>             Fix For: 8.2.1.Final, 9.0.0.Alpha1, 8.1.4.Final, 9.0.0.Final
>
>
> Normally, the JGroupsTransport startup sequence goes like this:
> # Create the {{Channel}}
> # Create the {{CommandAwareRpcDispatcher}} and install it as an {{UpHandler}}
> # Connect the channel
> This way, every {{RequestCorrelator}} message received by the channel is passed up to {{CommandAwareRpcDispatcher}}, which executes the appropriate command.
> When using a {{JGroupsChannelLookup}}, the lookup implementation is allowed to return a {{Channel}} instance that is already connected ({{shouldConnect() == false}}). That means there is now a window where the channel doesn't have an {{UpHandler}}, and messages sent to this node are discarded.
> Normally a node only receives commands after it has sent a join request to the coordinator. There are, however, a few exceptions:
> # On startup, {{LocalTopologyManagerImpl}} sends the join request to the JGroups coordinator, which may not have the {{UpHandler}} yet. This seems to be responsible for the recent hangs in {{ConcurrentStartTest}}. We have a workaround here: use a smaller timeout on the {{CacheTopologyControlCommand(JOIN)}} command, and retry it on {{TimeoutException}}.
> # When a node becomes coordinator, {{ClusterTopologyManagerImpl}} broadcasts a {{GET_STATUS}} request to all cluster members, and expects a response from each of them. The same workaround with a smaller timeout and retries might work here.
> # In replicated mode, write commands are broadcast to all cluster members. There is some commented-out code in {{RpcManagerImpl.invokeRemotelyAsync()}} that might fix it by only waiting for responses from the cache topology members.
> We should consider deprecating {{JGroupsChannelLookup.shouldConnect()}} and requiring that the channel is only connected by {{JGroupsTransport}}, assuming that works with {{ForkChannel}}, of course.
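
The window described in the issue can be illustrated with a toy model. The sketch below does not use any JGroups or Infinispan classes: {{ToyChannel}}, its methods, and the message strings are hypothetical stand-ins, and the only assumption carried over from the description is that a connected channel without an up handler discards what it receives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class UpHandlerWindowDemo {

    /** Hypothetical stand-in for a channel; not the JGroups API. */
    static final class ToyChannel {
        private Consumer<String> upHandler; // null until a handler is installed
        private boolean connected;
        private int dropped;

        void connect() { connected = true; }

        void setUpHandler(Consumer<String> handler) { upHandler = handler; }

        /** Simulates a message from another node reaching this channel. */
        void receive(String message) {
            if (!connected) {
                return; // not a cluster member yet, so nothing is addressed to us
            }
            if (upHandler == null) {
                dropped++; // connected but no handler: the message is discarded
                return;
            }
            upHandler.accept(message);
        }

        int dropped() { return dropped; }
    }

    /**
     * Runs one startup sequence and returns {executed, dropped}.
     * connectedByLookup = true models a lookup that hands over an
     * already-connected channel (shouldConnect() == false).
     */
    static int[] run(boolean connectedByLookup) {
        ToyChannel channel = new ToyChannel();
        List<String> executed = new ArrayList<>();

        if (connectedByLookup) {
            channel.connect(); // connected before the transport sees it
        }
        channel.receive("GET_STATUS"); // another node sends a request early

        channel.setUpHandler(executed::add); // dispatcher installed here

        if (!connectedByLookup) {
            channel.connect(); // normal order: connect last
        }
        channel.receive("PUT k=v");

        return new int[] { executed.size(), channel.dropped() };
    }

    public static void main(String[] args) {
        int[] normal = run(false);
        int[] lookup = run(true);
        System.out.println("normal order:      executed=" + normal[0] + " dropped=" + normal[1]);
        System.out.println("pre-connected:     executed=" + lookup[0] + " dropped=" + lookup[1]);
    }
}
```

In the normal order nothing is discarded, because no message can arrive before the handler exists. With the pre-connected channel, the early request is lost and the sender is left waiting for a response.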

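The workaround mentioned for the join request (a smaller per-attempt timeout, retried on {{TimeoutException}}) might look roughly like the following. This is a generic sketch built on {{java.util.concurrent}} only; {{invokeWithRetry}}, the supplier, and the simulated coordinator are hypothetical and are not the code used in {{LocalTopologyManagerImpl}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class RetryOnTimeoutDemo {

    /**
     * Sends a request up to maxAttempts times, waiting attemptTimeoutMillis
     * for each response. Only a timeout triggers a retry; any other failure
     * propagates to the caller immediately.
     */
    static <T> T invokeWithRetry(Supplier<CompletableFuture<T>> send,
                                 long attemptTimeoutMillis,
                                 int maxAttempts)
            throws TimeoutException, ExecutionException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt sends a fresh request and waits only briefly.
                return send.get().get(attemptTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                last = e; // the receiver may have discarded the request; try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger sent = new AtomicInteger();

        // Simulated coordinator: the first two requests are discarded (the
        // future never completes), the third one is answered.
        Supplier<CompletableFuture<String>> join = () ->
                sent.incrementAndGet() <= 2
                        ? new CompletableFuture<String>()
                        : CompletableFuture.completedFuture("JOIN acknowledged");

        String response = invokeWithRetry(join, 50, 5);
        System.out.println(response + " after " + sent.get() + " attempts");
    }
}
```

A retry like this is only safe when the receiver tolerates seeing the same request more than once, since a timed-out attempt may still have been delivered.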
