[infinispan-issues] [JBoss JIRA] (ISPN-6666) Infinispan can miss incoming commands with JGroupsChannelLookup

Wed May 18 20:17:01 EDT 2016

Brad Maxwell created ISPN-6666:
----------------------------------

             Summary: Infinispan can miss incoming commands with JGroupsChannelLookup
                 Key: ISPN-6666
                 URL: https://issues.jboss.org/browse/ISPN-6666
             Project: Infinispan
          Issue Type: Bug
          Components: Core
    Affects Versions: 8.2.0.CR1, 8.1.2.Final
            Reporter: Brad Maxwell
            Assignee: Dan Berindei
             Fix For: 8.2.1.Final, 9.0.0.Alpha1, 8.1.4.Final, 9.0.0.Final

Normally, the JGroupsTransport startup sequence goes like this:

# Create the {{Channel}}
# Create the {{CommandAwareRpcDispatcher}} and install it as an {{UpHandler}}
# Connect the channel

This way, every {{RequestCorrelator}} message received by the channel is passed up to {{CommandAwareRpcDispatcher}}, which executes the appropriate command.

When using a {{JGroupsChannelLookup}}, the lookup implementation is allowed to return a {{Channel}} instance that is already connected ({{shouldConnect() == false}}). In that case there is a window, between the channel connecting and the {{CommandAwareRpcDispatcher}} being installed, during which the channel has no {{UpHandler}}, and messages sent to this node are discarded.
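The two startup orders can be compared with a toy model. This is only a sketch: {{ToyChannel}} and its methods are hypothetical stand-ins, not the JGroups or Infinispan API, and the only assumption modelled is that a connected channel with no up-handler drops whatever it receives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model (not the JGroups API) of a channel that drops messages while no up-handler is installed. */
final class UpHandlerWindowDemo {

    /** Hypothetical stand-in for a JGroups channel. */
    static final class ToyChannel {
        private Consumer<String> upHandler; // null until a dispatcher is installed
        private boolean connected;
        private int discarded;

        void connect() { connected = true; }

        void setUpHandler(Consumer<String> handler) { upHandler = handler; }

        /** Simulates a message arriving from another cluster member. */
        void receive(String message) {
            if (!connected) {
                return; // not a cluster member yet, so nobody sends to this node
            }
            if (upHandler == null) {
                discarded++; // connected but no handler: the message is lost
                return;
            }
            upHandler.accept(message);
        }

        int discarded() { return discarded; }
    }

    /** Normal sequence: create the channel, install the handler, then connect. */
    static int[] normalStartup() {
        List<String> delivered = new ArrayList<>();
        ToyChannel channel = new ToyChannel();
        channel.setUpHandler(delivered::add);
        channel.connect();
        channel.receive("GET_STATUS");
        return new int[] { delivered.size(), channel.discarded() };
    }

    /** Lookup sequence: the channel arrives already connected and the handler is installed later. */
    static int[] preConnectedStartup() {
        List<String> delivered = new ArrayList<>();
        ToyChannel channel = new ToyChannel();
        channel.connect();                    // done by the lookup implementation
        channel.receive("GET_STATUS");        // arrives inside the window: discarded
        channel.setUpHandler(delivered::add); // window closes here
        channel.receive("JOIN");              // arrives after the window: delivered
        return new int[] { delivered.size(), channel.discarded() };
    }

    public static void main(String[] args) {
        int[] normal = normalStartup();
        int[] preConnected = preConnectedStartup();
        System.out.println("normal: delivered=" + normal[0] + " discarded=" + normal[1]);
        System.out.println("pre-connected: delivered=" + preConnected[0] + " discarded=" + preConnected[1]);
    }
}
```

In this model the normal order loses nothing, while the pre-connected order loses the one message that arrives before the handler is installed.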

Normally a node only receives commands after it has sent a join request to the coordinator. There are, however, a few exceptions:

# On startup, {{LocalTopologyManagerImpl}} sends the join request to the JGroups coordinator, which may not have the {{UpHandler}} yet. This seems to be responsible for the recent hangs in {{ConcurrentStartTest}}. We have a workaround here: use a smaller timeout on the {{CacheTopologyControlCommand(JOIN)}} command, and retry it on {{TimeoutException}}.
# When a node becomes coordinator, {{ClusterTopologyManagerImpl}} broadcasts a {{GET_STATUS}} request to all cluster members, and expects a response from each of them. The same workaround with a smaller timeout and retries might work here.
# In replicated mode, write commands are broadcast to all cluster members. There is some commented-out code in {{RpcManagerImpl.invokeRemotelyAsync()}} that might fix this by only waiting for responses from the cache topology members.
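The "smaller timeout and retry" workaround mentioned in the first two items could look roughly like the sketch below. It is not the Infinispan code: {{invokeWithRetry}}, {{simulatedJoin}} and the timeout values are made up for illustration, and the remote call is simulated with a {{CompletableFuture}} that is never completed on the first attempt (as if the target had no {{UpHandler}} yet).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Sketch of retrying a request with a short per-attempt timeout (hypothetical helper names). */
final class RetryOnTimeoutDemo {

    /** Sends the request up to maxAttempts times, waiting attemptTimeoutMillis for each response. */
    static <T> T invokeWithRetry(Supplier<CompletableFuture<T>> request,
                                 long attemptTimeoutMillis, int maxAttempts)
            throws TimeoutException, ExecutionException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<T> response = request.get(); // re-send the command
            try {
                return response.get(attemptTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                response.cancel(true); // give up on this attempt and try again
                last = e;
            }
        }
        throw last; // every attempt timed out
    }

    /** Simulated join: the first request is silently dropped, later ones are answered. */
    static CompletableFuture<String> simulatedJoin(AtomicInteger calls) {
        if (calls.incrementAndGet() == 1) {
            return new CompletableFuture<>(); // never completed: the message was discarded
        }
        return CompletableFuture.completedFuture("joined");
    }

    /** Runs the simulation and reports the result together with the number of attempts used. */
    static String demo() {
        AtomicInteger calls = new AtomicInteger();
        try {
            String result = invokeWithRetry(() -> simulatedJoin(calls), 50, 3);
            return result + " after " + calls.get() + " attempts";
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "joined after 2 attempts"
    }
}
```

The short per-attempt timeout matters more than the retry count: with a single long timeout, one dropped message stalls the caller for the whole timeout even though the target becomes responsive moments later.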

We should consider deprecating {{JGroupsChannelLookup.shouldConnect()}} and requiring that the channel be connected only by {{JGroupsTransport}}, assuming that works with {{ForkChannel}}, of course.
