Sebastian Łaskawiec closed ISPN-6322.
-------------------------------------
Infinispan can miss incoming commands with JGroupsChannelLookup
---------------------------------------------------------------
Key: ISPN-6322
URL: https://issues.jboss.org/browse/ISPN-6322
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 8.2.0.CR1, 8.1.2.Final
Reporter: Dan Berindei
Assignee: Dan Berindei
Fix For: 8.2.1.Final, 9.0.0.Alpha1, 9.0.0.Final
Normally, the JGroupsTransport startup sequence goes like this:
# Create the {{Channel}}
# Create the {{CommandAwareRpcDispatcher}} and install it as an {{UpHandler}}
# Connect the channel
This way, every {{RequestCorrelator}} message received by the channel is passed up to
{{CommandAwareRpcDispatcher}}, which executes the appropriate command.
When using a {{JGroupsChannelLookup}}, the lookup implementation is allowed to return a
{{Channel}} instance that is already connected ({{shouldConnect() == false}}). That means
there is now a window where the channel doesn't have an {{UpHandler}}, and messages
sent to this node are discarded.
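
The window can be illustrated with a toy model. This is only a sketch: {{ToyChannel}} and {{StartupOrderDemo}} are hypothetical names made up for the illustration, and none of this is the JGroups or Infinispan API. It assumes that a connected channel with no handler installed silently drops whatever it receives, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy stand-in for a channel (hypothetical, NOT the JGroups Channel API).
final class ToyChannel {
    private Consumer<String> upHandler; // null until a dispatcher is installed
    private boolean connected;
    private int discarded;

    void setUpHandler(Consumer<String> handler) { this.upHandler = handler; }

    void connect() { this.connected = true; }

    // Simulates a message arriving from the network.
    void receive(String message) {
        if (!connected) {
            return; // not a cluster member yet, so nothing can arrive
        }
        if (upHandler == null) {
            discarded++; // connected but no handler: the message is lost
            return;
        }
        upHandler.accept(message);
    }

    int discardedCount() { return discarded; }
}

final class StartupOrderDemo {
    // Normal sequence: create the channel, install the handler, then connect.
    static int lostWithHandlerFirst() {
        List<String> delivered = new ArrayList<>();
        ToyChannel channel = new ToyChannel();
        channel.setUpHandler(delivered::add);
        channel.connect();
        channel.receive("GET_STATUS");
        return channel.discardedCount();
    }

    // Lookup case: the channel is already connected when the handler is installed.
    static int lostWithPreConnectedChannel() {
        List<String> delivered = new ArrayList<>();
        ToyChannel channel = new ToyChannel();
        channel.connect();
        channel.receive("GET_STATUS"); // arrives inside the window and is dropped
        channel.setUpHandler(delivered::add);
        channel.receive("JOIN"); // arrives after the window and is delivered
        return channel.discardedCount();
    }

    public static void main(String[] args) {
        System.out.println("handler first: lost " + lostWithHandlerFirst());
        System.out.println("pre-connected: lost " + lostWithPreConnectedChannel());
    }
}
```

In the first ordering nothing can arrive before the handler exists. In the second, anything sent between the connect and the handler installation is lost without any error on the receiving side.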
Normally, a node only receives commands after it has sent a join request to the coordinator.
There are, however, a few exceptions:
# On startup, {{LocalTopologyManagerImpl}} sends the join request to the JGroups
coordinator, which may not have the {{UpHandler}} yet. This seems to be responsible for
the recent hangs in {{ConcurrentStartTest}}. We have a workaround here: use a smaller
timeout on the {{CacheTopologyControlCommand(JOIN)}} command, and retry it on
{{TimeoutException}}.
# When a node becomes coordinator, {{ClusterTopologyManagerImpl}} broadcasts a
{{GET_STATUS}} request to all cluster members, and expects a response from each of them.
The same workaround with a smaller timeout and retries might work here.
# In replicated mode, write commands are broadcast to all cluster members. There is
some commented-out code in {{RpcManagerImpl.invokeRemotelyAsync()}} that might fix it by
only waiting for responses from the cache topology members.
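
The "smaller timeout and retry" workaround mentioned in the first two items could look roughly like the sketch below. It is not the actual Infinispan code: {{JoinRetry}}, {{FlakyCoordinator}} and {{joinWithRetries}} are hypothetical names, the JDK's {{java.util.concurrent.TimeoutException}} stands in for whatever exception the real command invocation throws, and the coordinator is simulated as simply not answering the first few requests.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Simulated coordinator (hypothetical): it does not answer the first few
// requests, as if its UpHandler had not been installed yet.
final class FlakyCoordinator {
    private int remainingDrops;

    FlakyCoordinator(int droppedRequests) { this.remainingDrops = droppedRequests; }

    String join() throws TimeoutException {
        if (remainingDrops > 0) {
            remainingDrops--;
            throw new TimeoutException("no response to JOIN");
        }
        return "JOINED";
    }
}

final class JoinRetry {
    // Runs the attempt up to maxAttempts times, retrying only on TimeoutException.
    // Each attempt is expected to use a short timeout of its own.
    static <T> T retryOnTimeout(Callable<T> attempt, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TimeoutException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (TimeoutException e) {
                last = e; // the request was probably discarded; send it again
            }
        }
        throw last; // every attempt timed out
    }

    // Convenience wrapper for the demo: returns "FAILED" instead of throwing.
    static String joinWithRetries(int droppedRequests, int maxAttempts) {
        FlakyCoordinator coordinator = new FlakyCoordinator(droppedRequests);
        try {
            return retryOnTimeout(coordinator::join, maxAttempts);
        } catch (Exception e) {
            return "FAILED";
        }
    }

    public static void main(String[] args) {
        System.out.println(joinWithRetries(2, 3));
        System.out.println(joinWithRetries(3, 3));
    }
}
```

The trade-off is that a short per-attempt timeout recovers quickly from a discarded request, but the retried command must be safe to deliver more than once in case the first copy was only slow and not actually lost.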
We should consider deprecating {{JGroupsChannelLookup.shouldConnect()}} and requiring
that the channel is only connected by {{JGroupsTransport}}. Assuming that works with
{{ForkChannel}}, of course.