[JBoss JIRA] (ISPN-6384) JGroupsTransport.invokeRemotelyAsync with a filter returns null on timeout
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6384?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant resolved ISPN-6384.
-----------------------------------
Resolution: Done
> JGroupsTransport.invokeRemotelyAsync with a filter returns null on timeout
> --------------------------------------------------------------------------
>
> Key: ISPN-6384
> URL: https://issues.jboss.org/browse/ISPN-6384
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.Final, 9.0.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.1.4.Final, 9.0.0.Final, 9.0.0.Alpha1, 8.2.1.Final
>
>
> {{JGroupsTransport.invokeRemotelyAsync()}} has a {{ResponseFilter}} parameter that was traditionally used only with {{ResponseMode.GET_FIRST}}, for remote get commands. In that particular case, returning {{null}} when some of the nodes timed out and the other nodes returned invalid responses (i.e. {{null}}) was acceptable.
> Since ISPN-4979, {{JGroupsTransport.invokeRemotelyAsync()}} is also used by {{ClusterTopologyManagerImpl}}, with {{ResponseMode.GET_ALL}}. Here, however, returning {{null}} instead of throwing a {{TimeoutException}} is not acceptable.
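> A minimal sketch of the intended distinction (the helper method and its name are illustrative only, not the actual transport code):
> {code:java}
> import java.util.Map;
> import java.util.Set;
>
> import org.infinispan.remoting.responses.Response;
> import org.infinispan.remoting.rpc.ResponseMode;
> import org.infinispan.remoting.transport.Address;
> import org.infinispan.util.concurrent.TimeoutException;
>
> // With GET_FIRST a missing or filtered-out response may legitimately collapse
> // to null, but with GET_ALL a target that never answered should surface as an
> // explicit TimeoutException instead of a null result.
> final class ResponseCheckSketch {
>    static Map<Address, Response> checkResponses(ResponseMode mode,
>                                                 Map<Address, Response> received,
>                                                 Set<Address> targets) {
>       if (mode == ResponseMode.GET_ALL) {
>          for (Address target : targets) {
>             if (!received.containsKey(target)) {
>                // TimeoutException here is Infinispan's unchecked variant
>                throw new TimeoutException("No response received from " + target);
>             }
>          }
>       }
>       return received;
>    }
> }
> {code}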
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6518) org.infinispan.transaction.xa.GlobalTransaction objects are not cleared properly
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6518?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-6518:
----------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> org.infinispan.transaction.xa.GlobalTransaction objects are not cleared properly
> --------------------------------------------------------------------------------
>
> Key: ISPN-6518
> URL: https://issues.jboss.org/browse/ISPN-6518
> Project: Infinispan
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 8.1.3.Final, 8.2.1.Final, 9.0.0.Alpha1
> Reporter: Matej Čimbora
> Assignee: Pedro Ruivo
> Priority: Critical
> Fix For: 9.0.0.Alpha2, 8.1.4.Final, 8.2.2.Final, 9.0.0.Final
>
>
> The issue was spotted in 6-hour soak tests and affects both distributed and replicated modes.
> The test shows a steady increase in heap usage. Closer examination of the JFR recording shows that at the end of the test there are almost 22M live org.infinispan.transaction.xa.GlobalTransaction instances in the heap, totaling 667 MB.
> Top heap consumers (final stage of the test):
> ||Class||Instances||Size (bytes)||Percentage of Heap (%)||
> |byte[]|60,218|830,445,604|35.168|
> |org.infinispan.transaction.xa.GlobalTransaction|21,846,176|699,077,616|29.605|
> |java.util.concurrent.ConcurrentHashMap$Node|21,795,718|697,462,992|29.537|
> |java.util.concurrent.ConcurrentHashMap$Node[]|129|134,352,464|5.69|
> Transaction configuration:
> <transaction transaction-manager-lookup="org.infinispan.transaction.lookup.GenericTransactionManagerLookup" mode="NON_DURABLE_XA" />
> Please let me know if you'd like me to share more details from the JFR recording.
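> For illustration only: the numbers above are consistent with a registry keyed by {{GlobalTransaction}} whose entries are never removed on completion. A minimal sketch of that pattern (hypothetical class, not the actual {{TransactionTable}} code):
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
>
> // Every live entry pins one key object plus one ConcurrentHashMap$Node,
> // matching the ~22M GlobalTransaction and CHM$Node instances reported above.
> final class TransactionRegistrySketch<K, TX> {
>    private final ConcurrentMap<K, TX> transactions = new ConcurrentHashMap<>();
>
>    void register(K gtx, TX tx) {
>       transactions.put(gtx, tx);
>    }
>
>    // The fix amounts to guaranteeing this runs on every commit/rollback path;
>    // any missed path leaks the entry for the lifetime of the cache.
>    void onCompletion(K gtx) {
>       transactions.remove(gtx);
>    }
> }
> {code}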
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6341) StateTransferManager should be the first component to stop
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6341?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-6341:
----------------------------------
Fix Version/s: 8.1.4.Final
> StateTransferManager should be the first component to stop
> ----------------------------------------------------------
>
> Key: ISPN-6341
> URL: https://issues.jboss.org/browse/ISPN-6341
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.2.1.Final, 9.0.0.Alpha1, 8.1.4.Final
>
>
> When a cache stops, it first removes the component registry from the {{GlobalComponentsRegistry}}'s {{namedComponents}} map, which means the node (let's call it {{A}}) will reply with a {{CacheNotFoundResponse}} to any remote command.
> Another node {{B}} trying to execute a write/transactional command will receive the {{CacheNotFoundResponse}}, assume that a new cache topology with id {{current topology id + 1}} is coming soon, and wait for that new topology before retrying.
> Normally this is not a problem, because {{StateTransferManagerImpl.stop()}} sends a {{CacheTopologyControlCommand(LEAVE)}} to the coordinator quickly enough, then {{B}} receives the {{current topology id + 1}} topology and retries the command.
> But in some cases, the cache components that stop before {{StateTransferManagerImpl}} can take a long time to do so. In particular, because of {{ISPN-5507}}, {{TransactionTable}} can block for {{cacheStopTimeout}} if there are remote transactions in progress, even though the cache can no longer process remote commands.
> We should give {{StateTransferManagerImpl.stop()}} a priority of {{0}}, so that the {{CacheTopologyControlCommand(LEAVE)}} command is sent as soon as possible.
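> A sketch of the proposed change (simplified; assumes the {{@Stop}} lifecycle contract where lower priorities run earlier):
> {code:java}
> import org.infinispan.factories.annotations.Stop;
>
> public class StateTransferManagerStopSketch {
>    // Priority 0 makes this the first stop method to run, so the LEAVE command
>    // reaches the coordinator before slower components (e.g. TransactionTable,
>    // which may block for cacheStopTimeout) begin shutting down.
>    @Stop(priority = 0)
>    public void stop() {
>       // send CacheTopologyControlCommand(LEAVE) to the coordinator here
>    }
> }
> {code}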
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6341) StateTransferManager should be the first component to stop
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6341?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant resolved ISPN-6341.
-----------------------------------
Resolution: Done
> StateTransferManager should be the first component to stop
> ----------------------------------------------------------
>
> Key: ISPN-6341
> URL: https://issues.jboss.org/browse/ISPN-6341
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.1.4.Final, 9.0.0.Alpha1, 8.2.1.Final
>
>
> When a cache stops, it first removes the component registry from the {{GlobalComponentsRegistry}}'s {{namedComponents}} map, which means the node (let's call it {{A}}) will reply with a {{CacheNotFoundResponse}} to any remote command.
> Another node {{B}} trying to execute a write/transactional command will receive the {{CacheNotFoundResponse}}, assume that a new cache topology with id {{current topology id + 1}} is coming soon, and wait for that new topology before retrying.
> Normally this is not a problem, because {{StateTransferManagerImpl.stop()}} sends a {{CacheTopologyControlCommand(LEAVE)}} to the coordinator quickly enough, then {{B}} receives the {{current topology id + 1}} topology and retries the command.
> But in some cases, the cache components that stop before {{StateTransferManagerImpl}} can take a long time to do so. In particular, because of {{ISPN-5507}}, {{TransactionTable}} can block for {{cacheStopTimeout}} if there are remote transactions in progress, even though the cache can no longer process remote commands.
> We should give {{StateTransferManagerImpl.stop()}} a priority of {{0}}, so that the {{CacheTopologyControlCommand(LEAVE)}} command is sent as soon as possible.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6384) JGroupsTransport.invokeRemotelyAsync with a filter returns null on timeout
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6384?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant reopened ISPN-6384:
-----------------------------------
> JGroupsTransport.invokeRemotelyAsync with a filter returns null on timeout
> --------------------------------------------------------------------------
>
> Key: ISPN-6384
> URL: https://issues.jboss.org/browse/ISPN-6384
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.Final, 9.0.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.2.1.Final, 9.0.0.Alpha1, 9.0.0.Final
>
>
> {{JGroupsTransport.invokeRemotelyAsync()}} has a {{ResponseFilter}} parameter that was traditionally used only with {{ResponseMode.GET_FIRST}}, for remote get commands. In that particular case, returning {{null}} when some of the nodes timed out and the other nodes returned invalid responses (i.e. {{null}}) was acceptable.
> Since ISPN-4979, {{JGroupsTransport.invokeRemotelyAsync()}} is also used by {{ClusterTopologyManagerImpl}}, with {{ResponseMode.GET_ALL}}. Here, however, returning {{null}} instead of throwing a {{TimeoutException}} is not acceptable.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6322) Infinispan can miss incoming commands with JGroupsChannelLookup
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6322?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant reopened ISPN-6322:
-----------------------------------
> Infinispan can miss incoming commands with JGroupsChannelLookup
> ---------------------------------------------------------------
>
> Key: ISPN-6322
> URL: https://issues.jboss.org/browse/ISPN-6322
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.CR1, 8.1.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.2.1.Final, 9.0.0.Alpha1, 9.0.0.Final
>
>
> Normally, the JGroupsTransport startup sequence goes like this:
> # Create the {{Channel}}
> # Create the {{CommandAwareRpcDispatcher}} and install it as an {{UpHandler}}
> # Connect the channel
> This way, every {{RequestCorrelator}} message received by the channel is passed up to {{CommandAwareRpcDispatcher}}, which executes the appropriate command.
> When using a {{JGroupsChannelLookup}}, the lookup implementation is allowed to return a {{Channel}} instance that is already connected ({{shouldConnect() == false}}). That means there is now a window where the channel doesn't have an {{UpHandler}}, and messages sent to this node are discarded.
> Normally a node only receives commands after it sent a join request to the coordinator. There are however a few exceptions:
> # On startup, {{LocalTopologyManagerImpl}} sends the join request to the JGroups coordinator, which may not have the {{UpHandler}} yet. This seems to be responsible for the recent hangs in {{ConcurrentStartTest}}. We have a workaround here: use a smaller timeout on the {{CacheTopologyControlCommand(JOIN)}} command and retry it on {{TimeoutException}}.
> # When a node becomes coordinator, {{ClusterTopologyManagerImpl}} broadcasts a {{GET_STATUS}} request to all cluster members and expects a response from each of them. The same workaround with a smaller timeout and retries might work here.
> # In replicated mode, write commands are broadcast to all cluster members. There is some commented-out code in {{RpcManagerImpl.invokeRemotelyAsync()}} that might fix this by waiting only for responses from the cache topology members.
> We should consider deprecating {{JGroupsChannelLookup.shouldConnect()}} and requiring that the channel is only connected by {{JGroupsTransport}}. Assuming that works with {{ForkChannel}}, of course.
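> For reference, the safe startup ordering above as a minimal JGroups sketch ({{dispatcher}} stands in for the {{CommandAwareRpcDispatcher}} installation; the config and cluster names are placeholders):
> {code:java}
> import org.jgroups.JChannel;
> import org.jgroups.UpHandler;
>
> public class TransportStartupSketch {
>    // The UpHandler must be installed before connect(): any message delivered
>    // to a connected channel without an UpHandler is discarded, which is
>    // exactly the window a pre-connected lookup channel opens.
>    public static JChannel start(String configXml, String clusterName,
>                                 UpHandler dispatcher) throws Exception {
>       JChannel channel = new JChannel(configXml); // 1. create the channel
>       channel.setUpHandler(dispatcher);           // 2. install the handler
>       channel.connect(clusterName);               // 3. connect last
>       return channel;
>    }
> }
> {code}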
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6322) Infinispan can miss incoming commands with JGroupsChannelLookup
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6322?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-6322:
----------------------------------
Fix Version/s: 8.1.4.Final
> Infinispan can miss incoming commands with JGroupsChannelLookup
> ---------------------------------------------------------------
>
> Key: ISPN-6322
> URL: https://issues.jboss.org/browse/ISPN-6322
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.CR1, 8.1.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.2.1.Final, 9.0.0.Alpha1, 8.1.4.Final, 9.0.0.Final
>
>
> Normally, the JGroupsTransport startup sequence goes like this:
> # Create the {{Channel}}
> # Create the {{CommandAwareRpcDispatcher}} and install it as an {{UpHandler}}
> # Connect the channel
> This way, every {{RequestCorrelator}} message received by the channel is passed up to {{CommandAwareRpcDispatcher}}, which executes the appropriate command.
> When using a {{JGroupsChannelLookup}}, the lookup implementation is allowed to return a {{Channel}} instance that is already connected ({{shouldConnect() == false}}). That means there is now a window where the channel doesn't have an {{UpHandler}}, and messages sent to this node are discarded.
> Normally a node only receives commands after it sent a join request to the coordinator. There are however a few exceptions:
> # On startup, {{LocalTopologyManagerImpl}} sends the join request to the JGroups coordinator, which may not have the {{UpHandler}} yet. This seems to be responsible for the recent hangs in {{ConcurrentStartTest}}. We have a workaround here: use a smaller timeout on the {{CacheTopologyControlCommand(JOIN)}} command and retry it on {{TimeoutException}}.
> # When a node becomes coordinator, {{ClusterTopologyManagerImpl}} broadcasts a {{GET_STATUS}} request to all cluster members and expects a response from each of them. The same workaround with a smaller timeout and retries might work here.
> # In replicated mode, write commands are broadcast to all cluster members. There is some commented-out code in {{RpcManagerImpl.invokeRemotelyAsync()}} that might fix this by waiting only for responses from the cache topology members.
> We should consider deprecating {{JGroupsChannelLookup.shouldConnect()}} and requiring that the channel is only connected by {{JGroupsTransport}}. Assuming that works with {{ForkChannel}}, of course.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6322) Infinispan can miss incoming commands with JGroupsChannelLookup
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6322?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant resolved ISPN-6322.
-----------------------------------
Resolution: Done
> Infinispan can miss incoming commands with JGroupsChannelLookup
> ---------------------------------------------------------------
>
> Key: ISPN-6322
> URL: https://issues.jboss.org/browse/ISPN-6322
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.CR1, 8.1.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.1.4.Final, 9.0.0.Final, 9.0.0.Alpha1, 8.2.1.Final
>
>
> Normally, the JGroupsTransport startup sequence goes like this:
> # Create the {{Channel}}
> # Create the {{CommandAwareRpcDispatcher}} and install it as an {{UpHandler}}
> # Connect the channel
> This way, every {{RequestCorrelator}} message received by the channel is passed up to {{CommandAwareRpcDispatcher}}, which executes the appropriate command.
> When using a {{JGroupsChannelLookup}}, the lookup implementation is allowed to return a {{Channel}} instance that is already connected ({{shouldConnect() == false}}). That means there is now a window where the channel doesn't have an {{UpHandler}}, and messages sent to this node are discarded.
> Normally a node only receives commands after it sent a join request to the coordinator. There are however a few exceptions:
> # On startup, {{LocalTopologyManagerImpl}} sends the join request to the JGroups coordinator, which may not have the {{UpHandler}} yet. This seems to be responsible for the recent hangs in {{ConcurrentStartTest}}. We have a workaround here: use a smaller timeout on the {{CacheTopologyControlCommand(JOIN)}} command and retry it on {{TimeoutException}}.
> # When a node becomes coordinator, {{ClusterTopologyManagerImpl}} broadcasts a {{GET_STATUS}} request to all cluster members and expects a response from each of them. The same workaround with a smaller timeout and retries might work here.
> # In replicated mode, write commands are broadcast to all cluster members. There is some commented-out code in {{RpcManagerImpl.invokeRemotelyAsync()}} that might fix this by waiting only for responses from the cache topology members.
> We should consider deprecating {{JGroupsChannelLookup.shouldConnect()}} and requiring that the channel is only connected by {{JGroupsTransport}}. Assuming that works with {{ForkChannel}}, of course.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6341) StateTransferManager should be the first component to stop
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6341?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant reopened ISPN-6341:
-----------------------------------
> StateTransferManager should be the first component to stop
> ----------------------------------------------------------
>
> Key: ISPN-6341
> URL: https://issues.jboss.org/browse/ISPN-6341
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 8.2.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 8.2.1.Final, 9.0.0.Alpha1
>
>
> When a cache stops, it first removes the component registry from the {{GlobalComponentsRegistry}}'s {{namedComponents}} map, which means the node (let's call it {{A}}) will reply with a {{CacheNotFoundResponse}} to any remote command.
> Another node {{B}} trying to execute a write/transactional command will receive the {{CacheNotFoundResponse}}, assume that a new cache topology with id {{current topology id + 1}} is coming soon, and wait for that new topology before retrying.
> Normally this is not a problem, because {{StateTransferManagerImpl.stop()}} sends a {{CacheTopologyControlCommand(LEAVE)}} to the coordinator quickly enough, then {{B}} receives the {{current topology id + 1}} topology and retries the command.
> But in some cases, the cache components that stop before {{StateTransferManagerImpl}} can take a long time to do so. In particular, because of {{ISPN-5507}}, {{TransactionTable}} can block for {{cacheStopTimeout}} if there are remote transactions in progress, even though the cache can no longer process remote commands.
> We should give {{StateTransferManagerImpl.stop()}} a priority of {{0}}, so that the {{CacheTopologyControlCommand(LEAVE)}} command is sent as soon as possible.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (ISPN-6357) Deadlock during server start
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-6357?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-6357:
----------------------------------
Fix Version/s: 8.1.4.Final
> Deadlock during server start
> ----------------------------
>
> Key: ISPN-6357
> URL: https://issues.jboss.org/browse/ISPN-6357
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Server
> Affects Versions: 8.2.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 8.2.1.Final, 9.0.0.Alpha1, 8.1.4.Final
>
> Attachments: s0.txt, s1.txt, server1.txt, server2.txt
>
>
> This happens frequently when starting servers in parallel; the more servers, the easier it is to reproduce.
> The stack traces of server1 and server2 after they hang are attached.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)