[JBoss JIRA] (ISPN-9095) NPE during server shutdown when using scattered cache
by Paul Ferraro (JIRA)
[ https://issues.jboss.org/browse/ISPN-9095?page=com.atlassian.jira.plugin.... ]
Paul Ferraro reassigned ISPN-9095:
----------------------------------
Assignee: (was: Paul Ferraro)
> NPE during server shutdown when using scattered cache
> -----------------------------------------------------
>
> Key: ISPN-9095
> URL: https://issues.jboss.org/browse/ISPN-9095
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.1.Final
> Reporter: Paul Ferraro
>
> We hit an NPE when running tests for RFE EAP7-867.
> EAP distribution was built from https://github.com/pferraro/wildfly/tree/scattered .
> Test description: Positive stress test (no failover) on a 4-node EAP cluster, starting with 400 clients and ramping up to 6000 clients by the end of the test.
> During a clean server shutdown at the end of the test, the server logged an NPE and got stuck:
> {code}
> [JBossINF] 07:55:57,643 ERROR [org.infinispan.scattered.impl.ScatteredStateConsumerImpl] (thread-200,ejb,dev214) ISPN000471: Failed processing values received from remote node during rebalance.: java.lang.NullPointerException
> [JBossINF] at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.applyValues(ScatteredStateConsumerImpl.java:505)
> [JBossINF] at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.lambda$getValuesAndApply$8(ScatteredStateConsumerImpl.java:475)
> [JBossINF] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> [JBossINF] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> [JBossINF] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [JBossINF] at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> [JBossINF] at org.infinispan.remoting.transport.AbstractRequest.complete(AbstractRequest.java:66)
> [JBossINF] at org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:56)
> [JBossINF] at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:35)
> [JBossINF] at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:53)
> [JBossINF] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1304)
> [JBossINF] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1207)
> [JBossINF] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$200(JGroupsTransport.java:123)
> [JBossINF] at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.receive(JGroupsTransport.java:1342)
> [JBossINF] at org.jgroups.JChannel.up(JChannel.java:819)
> [JBossINF] at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:134)
> [JBossINF] at org.jgroups.stack.Protocol.up(Protocol.java:340)
> [JBossINF] at org.jgroups.protocols.FORK.up(FORK.java:134)
> [JBossINF] at org.jgroups.protocols.FRAG3.up(FRAG3.java:166)
> [JBossINF] at org.jgroups.protocols.FlowControl.up(FlowControl.java:343)
> [JBossINF] at org.jgroups.protocols.FlowControl.up(FlowControl.java:343)
> [JBossINF] at org.jgroups.protocols.pbcast.GMS.up(GMS.java:864)
> [JBossINF] at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
> [JBossINF] at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1002)
> [JBossINF] at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:728)
> [JBossINF] at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:383)
> [JBossINF] at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:600)
> [JBossINF] at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:119)
> [JBossINF] at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:199)
> [JBossINF] at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:252)
> [JBossINF] at org.jgroups.protocols.MERGE3.up(MERGE3.java:276)
> [JBossINF] at org.jgroups.protocols.Discovery.up(Discovery.java:267)
> [JBossINF] at org.jgroups.protocols.TP.passMessageUp(TP.java:1248)
> [JBossINF] at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
> [JBossINF] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [JBossINF] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [JBossINF] at org.jboss.as.clustering.jgroups.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:52)
> [JBossINF] at java.lang.Thread.run(Thread.java:748)
> [JBossINF]
> {code}
> Scattered cache was configured with bias-lifespan="0".
> Server configuration:
> http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/eap-7x-stress-ses...
> Server link:
> http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/eap-7x-stress-ses...
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
7 years, 11 months
[JBoss JIRA] (ISPN-9094) ArrayIndexOutOfBoundsException on server using scattered cache
by Paul Ferraro (JIRA)
[ https://issues.jboss.org/browse/ISPN-9094?page=com.atlassian.jira.plugin.... ]
Paul Ferraro moved WFLY-10275 to ISPN-9094:
-------------------------------------------
Project: Infinispan (was: WildFly)
Key: ISPN-9094 (was: WFLY-10275)
Workflow: GIT Pull Request with Triage workflow (was: GIT Pull Request workflow )
Component/s: Core
(was: Clustering)
Affects Version/s: 9.2.1.Final
(was: 13.0.0.Beta1)
> ArrayIndexOutOfBoundsException on server using scattered cache
> ---------------------------------------------------------------
>
> Key: ISPN-9094
> URL: https://issues.jboss.org/browse/ISPN-9094
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.1.Final
> Reporter: Paul Ferraro
> Assignee: Paul Ferraro
>
> We hit an ArrayIndexOutOfBoundsException when running tests for RFE EAP7-867.
> EAP distribution was built from {{https://github.com/pferraro/wildfly/tree/scattered}} .
> Test description: Positive stress test (no failover) on a 4-node EAP cluster, starting with 400 clients and ramping up to 6000 clients by the end of the test.
> The error occurred on server dev215 around the 7th iteration (visible in the performance report linked below):
> {code}
> [JBossINF] 04:26:11,708 ERROR [stderr] (transport-thread--p15-t25) Exception in thread "transport-thread--p15-t25" java.lang.ArrayIndexOutOfBoundsException: 129
> [JBossINF] 04:26:11,708 ERROR [stderr] (transport-thread--p15-t25) at org.infinispan.scattered.impl.ScatteredVersionManagerImpl.lambda$tryRegularInvalidations$4(ScatteredVersionManagerImpl.java:413)
> [JBossINF] 04:26:11,708 ERROR [stderr] (transport-thread--p15-t25) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [JBossINF] 04:26:11,708 ERROR [stderr] (transport-thread--p15-t25) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [JBossINF] 04:26:11,708 ERROR [stderr] (transport-thread--p15-t25) at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
> [JBossINF] 04:26:11,708 ERROR [stderr] (transport-thread--p15-t25) at java.lang.Thread.run(Thread.java:748)
> {code}
> Clients were getting "SocketTimeoutException: Read timed out" exceptions both before and after the ArrayIndexOutOfBoundsException occurred.
> Performance report (accessible only when connected to VPN):
> http://download.eng.brq.redhat.com/scratch/mvinkler/reports/2018-04-19_15...
> One can observe that CPU usage and network usage on dev215 dropped after the 7th iteration.
> dev215 server log link:
> https://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/eap-7x-stress-se...
[JBoss JIRA] (ISPN-8962) PreferAvailabilityStrategy: Rely less on the stable topology
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-8962?page=com.atlassian.jira.plugin.... ]
Dan Berindei reopened ISPN-8962:
--------------------------------
I changed the algorithm to prefer the topology with the higher topology id when 2 topologies are overlapping, and the topology with the most members when they are completely independent.
I kept the overlapping topologies with a lower topology id and with at least {{numOwners}} extra members for conflict resolution, because it signals the partition with the highest topology id might have lost some values that still exist in the nodes with the lower topology id.
The problem is that this older topology has the most members and ends up being the preferred topology. It's not a big issue when conflict resolution is enabled, but when conflict resolution is disabled it means different nodes will see different values.
This is actually shown by test {{PreferAvailabilityStrategyTest#testMerge1Paused2StableAfterLosingAnotherNode}}, but at the time I wrote it I was only trying to prove that conflict resolution would happen (if enabled).
1. Start with cluster ABC
2. A was paused and keeps the stable topology
3. B and C finished rebalancing, then B was paused
4. Now A has resumed and merges with C
5. The preferred topology should be [C] (from C), but it's [ABC] (from A)
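
The selection rule described in this comment can be sketched roughly as follows. This is only an illustration, not the actual PreferAvailabilityStrategy code: the Topology class, the prefer method, and the topology ids 1 and 4 are hypothetical, and the sketch ignores the extra topologies kept for conflict resolution. It shows the outcome the scenario above expects: the topologies from A and C overlap, so the one with the higher topology id should win.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PreferredTopologySketch {
    /** Hypothetical stand-in for a cache topology: an id plus its members. */
    public static final class Topology {
        public final int id;
        public final List<String> members;

        public Topology(int id, List<String> members) {
            this.id = id;
            this.members = members;
        }
    }

    /** Two topologies overlap when they share at least one member. */
    public static boolean overlaps(Topology a, Topology b) {
        return !Collections.disjoint(a.members, b.members);
    }

    /**
     * Overlapping topologies: prefer the higher topology id.
     * Completely independent topologies: prefer the one with the most members.
     */
    public static Topology prefer(Topology a, Topology b) {
        if (overlaps(a, b)) {
            return a.id >= b.id ? a : b;
        }
        return a.members.size() >= b.members.size() ? a : b;
    }

    public static void main(String[] args) {
        // A was paused and still reports the original topology (assumed id 1).
        Topology fromA = new Topology(1, Arrays.asList("A", "B", "C"));
        // C went through the rebalance and the loss of B (assumed id 4).
        Topology fromC = new Topology(4, Collections.singletonList("C"));

        System.out.println(prefer(fromA, fromC).members); // prints [C]
    }
}
```

With only these two rules the merge of A and C picks [C]. The problem reported here appears once the older, larger topology is also retained for conflict resolution and is then compared by member count.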
> PreferAvailabilityStrategy: Rely less on the stable topology
> ------------------------------------------------------------
>
> Key: ISPN-8962
> URL: https://issues.jboss.org/browse/ISPN-8962
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 9.2.0.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 9.2.2.Final, 9.3.0.Alpha1
>
>
> {{PreferAvailabilityStrategy}} checks the size of the stable topology, and only considers cache topologies that are derived from the biggest topology (in size) when picking a post-merge topology.
> Unfortunately, in some situations this algorithm fails pretty badly. If a node has a very long GC pause, when it comes back it will report the old topology *and* the old stable topology. If the rest of the cluster rebalanced, it now has both a smaller current topology and a smaller stable topology.
> Furthermore, the stable topology is updated asynchronously, independent from the current topology. So even if there's a split and the minority partition installs a current topology with fewer members, it may take some time for its stable topology to be updated with fewer members. In fact, it appears that when a rebalance is not needed (e.g. because the partition has a single node), the stable topology is never updated!
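
As a rough illustration of the failure described in the issue, the sketch below filters node reports by stable topology size and keeps only the largest. It is a simplification under stated assumptions: the Report class and the candidates method are hypothetical, and comparing stable topology sizes directly stands in for the real "derived from the biggest topology" check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StableTopologyFilterSketch {
    /** Hypothetical status report sent by a node during a merge. */
    public static final class Report {
        public final String sender;
        public final List<String> current;
        public final List<String> stable;

        public Report(String sender, List<String> current, List<String> stable) {
            this.sender = sender;
            this.current = current;
            this.stable = stable;
        }
    }

    /** Keep only the reports whose stable topology has the maximum size. */
    public static List<Report> candidates(List<Report> reports) {
        int max = reports.stream().mapToInt(r -> r.stable.size()).max().orElse(0);
        return reports.stream()
                .filter(r -> r.stable.size() == max)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> abc = Arrays.asList("A", "B", "C");
        List<String> bc = Arrays.asList("B", "C");

        // A returns from a long GC pause with the old current and stable topologies.
        Report fromA = new Report("A", abc, abc);
        // B and C rebalanced without A, so both of their topologies are smaller.
        Report fromB = new Report("B", bc, bc);
        Report fromC = new Report("C", bc, bc);

        List<String> senders = candidates(Arrays.asList(fromA, fromB, fromC)).stream()
                .map(r -> r.sender)
                .collect(Collectors.toList());
        System.out.println(senders); // prints [A]
    }
}
```

Only the stale report from A survives the filter, even though B and C hold the more recent state.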