[JBoss JIRA] (ISPN-4484) Outbound transfers can be cancelled by old CANCEL_STATE_TRANSFER command
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4484?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4484:
-----------------------------------------------
Dave Stahl <dstahl(a)redhat.com> changed the Status of [bug 1104045|https://bugzilla.redhat.com/show_bug.cgi?id=1104045] from MODIFIED to ON_QA
> Outbound transfers can be cancelled by old CANCEL_STATE_TRANSFER command
> ------------------------------------------------------------------------
>
> Key: ISPN-4484
> URL: https://issues.jboss.org/browse/ISPN-4484
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public(Everyone can see)
> Components: Core, State Transfer
> Affects Versions: 6.0.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 7.0.0.Alpha5
>
>
> This appeared during the 32-node elasticity test in the Hyperion environment.
> Just as apex947 left, it started a rebalance, which apex948 dutifully cancelled when it became the new coordinator. apex949 had already requested segments from apex959, so it sent a StateRequestCommand(CANCEL_STATE_TRANSFER) asynchronously to apex959. Then apex948 started a new rebalance, and apex949 asked apex959 for the same segments. When apex959 finally received the cancel request, it didn't check the topology id and incorrectly cancelled the outbound transfer to apex949.
> The solution would be to verify the topology id in the CANCEL_STATE_TRANSFER command before cancelling the transfer. I also think we can avoid sending the cancel command entirely in this case, and only send it when the node is about to stop.
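A minimal sketch of the proposed topology-id check (the class and method names below are hypothetical, not the actual Infinispan internals): the state provider remembers the topology id of the rebalance that requested each outbound transfer and ignores a CANCEL_STATE_TRANSFER that carries an older id.
{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the real Infinispan classes: the provider (apex959)
// records the topology id each outbound transfer was requested in, and a
// CANCEL_STATE_TRANSFER with an older topology id is treated as stale.
class OutboundTransferRegistry {
   static final class Transfer {
      final int topologyId;
      volatile boolean cancelled;
      Transfer(int topologyId) { this.topologyId = topologyId; }
   }

   private final Map<String, Transfer> transfersByRequestor = new ConcurrentHashMap<>();

   void onStateRequest(String requestor, int topologyId) {
      // The requestor (apex949) re-asks for the same segments with the new,
      // higher topology id of the second rebalance.
      transfersByRequestor.put(requestor, new Transfer(topologyId));
   }

   void onCancelStateTransfer(String requestor, int cancelTopologyId) {
      Transfer t = transfersByRequestor.get(requestor);
      if (t == null || cancelTopologyId < t.topologyId) {
         // Stale cancel from the previous rebalance: keep the current transfer.
         return;
      }
      t.cancelled = true;
      transfersByRequestor.remove(requestor);
   }
}
{noformat}
Since topology ids only ever increase, a cancel command from an older rebalance can never affect a transfer started for a newer one.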
--
This message was sent by Atlassian JIRA
(v6.2.6#6264)
[JBoss JIRA] (ISPN-4471) MapReduceTask: memory leak with useIntermediateSharedCache = true
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4471?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4471:
-----------------------------------------------
Dave Stahl <dstahl(a)redhat.com> changed the Status of [bug 1116979|https://bugzilla.redhat.com/show_bug.cgi?id=1116979] from MODIFIED to ON_QA
> MapReduceTask: memory leak with useIntermediateSharedCache = true
> -----------------------------------------------------------------
>
> Key: ISPN-4471
> URL: https://issues.jboss.org/browse/ISPN-4471
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public(Everyone can see)
> Components: Distributed Execution and Map/Reduce
> Affects Versions: 6.0.2.Final
> Reporter: Rich DiCroce
> Assignee: Vladimir Blagojevic
> Fix For: 7.0.0.Alpha5
>
>
> When using an intermediate shared cache for the reduce phase, MapReduceTask puts the entries into the cache with no expiration and apparently never removes them. This eventually results in OutOfMemoryErrors.
> One workaround is to disable the intermediate shared cache, so that a new cache is created and destroyed for every task; this "fixes" the problem of intermediate values never being removed, but it causes a ton of log spam:
> {noformat}
> 2014-07-02 11:55:10,014 INFO [org.infinispan.jmx.CacheJmxRegistration] (transport-thread-21) ISPN000031: MBeans were successfully registered to the platform MBean server.
> 2014-07-02 11:55:10,016 INFO [org.jboss.as.clustering.infinispan] (transport-thread-21) JBAS010281: Started e71dddc0-60ce-4cb9-ac8c-615d60866393 cache from GamingPortal container
> 2014-07-02 11:55:10,023 INFO [org.infinispan.jmx.CacheJmxRegistration] (transport-thread-5) ISPN000031: MBeans were successfully registered to the platform MBean server.
> 2014-07-02 11:55:10,024 INFO [org.infinispan.jmx.CacheJmxRegistration] (transport-thread-4) ISPN000031: MBeans were successfully registered to the platform MBean server.
> 2014-07-02 11:55:10,025 INFO [org.jboss.as.clustering.infinispan] (transport-thread-5) JBAS010281: Started 22d387d6-69c6-48b2-9701-ea64c08d66ad cache from GamingPortal container
> 2014-07-02 11:55:10,026 INFO [org.jboss.as.clustering.infinispan] (transport-thread-4) JBAS010281: Started bfaf92a0-a030-4624-93a7-0fee097415d7 cache from NMS container
> 2014-07-02 11:55:10,037 INFO [org.jboss.as.clustering.infinispan] (EJB default - 2) JBAS010282: Stopped 22d387d6-69c6-48b2-9701-ea64c08d66ad cache from GamingPortal container
> 2014-07-02 11:55:10,040 INFO [org.jboss.as.clustering.infinispan] (EJB default - 1) JBAS010282: Stopped bfaf92a0-a030-4624-93a7-0fee097415d7 cache from NMS container
> 2014-07-02 11:55:10,047 INFO [org.jboss.as.clustering.infinispan] (EJB default - 6) JBAS010282: Stopped e71dddc0-60ce-4cb9-ac8c-615d60866393 cache from GamingPortal container
> 2014-07-02 11:55:10,047 INFO [org.infinispan.jmx.CacheJmxRegistration] (transport-thread-0) ISPN000031: MBeans were successfully registered to the platform MBean server.
> 2014-07-02 11:55:10,048 INFO [org.jboss.as.clustering.infinispan] (transport-thread-0) JBAS010281: Started bed74bd3-a227-43e0-b262-62c19dd444a7 cache from GamingPortal container
> 2014-07-02 11:55:10,052 INFO [org.jboss.as.clustering.infinispan] (EJB default - 2) JBAS010282: Stopped bed74bd3-a227-43e0-b262-62c19dd444a7 cache from GamingPortal container
> 2014-07-02 11:55:10,063 INFO [org.infinispan.jmx.CacheJmxRegistration] (transport-thread-7) ISPN000031: MBeans were successfully registered to the platform MBean server.
> 2014-07-02 11:55:10,064 INFO [org.jboss.as.clustering.infinispan] (transport-thread-7) JBAS010281: Started 63cce570-0169-40c2-bc9f-e045c2864702 cache from GamingPortal container
> 2014-07-02 11:55:10,068 INFO [org.jboss.as.clustering.infinispan] (EJB default - 2) JBAS010282: Stopped 63cce570-0169-40c2-bc9f-e045c2864702 cache from GamingPortal container
> 2014-07-02 11:55:10,072 INFO [org.infinispan.jmx.CacheJmxRegistration] (transport-thread-19) ISPN000031: MBeans were successfully registered to the platform MBean server.
> 2014-07-02 11:55:10,073 INFO [org.jboss.as.clustering.infinispan] (transport-thread-19) JBAS010281: Started 83f7b355-d4c6-4a0a-aade-ce2509293d77 cache from GamingPortal container
> 2014-07-02 11:55:10,077 INFO [org.jboss.as.clustering.infinispan] (EJB default - 2) JBAS010282: Stopped 83f7b355-d4c6-4a0a-aade-ce2509293d77 cache from GamingPortal container
> {noformat}
> I also observed one NullPointerException with distributeReducePhase = true and useIntermediateSharedCache = false. This could be related to ISPN-4460, but I'm not sure.
> {noformat}
> Caused by: org.infinispan.commons.CacheException: java.util.concurrent.ExecutionException: org.infinispan.commons.CacheException: java.lang.NullPointerException
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:348) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:634) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask$3.call(MapReduceTask.java:652) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask$MapReduceTaskFuture.get(MapReduceTask.java:760) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> ... 63 more
> Caused by: java.util.concurrent.ExecutionException: org.infinispan.commons.CacheException: java.lang.NullPointerException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122) [rt.jar:1.7.0_45]
> at java.util.concurrent.FutureTask.get(FutureTask.java:188) [rt.jar:1.7.0_45]
> at org.infinispan.distexec.mapreduce.MapReduceTask$TaskPart.get(MapReduceTask.java:845) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask.executeMapPhase(MapReduceTask.java:439) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:342) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> ... 66 more
> Caused by: org.infinispan.commons.CacheException: java.lang.NullPointerException
> at org.infinispan.distexec.mapreduce.MapReduceManagerImpl.mapAndCombineForDistributedReduction(MapReduceManagerImpl.java:100) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask$MapTaskPart.invokeMapCombineLocally(MapReduceTask.java:967) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask$MapTaskPart.access$200(MapReduceTask.java:894) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask$MapTaskPart$1.call(MapReduceTask.java:916) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceTask$MapTaskPart$1.call(MapReduceTask.java:912) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_45]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_45]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_45]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
> at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]
> Caused by: java.lang.NullPointerException
> at org.infinispan.distexec.mapreduce.MapReduceManagerImpl.mapKeysToNodes(MapReduceManagerImpl.java:355) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceManagerImpl.migrateIntermediateKeys(MapReduceManagerImpl.java:264) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceManagerImpl.combine(MapReduceManagerImpl.java:258) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> at org.infinispan.distexec.mapreduce.MapReduceManagerImpl.mapAndCombineForDistributedReduction(MapReduceManagerImpl.java:98) [infinispan-core-6.0.2.Final.jar:6.0.2.Final]
> ... 10 more
> {noformat}
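For reference, a sketch of the workaround described above, assuming the Infinispan 6.x MapReduceTask constructor that takes (cache, distributeReducePhase, useIntermediateSharedCache); the helper class below is purely illustrative:
{noformat}
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.distexec.mapreduce.MapReduceTask;
import org.infinispan.distexec.mapreduce.Mapper;
import org.infinispan.distexec.mapreduce.Reducer;

// Workaround sketch: with useIntermediateSharedCache = false each task gets a
// private intermediate cache that is destroyed when the task finishes, so the
// intermediate entries disappear with it -- at the cost of the per-task cache
// start/stop log spam shown in the log excerpt above.
final class MapReduceWithPrivateIntermediateCache {
   static <KIn, VIn, KOut, VOut> Map<KOut, VOut> run(Cache<KIn, VIn> cache,
                                                     Mapper<KIn, VIn, KOut, VOut> mapper,
                                                     Reducer<KOut, VOut> reducer) {
      MapReduceTask<KIn, VIn, KOut, VOut> task =
            new MapReduceTask<>(cache, /* distributeReducePhase */ true,
                                /* useIntermediateSharedCache */ false);
      return task.mappedWith(mapper).reducedWith(reducer).execute();
   }
}
{noformat}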
--
This message was sent by Atlassian JIRA
(v6.2.6#6264)
[JBoss JIRA] (ISPN-4480) Messages sent to leavers can clog the JGroups bundler thread
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4480?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4480:
-----------------------------------------------
Dave Stahl <dstahl(a)redhat.com> changed the Status of [bug 1104045|https://bugzilla.redhat.com/show_bug.cgi?id=1104045] from MODIFIED to ON_QA
> Messages sent to leavers can clog the JGroups bundler thread
> ------------------------------------------------------------
>
> Key: ISPN-4480
> URL: https://issues.jboss.org/browse/ISPN-4480
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public(Everyone can see)
> Components: Core
> Affects Versions: 6.0.2.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
>
> In a stress test that repeatedly kills nodes while performing read/write operations, the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:
> {noformat}
> 06:40:10,316 WARN [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
> java.lang.Thread.sleep(Native Method)
> org.jgroups.util.Util.sleep(Util.java:1504)
> org.jgroups.util.Util.sleepRandom(Util.java:1574)
> org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
> org.jgroups.protocols.TP.doSend(TP.java:1670)
> org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
> org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
> org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
> java.lang.Thread.run(Thread.java:744)
> {noformat}
> There are two bugs related to this that are already fixed in JGroups 3.5.0.Beta2+: JGRP-1814 and JGRP-1815.
> There is also a special case where the physical address can be removed from the cache too soon, exacerbating the effect of JGRP-1815: JGRP-1858.
> We can work around the problem by changing the JGroups configuration (see the configuration sketch after this list):
> * TP.logical_addr_cache_expiration=86400000
> ** Only expire addresses after 1 day
> * TP.physical_addr_max_fetch_attempts=1
> ** Sleep for only 20ms waiting for the physical address (default: 3 attempts, up to 1500ms)
> * UNICAST3.conn_close_timeout=10000
> ** Drop the pending messages to leavers sooner
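A sketch of how those three settings could look in a JGroups 3.x XML stack; the transport element may be UDP or TCP depending on the stack in use, and all other protocols and attributes of the existing configuration are omitted here:
{noformat}
<!-- Workaround sketch only: merge these attributes into the existing stack. -->
<config xmlns="urn:org:jgroups">
    <UDP logical_addr_cache_expiration="86400000"
         physical_addr_max_fetch_attempts="1"/>
    <!-- ... other protocols of the existing stack ... -->
    <UNICAST3 conn_close_timeout="10000"/>
</config>
{noformat}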
--
This message was sent by Atlassian JIRA
(v6.2.6#6264)
[JBoss JIRA] (ISPN-4440) Remove setMaxCollectorSize API from MapReduceTask
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4440?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4440:
-----------------------------------------------
Dave Stahl <dstahl(a)redhat.com> changed the Status of [bug 1116979|https://bugzilla.redhat.com/show_bug.cgi?id=1116979] from MODIFIED to ON_QA
> Remove setMaxCollectorSize API from MapReduceTask
> -------------------------------------------------
>
> Key: ISPN-4440
> URL: https://issues.jboss.org/browse/ISPN-4440
> Project: Infinispan
> Issue Type: Task
> Security Level: Public(Everyone can see)
> Components: Distributed Execution and Map/Reduce
> Affects Versions: 7.0.0.Alpha4
> Reporter: Vladimir Blagojevic
> Assignee: Vladimir Blagojevic
> Priority: Minor
> Fix For: 7.0.0.Alpha5
>
>
> While refining the parallel execution of the M/R algorithm we introduced a maxCollectorSize abstraction at the level of MapReduceTask. The idea was that during the map/combine phase the number of intermediate keys/values collected in a Collector could potentially become very large. By limiting the size of the collector, intermediate keys/values are transferred to the intermediate cache in batches before the reduce phase is executed, which also avoids OutOfMemoryError issues.
> However, during extensive performance testing Alan Field, Dan Berindei and I concluded that a maxCollectorSize of 10000 entries gives the best trade-off between performance and memory use. Therefore there is no need to expose this value to MapReduceTask users.
> Having said that, there might be some use cases where holding 10000 large intermediate objects still leads to OOM; in such cases users should simply allocate more heap. We might consider reintroducing this API should such a need arise.
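To illustrate the bounded-collector idea described above (all names below are hypothetical; this is not Infinispan's actual implementation): map/combine output is buffered locally and flushed to the intermediate cache whenever maxCollectorSize entries have accumulated, so the buffer never grows unbounded.
{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a collector bounded by maxCollectorSize; the
// intermediate cache is represented by a plain Map here.
final class BoundedCollector<KOut, VOut> {
   private final int maxCollectorSize;                     // e.g. 10000
   private final Map<KOut, List<VOut>> buffer = new HashMap<>();
   private final Map<KOut, List<VOut>> intermediateCache;  // stand-in for the intermediate cache
   private int size;

   BoundedCollector(int maxCollectorSize, Map<KOut, List<VOut>> intermediateCache) {
      this.maxCollectorSize = maxCollectorSize;
      this.intermediateCache = intermediateCache;
   }

   void emit(KOut key, VOut value) {
      buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
      if (++size >= maxCollectorSize) {
         flush();
      }
   }

   void flush() {
      // Transfer the buffered intermediate values in one batch and reset the buffer.
      for (Map.Entry<KOut, List<VOut>> e : buffer.entrySet()) {
         intermediateCache.merge(e.getKey(), e.getValue(),
               (existing, added) -> { existing.addAll(added); return existing; });
      }
      buffer.clear();
      size = 0;
   }
}
{noformat}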
--
This message was sent by Atlassian JIRA
(v6.2.6#6264)
[JBoss JIRA] (ISPN-4490) Members can miss the rebalance cancellation on coordinator change
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4490?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4490:
-----------------------------------------------
Dave Stahl <dstahl(a)redhat.com> changed the Status of [bug 1117948|https://bugzilla.redhat.com/show_bug.cgi?id=1117948] from NEW to ON_QA
> Members can miss the rebalance cancellation on coordinator change
> -----------------------------------------------------------------
>
> Key: ISPN-4490
> URL: https://issues.jboss.org/browse/ISPN-4490
> Project: Infinispan
> Issue Type: Bug
> Security Level: Public(Everyone can see)
> Components: Core, State Transfer
> Affects Versions: 7.0.0.Alpha4
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 7.0.0.Alpha5
>
>
> The new coordinator first sends a CH_UPDATE command to cancel the existing rebalance, and then a REBALANCE_START command to start a new rebalance. But the CH_UPDATE command is sent asynchronously, so it's possible for some members to receive it after the REBALANCE_START command.
> If that happens, such a node will assume that it will still receive the segments it requested for the previous rebalance. But with the ISPN-4484 fix, the provider node cancels the outbound transfer tasks when receiving a CH_UPDATE without a pendingCH, so the state requestor will never receive its segments.
> Even without the ISPN-4484 fix this is a problem, although a less obvious one. Between receiving the CH_UPDATE and the REBALANCE_START commands, the provider node won't have the requestor in its write CH, so the requestor can miss transactions.
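A common way to make such reordered, asynchronously delivered topology commands safe is for each member to track the highest topology id it has applied and discard anything older. The sketch below uses hypothetical names and is not the actual Infinispan code; it only illustrates the ordering guard.
{noformat}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: CH_UPDATE and REBALANCE_START both carry a topology id,
// and a member drops any command whose id is not newer than what it applied.
final class TopologyUpdateFilter {
   private final AtomicInteger lastAppliedTopologyId = new AtomicInteger(-1);

   /** @return true if the command should be applied, false if it is stale. */
   boolean accept(int commandTopologyId) {
      while (true) {
         int current = lastAppliedTopologyId.get();
         if (commandTopologyId <= current) {
            // A newer topology (e.g. the REBALANCE_START) was already applied,
            // so the late CH_UPDATE must not cancel the new rebalance.
            return false;
         }
         if (lastAppliedTopologyId.compareAndSet(current, commandTopologyId)) {
            return true;
         }
      }
   }
}
{noformat}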
--
This message was sent by Atlassian JIRA
(v6.2.6#6264)