Re: [infinispan-dev] MapReduce limitations and suggestions.
by Evangelos Vazaios
On 02/17/2014 10:42 AM, infinispan-dev-request(a)lists.jboss.org wrote:
> Hi Etienne
>
> I was going to suggest using a combiner - the combiner would process the
> mapper results from just one node, so you should need at most double the
> memory on that node. I guess we could reduce the memory requirements even
> more if the combiner could run concurrently with the mapper... Vladimir,
> does it sound like a reasonable feature request?
>
There are algorithms where combiners cannot be applied.
> I'm afraid in your situation using a cache store wouldn't help, as the
> intermediate values for the same key are stored as a list in a single
> entry. So if all cars are red, there would be just one intermediate key in
> the intermediate cache, and there would be nothing to evict to the cache
> store. Vladimir, do you think we could somehow "chunk" the intermediary
> values into multiple entries grouped by the intermediary key, to support
> this scenario?
>
I was thinking of a custom cache implementation that tracks the overall
size of the cache and the size of each key individually, and spills
entries to disk when a threshold is reached. Note that I am not familiar
with the internals of Infinispan, but I think it is doable. Such a cache
solves the problem in both cases (when one key is too large to fit in
memory, as in my example, and when the keys assigned to one reducer
exceed its memory).
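To make the idea concrete, here is a minimal stdlib-only sketch of such a size-tracking, disk-spilling store. This is not Infinispan code; the class name, the per-key byte accounting, and the "spill the largest key" policy are all illustrative assumptions:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

// Hypothetical sketch of a store that tracks the in-memory size of each key's
// intermediate value list and spills the largest list to disk once a global
// threshold is crossed.
class SpillingStore {
    private final long thresholdBytes;
    private final Map<String, List<String>> inMemory = new HashMap<>();
    private final Map<String, Long> bytesPerKey = new HashMap<>();
    private final Map<String, Path> spilled = new HashMap<>();
    private long totalBytes;

    SpillingStore(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Append one intermediate value for a key, spilling if over the threshold.
    void append(String key, String value) {
        inMemory.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        long delta = value.getBytes(StandardCharsets.UTF_8).length;
        bytesPerKey.merge(key, delta, Long::sum);
        totalBytes += delta;
        while (totalBytes > thresholdBytes && !inMemory.isEmpty()) {
            spillLargestKey();
        }
    }

    // Move the key with the largest in-memory footprint to a temp file.
    private void spillLargestKey() {
        String victim = Collections.max(bytesPerKey.keySet(),
                Comparator.comparing(bytesPerKey::get));
        try {
            Path file = Files.createTempFile("spill-" + victim, ".tmp");
            Files.write(file, inMemory.remove(victim));
            spilled.put(victim, file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        totalBytes -= bytesPerKey.remove(victim);
    }

    boolean isSpilled(String key) { return spilled.containsKey(key); }
    long totalInMemoryBytes() { return totalBytes; }
}
```

With a 10-byte threshold, appending 10 bytes of values under one key ("all cars are red") forces that key's list onto disk while smaller keys stay in memory.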
> For reference, though, a limited version of what you're asking for is
> already available. You can change the configuration of the intermediary
> cache by defining a "__tmpMapReduce" cache in your configuration. That
> configuration will be used for all M/R tasks, whether they use the shared
> intermediate cache or they create their own.
>
I have one question about this. If I start two MR tasks at once, will
these tasks use the same cache? If so, will the intermediate results be
mixed? This cache could also be useful as a test case.
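For reference, the override mentioned above works by defining a cache with the reserved name "__tmpMapReduce". A hedged sketch of what such a definition might look like (Infinispan 6.x-style XML; the cache name comes from the message above, while the eviction and store settings are illustrative assumptions):

```xml
<!-- Illustrative only: override the shared intermediate cache used by M/R
     tasks. The eviction and persistence values are assumptions, not a
     recommendation from this thread. -->
<namedCache name="__tmpMapReduce">
   <clustering mode="dist"/>
   <eviction strategy="LRU" maxEntries="10000"/>
   <persistence passivation="true">
      <singleFile location="/tmp/mapreduce-spill"/>
   </persistence>
</namedCache>
```

As Dan notes, this helps only when the intermediate values are spread over many keys; a single huge per-key list cannot be passivated piecemeal.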
Regards,
Evangelos
> Cheers
> Dan
>
>
>
> On Mon, Feb 17, 2014 at 10:18 AM, Etienne Riviere
> <etienne.riviere(a)unine.ch>wrote:
>
>> > Hi Radim,
>> >
>> > I might misunderstand your suggestion, but many M/R jobs actually require
>> > running the two phases one after the other, and hence storing the
>> > intermediate results somewhere. While some may slightly reduce intermediate
>> > memory usage by using a combiner function (e.g., the word-count example), I
>> > don't see how we can avoid intermediate storage altogether.
>> >
>> > Thanks,
>> > Etienne (LEADS project -- as is Evangelos, who initiated the thread)
>> >
>> > On 17 Feb 2014, at 08:48, Radim Vansa <rvansa(a)redhat.com> wrote:
>> >
>>> > > I think that the intermediate cache is not required at all. The M/R
>>> > > algorithm itself can (and should!) run with memory occupied by the
>>> > > result of the reduction. The current implementation, with Map first and
>>> > > Reduce after that, will always have these problems; using a cache for
>>> > > temporarily storing the result is only a workaround.
>>> > >
>>> > > The only situation when a temporary cache could be useful is when the
>>> > > result grows linearly (or close to that, or even more) with the number of
>>> > > reduced entries. This would be the case for groupBy producing Map<Color,
>>> > > List<Entry>> from all entries in the cache. Then the task does not scale
>>> > > and should be redesigned anyway, but flushing the results into a cache
>>> > > backed by a cache store could help.
>>> > >
>>> > > Radim
>>> > >
>>> > > On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>>>> > >> Tristan,
>>>> > >>
>>>> > >> Actually they are not addressed in this pull request, but the feature
>>>> > >> where a custom output cache is used instead of results being returned is
>>>> > >> next in the implementation pipeline.
>>>> > >>
>>>> > >> Evangelos, indeed, depending on the reducer function all intermediate
>>>> > >> KOut/VOut pairs might be moved to a single node. How would a custom cache
>>>> > >> help in this case?
>>>> > >>
>>>> > >> Regards,
>>>> > >> Vladimir
>>>> > >>
>>>> > >>
>>>> > >> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>>>> > >>> Hi Evangelos,
>>>>> > >>>
>>>>> > >>> you might be interested in looking into a current pull request which
>>>>> > >>> addresses some (all?) of these issues
>>>>> > >>>
>>>>> > >>> https://github.com/infinispan/infinispan/pull/2300
>>>>> > >>>
>>>>> > >>> Tristan
>>>>> > >>>
>>>>> > >>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>>>> > >>>> Hello everyone,
>>>>>> > >>>>
>>>>>> > >>>> I started using the MapReduce implementation of Infinispan and I came
>>>>>> > >>>> across some possible limitations. Thus, I want to make some
>>>>>> > >>>> suggestions about the MapReduce (MR) implementation of Infinispan.
>>>>>> > >>>> Depending on the algorithm, there might be some memory problems,
>>>>>> > >>>> especially for intermediate results.
>>>>>> > >>>> An example of such a case is group by. Suppose that we have a cluster
>>>>>> > >>>> of 2 nodes with 2 GB available each, and a distributed cache where
>>>>>> > >>>> simple car objects (id, brand, colour) are stored, with a total data
>>>>>> > >>>> size of 3.5 GB. If all objects have the same colour, then all 3.5 GB
>>>>>> > >>>> would go to only one reducer, and as a result an OutOfMemoryException
>>>>>> > >>>> will be thrown.
>>>>>> > >>>>
>>>>>> > >>>> To overcome these limitations, I propose to add as a parameter the
>>>>>> > >>>> name of the intermediate cache to be used. This will enable the
>>>>>> > >>>> creation of a custom configured cache that deals with the memory
>>>>>> > >>>> limitations.
>>>>>> > >>>>
>>>>>> > >>>> Another feature that I would like to have is to set the name of the
>>>>>> > >>>> output cache. The reasoning behind this is similar to the one
>>>>>> > >>>> mentioned above.
>>>>>> > >>>>
>>>>>> > >>>> I await your thoughts on these two suggestions.
>>>>>> > >>>>
>>>>>> > >>>> Regards,
>>>>>> > >>>> Evangelos
>>>>>> > >>>> _______________________________________________
>>>>>> > >>>> infinispan-dev mailing list
>>>>>> > >>>> infinispan-dev(a)lists.jboss.org
>>>>>> > >>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>> > >>>>
>>>>>> > >>>>
>>>>> > >>>
>>>>> > >>>
>>> > >
>>> > >
>>> > > --
>>> > > Radim Vansa <rvansa(a)redhat.com>
>>> > > JBoss DataGrid QA
>>> > >
HotRod near caches
by Tristan Tarrant
Hi people,
this is a bit of a dump of ideas for getting our HotRod client in shape
for supporting near caches:
- RemoteCaches should have an optional internal cache. This cache should
probably be some form of bounded, expiration-aware hashmap which would
serve as a local copy of data retrieved over the wire. In the past we
have advocated combining an EmbeddedCacheManager with a
RemoteCacheStore to achieve this, but that is only applicable to Java
clients; we need to think of a solution for our other clients too.
- Once remote listeners are in place, a RemoteCache would automatically
invalidate entries in the near-cache.
- Remote Query should "pass through" the near cache, so that entries
retrieved from a query would essentially be cached locally following the
same semantics. This can be achieved by having the QUERY verb return
just the set of matching keys instead of the whole entries.
- Optionally, we could even think about a query cache which would hash the
query DSL and store the resulting keys locally, so that successive
invocations of a cached query wouldn't go over the wire. Matching
this with invalidation is probably a tad more complex, and I'd probably
avoid going down that path.
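A minimal stdlib-only sketch of the "bounded expiration-aware hashmap" mentioned in the first point (not actual HotRod client code; the class name, LRU bound, and single TTL are illustrative assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical near-cache sketch: size-bounded LRU map with per-entry
// expiration, invalidated externally when a remote listener fires.
class NearCache<K, V> {
    private static final class Timestamped<T> {
        final T value;
        final long expiresAt;
        Timestamped(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final int maxEntries;
    private final long ttlMillis;
    private final LinkedHashMap<K, Timestamped<V>> map;

    NearCache(int maxEntries, long ttlMillis) {
        this.maxEntries = maxEntries;
        this.ttlMillis = ttlMillis;
        // Access-order LinkedHashMap: evicts the least recently used entry
        // once the bound is exceeded.
        this.map = new LinkedHashMap<K, Timestamped<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timestamped<V>> eldest) {
                return size() > NearCache.this.maxEntries;
            }
        };
    }

    synchronized void put(K key, V value) {
        map.put(key, new Timestamped<>(value, System.currentTimeMillis() + ttlMillis));
    }

    synchronized V get(K key) {
        Timestamped<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAt) {
            map.remove(key);   // lazily expire stale entries on read
            return null;
        }
        return e.value;
    }

    // Called when a remote listener reports a server-side modification/removal.
    synchronized void invalidate(K key) { map.remove(key); }

    synchronized int size() { return map.size(); }
}
```

The `invalidate` method is the hook the second point needs: a remote-listener callback would simply call it for each changed key.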
Tristan
Remote Query improvements
by Tristan Tarrant
Hi everybody,
last week I developed a simple application using Remote Query, and ran
into a few issues. Some of them are just technical hurdles, while others
have to do with the complexity of the developer experience. Here they
are for open discussion:
- the schema registry should be persistent. Alternatively, we should be able
to either specify the ProtoBuf schema from the <indexing /> configuration
in the server subsystem or use the server's deployment processor to "deploy"
schemas.
- the server should store the individual protobuf source schemas, to allow
easy inspection/update of each using our management tools. The
server itself should then compile the protobuf schemas into the binary
representation whenever any of the source schemas changes. This would
require a Java implementation of the ProtoBuf schema compiler, which
probably wouldn't be too hard to do with Antlr.
- we need to be able to annotate individual protobuf fields for indexing
(probably by using specially-formatted comments, a la doclets) to avoid
indexing all of the fields
- since remote query is already imbued with JPA in some form, an
interesting project would be to implement a JPA annotation processor
which can produce a set of ProtoBuf schemas from JPA-annotated classes.
- on top of the above, a ProtoBuf marshaller/unmarshaller which can use
the JPA entities directly.
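To illustrate the doclet-style idea from the third point, a hypothetical schema fragment (the comment syntax shown here is invented for illustration; no concrete syntax is proposed in this message):

```protobuf
// Hypothetical doclet-style index annotations in comments:
/* @Indexed */
message Car {
   /* @IndexedField */
   required string brand = 1;

   /* @IndexedField */
   optional string colour = 2;

   // no annotation: excluded from the index
   optional int32 id = 3;
}
```

A comment-based scheme keeps the schema valid for any stock protobuf compiler while letting the server's compiler extract the indexing metadata.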
Tristan
Re: [infinispan-dev] UI-Portlet-Plugins
by Tristan Tarrant
Hi Heiko,
adding infinispan-dev.
Thanks for taking the time to investigate this. One of the things that
would need to be "exposed" to such portlets is the ability to link to
RHQ views/portlets (e.g. go to a specific service view) so that
"drilling-down" would show the appropriate detailed node.
Additionally we would like to provide RHQ-specific configuration when
installing our "server" plugin, such as cache/containers dynagroups,
maybe even a custom initial dashboard. Can it be done?
Tristan
On 02/08/2014 07:43 PM, Heiko W.Rupp wrote:
> Hey,
>
> after talking with Tristan Tarrant from Infinispan, I got the idea that we could create a generic portlet that
> gets its content data as HTML from a server plugin. The server plugin then has access to all the server logic
> to do its task and can e.g. compute various stats of an Infinispan cluster.
>
> The following drawing illustrates that idea:
>
> Instances of the portlet will call the selected server plugin, invoking a well-known "interface" like "getMessage".
> This method will then do the processing and return an HTML snippet (not a full page), which is then displayed
> inside the portlet window.
>
> Attached are two screen shots from such a portlet + some PoC code.
>
> This is created in the backend via (abbreviated)
>
> complexResults.put(new PropertySimple("results", "<h1>Hello World</h1>Welcome to RHQ<br/>Have FUN<br/>Current date: " + date));
>
> This is the "generic" config screen:
>
> The drop down shows the list of plugins available.
>
> In this PoC, the plugin writer is responsible for creating sane HTML;
> if we decide to put that into RHQ, we may want to do some additional
> sanitization. I also have no idea about styling the inner content.
>
> While this is probably not the way for the (long term) future, at least
> the backend plugins can be re-used if we move to an Angular-based UI,
> so this investment would not be lost.
>
> Heiko
>
>
>
>
>
7.0.0.Alpha1
by Mircea Markus
Hey guys,
I think we have enough stuff to cut a 7.0.0.Alpha1 next week. Besides quite a few fixes that came in, we have:
- Vladimir's parallel map reduce (ISPN-2284)
- Tristan's authorisation for embedded mode (ISPN-3909)
- Will's clustered listeners (ISPN-3355)
Let's aim for Thu 20 Feb. Next in charge of releasing is Dan (the release rotation is defined in the release doc now).
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
infinispan-bom vs. infinispan-parent dependencies
by Martin Gencur
Hi,
there are currently two Maven pom files in Infinispan where dependency
versions are defined: infinispan-bom and infinispan-parent. For
instance, version.protostream is defined in the BOM while
version.commons.pool is defined in infinispan-parent.
This causes me trouble when I want to do filtering with
maven-resources-plugin and substitute versions of dependencies in a
certain configuration file, because properties defined in the BOM are not
visible to other modules. (I'm currently trying to generate a "features"
file for HotRod to be easily deployable into Karaf -
https://issues.jboss.org/browse/ISPN-3967, and I can't really access the
versions of some dependencies.)
We include the BOM file in infinispan-parent as a dependency with scope
"import" which causes the properties defined in the BOM to be lost.
Questions:
Is there a reason why we include it as a dependency and do not have it
as a parent of infinispan-parent? (as suggested in [1])
Can someone explain the reason why we have version declarations in two
separate files?
If you possibly know how to access properties in the BOM, please advise.
To me it seems impossible without some nasty hacks.
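For context, a sketch of the two setups being compared (artifact ids from this message; the version literal is illustrative). With scope "import", only the BOM's <dependencyManagement> section is merged in, so its <properties> never reach child modules; with a parent declaration, properties are inherited:

```xml
<!-- Current setup: import merges managed versions but loses properties
     such as ${version.protostream}. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.infinispan</groupId>
      <artifactId>infinispan-bom</artifactId>
      <version>7.0.0-SNAPSHOT</version>  <!-- illustrative version -->
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Alternative suggested in [1]: make the BOM the parent of
     infinispan-parent, so its <properties> are inherited and usable
     in resource filtering. -->
<parent>
  <groupId>org.infinispan</groupId>
  <artifactId>infinispan-bom</artifactId>
  <version>7.0.0-SNAPSHOT</version>  <!-- illustrative version -->
</parent>
```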
Thanks,
Martin
[1]
http://maven.apache.org/guides/introduction/introduction-to-dependency-me...
New Cache Entry Notifications
by William Burns
Hello all,
I have been working with notifications and most recently I have come
to look into events generated when a new entry is created. Now
normally I would just expect a CacheEntryCreatedEvent to be raised.
However, we currently raise a CacheEntryModifiedEvent and then a
CacheEntryCreatedEvent. I notice that there are comments around the
code saying that tests require both to be fired.
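A toy stdlib model of the behavior described above (this is not Infinispan code; the event names simply mirror the listener events discussed, and the dispatch logic is a simplification for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the current double-notification behavior: a put() on a
// brand-new key fires "modified" followed by "created", while an update
// fires "modified" only.
class NotifyingStore<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final List<String> firedEvents = new ArrayList<>();

    void put(K key, V value) {
        boolean isNew = !data.containsKey(key);
        data.put(key, value);
        // Current behavior: both events fire for a new entry.
        firedEvents.add("CacheEntryModified");
        if (isNew) {
            firedEvents.add("CacheEntryCreated");
        }
    }

    List<String> firedEvents() { return firedEvents; }
}
```

Under the proposed change, the first `put` of a key would add only "CacheEntryCreated" to the list.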
I am wondering if anyone has an objection to only raising a
CacheEntryCreatedEvent when a new cache entry is created. Does
anyone know why we currently raise both? Was it just so the
PutKeyValueCommand could more ignorantly just raise the
CacheEntryModified pre-event?
Any input would be appreciated, Thanks.
- Will
L1OnRehash Discussion
by William Burns
Hello everyone,
I wanted to discuss what I would call the dubious benefit of L1OnRehash,
especially compared to the complexity it brings.
L1OnRehash is used to retain a value by moving a previously owned
value into L1 when a rehash occurs and this node no longer owns
that value. Also, any current L1 values are removed when a rehash
occurs. Therefore it can only save a single remote get, for only a few
keys, when a rehash occurs.
This by itself is fine; however, L1OnRehash has many edge cases to handle
to guarantee consistency, as can be seen from
https://issues.jboss.org/browse/ISPN-3838. This can get quite
complicated for a feature that gives marginal performance increases
(especially given that the value may never have been read recently -
at least normal L1 usage guarantees that it was).
My first suggestion is instead to deprecate the L1OnRehash
configuration option and to remove this logic.
My second suggestion is a new implementation of L1OnRehash that is
always enabled when the L1 threshold is configured to 0. For those not
familiar, the L1 threshold controls whether invalidations are broadcast
instead of sent as individual messages. A value of 0 means to always
broadcast. This would allow for some benefits that we can't currently
get:
1. L1 values would never have to be invalidated on a rehash event
(guaranteeing local reads under rehash)
2. L1 requestors would no longer have to be tracked
However, every write would be required to send an invalidation, which
could slow write performance in additional cases (since we currently
only send invalidations when requestors are found). The difference
would be lessened with UDP, which is the transport I would assume
someone would use when configuring the L1 threshold to 0.
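For readers unfamiliar with the setting, a hedged sketch of the configuration the proposal keys on (Infinispan 6.x-style XML; attribute names as I recall them, so treat the fragment as illustrative rather than authoritative):

```xml
<!-- Illustrative only: an L1-enabled distributed cache where
     invalidationThreshold="0" means invalidations are always broadcast
     rather than sent to individually tracked requestors. -->
<namedCache name="distributedCache">
   <clustering mode="dist">
      <l1 enabled="true" lifespan="600000" invalidationThreshold="0"/>
   </clustering>
</namedCache>
```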
What do you guys think? I am thinking that no one minds the removal
of the L1OnRehash behavior we have currently (if you do, let me know). I am
quite curious what others think about the changes for an L1 threshold value
of 0 - maybe this configuration value is never used?
Thanks,
- Will