Re: [infinispan-dev] MapReduce limitations and suggestions.
by Evangelos Vazaios
On 02/17/2014 10:42 AM, infinispan-dev-request(a)lists.jboss.org wrote:
> Hi Etienne
>
> I was going to suggest using a combiner - the combiner would process the
> mapper results from just one node, so you should need at most double the
> memory on that node. I guess we could reduce the memory requirements even
> more if the combiner could run concurrently with the mapper... Vladimir,
> does it sound like a reasonable feature request?
>
There are algorithms where combiners cannot be applied.
> I'm afraid in your situation using a cache store wouldn't help, as the
> intermediate values for the same key are stored as a list in a single
> entry. So if all cars are red, there would be just one intermediate key in
> the intermediate cache, and there would be nothing to evict to the cache
> store. Vladimir, do you think we could somehow "chunk" the intermediary
> values into multiple entries grouped by the intermediary key, to support
> this scenario?
>
I was thinking of a custom cache implementation that tracks the overall
size of the cache and the size of each key individually, and spills
entries to disk when a threshold is reached. Note that I am not familiar
with the internals of Infinispan, but I think it is doable. Such a cache
solves the problem in both cases (when one key is too large to fit in
memory, as in my example, and when the keys assigned to one reducer
exceed its memory).
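To make the idea concrete, here is a minimal stdlib-only sketch of such a size-tracking, disk-spilling store. This is not Infinispan code; the class name, the per-key byte accounting, and the "spill the largest key" policy are all illustrative assumptions:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

// Hypothetical sketch of a store that tracks the in-memory size of each key's
// intermediate value list and spills the largest list to disk once a global
// threshold is crossed.
class SpillingStore {
    private final long thresholdBytes;
    private final Map<String, List<String>> inMemory = new HashMap<>();
    private final Map<String, Long> bytesPerKey = new HashMap<>();
    private final Map<String, Path> spilled = new HashMap<>();
    private long totalBytes;

    SpillingStore(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Append one intermediate value for a key, spilling if over the threshold.
    void append(String key, String value) {
        inMemory.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        long delta = value.getBytes(StandardCharsets.UTF_8).length;
        bytesPerKey.merge(key, delta, Long::sum);
        totalBytes += delta;
        while (totalBytes > thresholdBytes && !inMemory.isEmpty()) {
            spillLargestKey();
        }
    }

    // Move the key with the largest in-memory footprint to a temp file.
    private void spillLargestKey() {
        String victim = Collections.max(bytesPerKey.keySet(),
                Comparator.comparing(bytesPerKey::get));
        try {
            Path file = Files.createTempFile("spill-" + victim, ".tmp");
            Files.write(file, inMemory.remove(victim));
            spilled.put(victim, file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        totalBytes -= bytesPerKey.remove(victim);
    }

    boolean isSpilled(String key) { return spilled.containsKey(key); }
    long totalInMemoryBytes() { return totalBytes; }
}
```

With a 10-byte threshold, appending 10 bytes of values under one key ("all cars are red") forces that key's list onto disk while smaller keys stay in memory.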
> For reference, though, a limited version of what you're asking for is
> already available. You can change the configuration of the intermediary
> cache by defining a "__tmpMapReduce" cache in your configuration. That
> configuration will be used for all M/R tasks, whether they use the shared
> intermediate cache or they create their own.
>
I have one question about this. If I start two MR tasks at once, will
these tasks use the same cache? If so, will the intermediate results be
mixed? This cache could also be useful as a test case.
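For reference, the override mentioned above works by defining a cache with the reserved name "__tmpMapReduce". A hedged sketch of what such a definition might look like (Infinispan 6.x-style XML; the cache name comes from the message above, while the eviction and store settings are illustrative assumptions):

```xml
<!-- Illustrative only: override the shared intermediate cache used by M/R
     tasks. The eviction and persistence values are assumptions, not a
     recommendation from this thread. -->
<namedCache name="__tmpMapReduce">
   <clustering mode="dist"/>
   <eviction strategy="LRU" maxEntries="10000"/>
   <persistence passivation="true">
      <singleFile location="/tmp/mapreduce-spill"/>
   </persistence>
</namedCache>
```

As Dan notes, this helps only when the intermediate values are spread over many keys; a single huge per-key list cannot be passivated piecemeal.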
Regards,
Evangelos
> Cheers
> Dan
>
>
>
> On Mon, Feb 17, 2014 at 10:18 AM, Etienne Riviere
> <etienne.riviere(a)unine.ch>wrote:
>
>> > Hi Radim,
>> >
>> > I might misunderstand your suggestion, but many M/R jobs actually require
>> > running the two phases one after the other, and hence storing the
>> > intermediate results somewhere. While some may slightly reduce intermediate
>> > memory usage by using a combiner function (e.g., the word-count example), I
>> > don't see how we can avoid intermediate storage altogether.
>> >
>> > Thanks,
>> > Etienne (LEADS project -- as is Evangelos, who initiated the thread)
>> >
>> > On 17 Feb 2014, at 08:48, Radim Vansa <rvansa(a)redhat.com> wrote:
>> >
>>> > > I think that the intermediate cache is not required at all. The M/R
>>> > > algorithm itself can (and should!) run with memory occupied by the
>>> > > result of the reduction. The current implementation, with Map first and
>>> > > Reduce after that, will always have these problems; using a cache for
>>> > > temporarily storing the result is only a workaround.
>>> > >
>>> > > The only situation when a temporary cache could be useful is when the
>>> > > result grows linearly (or close to that, or even more) with the number of
>>> > > reduced entries. This would be the case for groupBy producing Map<Color,
>>> > > List<Entry>> from all entries in the cache. Then the task does not scale
>>> > > and should be redesigned anyway, but flushing the results into a cache
>>> > > backed by a cache store could help.
>>> > >
>>> > > Radim
>>> > >
>>> > > On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>>>> > >> Tristan,
>>>> > >>
>>>> > >> Actually they are not addressed in this pull request, but the feature
>>>> > >> where a custom output cache is used instead of results being returned is
>>>> > >> next in the implementation pipeline.
>>>> > >>
>>>> > >> Evangelos, indeed, depending on the reducer function all intermediate
>>>> > >> KOut/VOut pairs might be moved to a single node. How would a custom cache
>>>> > >> help in this case?
>>>> > >>
>>>> > >> Regards,
>>>> > >> Vladimir
>>>> > >>
>>>> > >>
>>>> > >> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>>>> > >>> Hi Evangelos,
>>>>> > >>>
>>>>> > >>> you might be interested in looking into a current pull request which
>>>>> > >>> addresses some (all?) of these issues
>>>>> > >>>
>>>>> > >>> https://github.com/infinispan/infinispan/pull/2300
>>>>> > >>>
>>>>> > >>> Tristan
>>>>> > >>>
>>>>> > >>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>>>> > >>>> Hello everyone,
>>>>>> > >>>>
>>>>>> > >>>> I started using the MapReduce implementation of Infinispan and I came
>>>>>> > >>>> across some possible limitations. Thus, I want to make some
>>>>>> > >>>> suggestions about the MapReduce (MR) implementation of Infinispan.
>>>>>> > >>>> Depending on the algorithm, there might be some memory problems,
>>>>>> > >>>> especially for intermediate results.
>>>>>> > >>>> An example of such a case is group by. Suppose that we have a cluster
>>>>>> > >>>> of 2 nodes with 2 GB available each, and a distributed cache where
>>>>>> > >>>> simple car objects (id, brand, colour) are stored, with a total data
>>>>>> > >>>> size of 3.5 GB. If all objects have the same colour, then all 3.5 GB
>>>>>> > >>>> would go to only one reducer, and as a result an OutOfMemoryException
>>>>>> > >>>> will be thrown.
>>>>>> > >>>>
>>>>>> > >>>> To overcome these limitations, I propose to add as a parameter the
>>>>>> > >>>> name of the intermediate cache to be used. This will enable the
>>>>>> > >>>> creation of a custom configured cache that deals with the memory
>>>>>> > >>>> limitations.
>>>>>> > >>>>
>>>>>> > >>>> Another feature that I would like to have is to set the name of the
>>>>>> > >>>> output cache. The reasoning behind this is similar to the one
>>>>>> > >>>> mentioned above.
>>>>>> > >>>>
>>>>>> > >>>> I await your thoughts on these two suggestions.
>>>>>> > >>>>
>>>>>> > >>>> Regards,
>>>>>> > >>>> Evangelos
>>>>>> > >>>> _______________________________________________
>>>>>> > >>>> infinispan-dev mailing list
>>>>>> > >>>> infinispan-dev(a)lists.jboss.org
>>>>>> > >>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>> > >>>>
>>>>>> > >>>>
>>>>> > >>>
>>>>> > >>>
>>> > >
>>> > >
>>> > > --
>>> > > Radim Vansa <rvansa(a)redhat.com>
>>> > > JBoss DataGrid QA
>>> > >
HotRod near caches
by Tristan Tarrant
Hi people,
this is a bit of a dump of ideas for getting our HotRod client in shape
for supporting near caches:
- RemoteCaches should have an optional internal cache. This cache should
probably be some form of bounded, expiration-aware hashmap which would
serve as a local copy of data retrieved over the wire. In the past we
have advocated combining an EmbeddedCacheManager with a
RemoteCacheStore to achieve this, but that is only applicable to Java
clients; we need to think of a solution for our other clients too.
- Once remote listeners are in place, a RemoteCache would automatically
invalidate entries in the near-cache.
- Remote Query should "pass through" the near cache, so that entries
retrieved from a query would essentially be cached locally following the
same semantics. This can be achieved by having the QUERY verb return
just the set of matching keys instead of the whole entries.
- Optionally, we could even think about a query cache which would hash the
query DSL and store the resulting keys locally, so that successive
invocations of a cached query wouldn't go over the wire. Matching
this with invalidation is probably a tad more complex, and I'd probably
avoid going down that path.
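A minimal stdlib-only sketch of the "bounded expiration-aware hashmap" mentioned in the first point (not actual HotRod client code; the class name, LRU bound, and single TTL are illustrative assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical near-cache sketch: size-bounded LRU map with per-entry
// expiration, invalidated externally when a remote listener fires.
class NearCache<K, V> {
    private static final class Timestamped<T> {
        final T value;
        final long expiresAt;
        Timestamped(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final int maxEntries;
    private final long ttlMillis;
    private final LinkedHashMap<K, Timestamped<V>> map;

    NearCache(int maxEntries, long ttlMillis) {
        this.maxEntries = maxEntries;
        this.ttlMillis = ttlMillis;
        // Access-order LinkedHashMap: evicts the least recently used entry
        // once the bound is exceeded.
        this.map = new LinkedHashMap<K, Timestamped<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timestamped<V>> eldest) {
                return size() > NearCache.this.maxEntries;
            }
        };
    }

    synchronized void put(K key, V value) {
        map.put(key, new Timestamped<>(value, System.currentTimeMillis() + ttlMillis));
    }

    synchronized V get(K key) {
        Timestamped<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAt) {
            map.remove(key);   // lazily expire stale entries on read
            return null;
        }
        return e.value;
    }

    // Called when a remote listener reports a server-side modification/removal.
    synchronized void invalidate(K key) { map.remove(key); }

    synchronized int size() { return map.size(); }
}
```

The `invalidate` method is the hook the second point needs: a remote-listener callback would simply call it for each changed key.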
Tristan
Remote Query improvements
by Tristan Tarrant
Hi everybody,
last week I developed a simple application using Remote Query, and ran
into a few issues. Some of them are just technical hurdles, while others
have to do with the complexity of the developer experience. Here they
are for open discussion:
- the schema registry should be persistent. Alternatively, we should be able
to either specify the ProtoBuf schema from the <indexing /> configuration
in the server subsystem or use the server's deployment processor to "deploy"
schemas.
- the server should store the individual protobuf source schemas, to allow
easy inspection/update of each using our management tools. The
server itself should then compile the protobuf schemas into the binary
representation whenever any of the source schemas changes. This would
require a Java implementation of the ProtoBuf schema compiler, which
probably wouldn't be too hard to do with Antlr.
- we need to be able to annotate individual protobuf fields for indexing
(probably by using specially-formatted comments, a la doclets) to avoid
indexing all of the fields
- since remote query is already imbued with JPA in some form, an
interesting project would be to implement a JPA annotation processor
which can produce a set of ProtoBuf schemas from JPA-annotated classes.
- on top of the above, a ProtoBuf marshaller/unmarshaller which can use
the JPA entities directly.
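To illustrate the doclet-style idea from the third point, a hypothetical schema fragment (the comment syntax shown here is invented for illustration; no concrete syntax is proposed in this message):

```protobuf
// Hypothetical doclet-style index annotations in comments:
/* @Indexed */
message Car {
   /* @IndexedField */
   required string brand = 1;

   /* @IndexedField */
   optional string colour = 2;

   // no annotation: excluded from the index
   optional int32 id = 3;
}
```

A comment-based scheme keeps the schema valid for any stock protobuf compiler while letting the server's compiler extract the indexing metadata.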
Tristan
Re: [infinispan-dev] UI-Portlet-Plugins
by Tristan Tarrant
Hi Heiko,
adding infinispan-dev.
Thanks for taking the time to investigate this. One of the things that
would need to be "exposed" to such portlets is the ability to link to
RHQ views/portlets (e.g. go to a specific service view) so that
"drilling-down" would show the appropriate detailed node.
Additionally we would like to provide RHQ-specific configuration when
installing our "server" plugin, such as cache/containers dynagroups,
maybe even a custom initial dashboard. Can it be done?
Tristan
On 02/08/2014 07:43 PM, Heiko W.Rupp wrote:
> Hey,
>
> after talking with Tristan Tarrant from Infinispan, I got the idea that we could create a generic portlet that
> gets its content data as HTML from a server plugin. The server plugin then has access to all the server logic
> to do its task and can e.g. compute various stats of an Infinispan cluster.
>
> The following drawing illustrates that idea:
>
> Instances of the portlet will call the selected server plugin, invoking a well-known "interface" like "getMessage".
> This method will then do the processing and return an HTML snippet (not a full page), which is then displayed
> inside the portlet window.
>
> Attached are two screen shots from such a portlet + some PoC code.
>
> This is created in the backend via (abbreviated)
>
> complexResults.put(new PropertySimple("results", "<h1>Hello World</h1>Welcome to RHQ<br/>Have FUN<br/>Current date: " + date));
>
> This is the "generic" config screen:
>
> The drop down shows the list of plugins available.
>
> In this PoC, the plugin writer is responsible for creating sane HTML;
> if we decide to put that into RHQ, we may want to do some additional
> sanitization. I also have no idea about styling the inner content.
>
> While this is probably not the way for the (long term) future, at least
> the backend plugins can be re-used if we move to an Angular-based UI,
> so this investment would not be lost.
>
> Heiko
>
>
>
>
>
7.0.0.Alpha1
by Mircea Markus
Hey guys,
I think we have enough stuff to cut a 7.0.0.Alpha1 next week. Besides quite a few fixes that came in, we have:
- Vladimir's parallel map reduce (ISPN-2284)
- Tristan's authorisation for embedded mode (ISPN-3909)
- Will's clustered listeners (ISPN-3355)
Let's aim for Thu 20 Feb. Next in charge of releasing is Dan (the release rotation is defined in the release doc now).
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
infinispan-bom vs. infinispan-parent dependencies
by Martin Gencur
Hi,
there are currently two Maven pom files in Infinispan where dependency
versions are defined: infinispan-bom and infinispan-parent. For
instance, version.protostream is defined in the BOM while
version.commons.pool is defined in infinispan-parent.
This causes me trouble when I want to do filtering with
maven-resources-plugin and substitute versions of dependencies in a
certain configuration file, because properties defined in the BOM are not
visible to other modules. (I'm currently trying to generate a "features"
file for HotRod to be easily deployable into Karaf -
https://issues.jboss.org/browse/ISPN-3967, and I can't really access the
versions of some dependencies.)
We include the BOM file in infinispan-parent as a dependency with scope
"import" which causes the properties defined in the BOM to be lost.
Questions:
Is there a reason why we include it as a dependency and do not have it
as a parent of infinispan-parent? (as suggested in [1])
Can someone explain the reason why we have version declarations in two
separate files?
If you possibly know how to access properties in the BOM, please advise.
To me it seems impossible without some nasty hacks.
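For context, a sketch of the two setups being compared (artifact ids from this message; the version literal is illustrative). With scope "import", only the BOM's <dependencyManagement> section is merged in, so its <properties> never reach child modules; with a parent declaration, properties are inherited:

```xml
<!-- Current setup: import merges managed versions but loses properties
     such as ${version.protostream}. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.infinispan</groupId>
      <artifactId>infinispan-bom</artifactId>
      <version>7.0.0-SNAPSHOT</version>  <!-- illustrative version -->
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Alternative suggested in [1]: make the BOM the parent of
     infinispan-parent, so its <properties> are inherited and usable
     in resource filtering. -->
<parent>
  <groupId>org.infinispan</groupId>
  <artifactId>infinispan-bom</artifactId>
  <version>7.0.0-SNAPSHOT</version>  <!-- illustrative version -->
</parent>
```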
Thanks,
Martin
[1]
http://maven.apache.org/guides/introduction/introduction-to-dependency-me...
New Cache Entry Notifications
by William Burns
Hello all,
I have been working with notifications and most recently I have come
to look into events generated when a new entry is created. Now
normally I would just expect a CacheEntryCreatedEvent to be raised.
However, we currently raise a CacheEntryModifiedEvent and then a
CacheEntryCreatedEvent. I notice that there are comments around the
code saying that tests require both to be fired.
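A toy stdlib model of the behavior described above (this is not Infinispan code; the event names simply mirror the listener events discussed, and the dispatch logic is a simplification for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the current double-notification behavior: a put() on a
// brand-new key fires "modified" followed by "created", while an update
// fires "modified" only.
class NotifyingStore<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final List<String> firedEvents = new ArrayList<>();

    void put(K key, V value) {
        boolean isNew = !data.containsKey(key);
        data.put(key, value);
        // Current behavior: both events fire for a new entry.
        firedEvents.add("CacheEntryModified");
        if (isNew) {
            firedEvents.add("CacheEntryCreated");
        }
    }

    List<String> firedEvents() { return firedEvents; }
}
```

Under the proposed change, the first `put` of a key would add only "CacheEntryCreated" to the list.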
I am wondering if anyone has an objection to only raising a
CacheEntryCreatedEvent when a new cache entry is created. Does
anyone know why we currently raise both? Was it just so the
PutKeyValueCommand could more ignorantly just raise the
CacheEntryModified pre-event?
Any input would be appreciated, Thanks.
- Will
L1OnRehash Discussion
by William Burns
Hello everyone,
I wanted to discuss what I would call the dubious benefit of L1OnRehash,
especially compared to the complexity it brings.
L1OnRehash is used to retain a value by moving a previously owned
value into L1 when a rehash occurs and this node no longer owns
that value. Also, any current L1 values are removed when a rehash
occurs. Therefore it can only save a single remote get, for only a few
keys, when a rehash occurs.
This by itself is fine; however, L1OnRehash has many edge cases to handle
to guarantee consistency, as can be seen from
https://issues.jboss.org/browse/ISPN-3838. This can get quite
complicated for a feature that gives marginal performance increases
(especially given that the value may never have been read recently -
at least normal L1 usage guarantees that it was).
My first suggestion is instead to deprecate the L1OnRehash
configuration option and to remove this logic.
My second suggestion is a new implementation of L1OnRehash that is
always enabled when the L1 threshold is configured to 0. For those not
familiar, the L1 threshold controls whether invalidations are broadcast
instead of sent as individual messages. A value of 0 means to always
broadcast. This would allow for some benefits that we can't currently
get:
1. L1 values would never have to be invalidated on a rehash event
(guaranteeing local reads under rehash)
2. L1 requestors would no longer have to be tracked
However, every write would be required to send an invalidation, which
could slow write performance in additional cases (since we currently
only send invalidations when requestors are found). The difference
would be lessened with UDP, which is the transport I would assume
someone would use when configuring the L1 threshold to 0.
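For readers unfamiliar with the setting, a hedged sketch of the configuration the proposal keys on (Infinispan 6.x-style XML; attribute names as I recall them, so treat the fragment as illustrative rather than authoritative):

```xml
<!-- Illustrative only: an L1-enabled distributed cache where
     invalidationThreshold="0" means invalidations are always broadcast
     rather than sent to individually tracked requestors. -->
<namedCache name="distributedCache">
   <clustering mode="dist">
      <l1 enabled="true" lifespan="600000" invalidationThreshold="0"/>
   </clustering>
</namedCache>
```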
What do you guys think? I am thinking that no one minds the removal
of the L1OnRehash behavior we have currently (if you do, let me know). I am
quite curious what others think about the changes for an L1 threshold value
of 0 - maybe this configuration value is never used?
Thanks,
- Will