Event processing integration
by Jonathan Halliday
Alongside the recent talk of integrating Infinispan with Hadoop batch
processing, there has been some discussion of using the data grid
alongside an event stream processing system.
There are several directions we could consider here. In approximate
order of increasing complexity these are:
- Allow bi-directional flow of events, such that listeners on the cache
can feed events into the processing engine, and events in the
processing engine can update the cache.
- Allow the cache to hold lookup data for reference by user code
running in the processing engine, to speed up joining streamed events
to what would otherwise be data tables on disk (see the first sketch
after this list).
- Integrate with the processing engine itself, such that Infinispan can
be used to store items that would otherwise occupy precious RAM. This
one is probably only viable with the cooperation of the stream
processing system, so I'll base further discussion on Drools Fusion.
The engine uses memory for (a) rules, i.e. processing logic, some of
which is infrequently accessed - think of a decision tree in which some
branches are traversed more often than others, so there are
opportunities to swap parts out to the cache; and (b) state,
particularly sliding windows. Again, some data is infrequently
accessed: many sliding-window calculations (e.g. a running average)
only touch the head and tail of the window, so the events in between
can be swapped out (see the second sketch after this list).
Of course, these integrations require the stream processing engine to
be written to support such operations - careful handling of object
references is needed. Currently the engine doesn't work that way:
everything is focused on speed at the expense of memory.
- Borrow some ideas from the event processing DSLs, such that the data
grid query engine can independently support continuous (standing)
queries rather than just one-off queries. Arguably this is reinventing
the wheel, but for simple use cases it may be preferable to run the
stream processing logic directly in the grid rather than deploying a
dedicated event stream processing system. I think it's probably going
to require supporting lists as a first-class construct alongside maps
though. There are various kludges possible here, including the
brute-force approach of faking a continuous query by re-executing a
one-off query on each mutation (see the third sketch after this list),
but they tend to be inefficient. There is also the thorny problem of
supporting a (potentially distributed) clock, since a lot of use cases
need to reference the passage of time in the query, e.g. 'send event
to listener if avg in last N minutes > x'.
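A few sketches to make these concrete. First, the lookup-data case
from the second bullet: enriching a streamed event from grid-resident
reference data instead of a per-event disk or database join. Trade and
Instrument are invented example types, not any real API:

    import org.infinispan.Cache;

    // Invented event and reference-data types for illustration.
    class Instrument { String id; String description; }
    class Trade { String instrumentId; Instrument instrument; }

    class TradeEnricher {
        private final Cache<String, Instrument> referenceData;

        TradeEnricher(Cache<String, Instrument> referenceData) {
            this.referenceData = referenceData;
        }

        // Called from user code inside the stream processing engine;
        // the join is an in-memory get rather than a disk hit.
        Trade enrich(Trade trade) {
            trade.instrument = referenceData.get(trade.instrumentId);
            return trade;
        }
    }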
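Second, the sliding-window point: a running average only touches the
two ends of the window per event, so the interior entries are
candidates for swapping out to the grid. A minimal count-based sketch
in plain Java - none of this is Drools Fusion API:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class RunningAverageWindow {
        private final Deque<Double> window = new ArrayDeque<Double>();
        private final int capacity;
        private double sum;

        RunningAverageWindow(int capacity) { this.capacity = capacity; }

        double onEvent(double value) {
            window.addLast(value);               // touch the head
            sum += value;
            if (window.size() > capacity) {
                sum -= window.removeFirst();     // touch the tail
            }
            // everything between head and tail is never read again
            // until it expires, so it could live off-heap / in the grid
            return sum / window.size();
        }
    }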
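Third, the brute-force continuous-query emulation from the last
bullet, using the standard Infinispan listener annotations. The
threshold use case is invented; it re-scans the whole cache on every
write (hence 'inefficient') and ignores time windowing entirely, which
is exactly the distributed clock problem:

    import org.infinispan.Cache;
    import org.infinispan.notifications.Listener;
    import org.infinispan.notifications.cachelistener.annotation.CacheEntryCreated;
    import org.infinispan.notifications.cachelistener.annotation.CacheEntryModified;
    import org.infinispan.notifications.cachelistener.event.CacheEntryEvent;

    @Listener
    public class AverageThresholdWatcher {
        private final Cache<String, Double> cache;
        private final double threshold;

        public AverageThresholdWatcher(Cache<String, Double> cache, double threshold) {
            this.cache = cache;
            this.threshold = threshold;
            cache.addListener(this);
        }

        @CacheEntryCreated
        @CacheEntryModified
        public void onChange(CacheEntryEvent<String, Double> event) {
            if (event.isPre()) return; // react once, after the write completes
            double sum = 0;
            int n = 0;
            for (Double v : cache.values()) { // O(n) scan per mutation;
                sum += v;                     // values() may also be local-only
                n++;                          // in older distributed modes
            }
            if (n > 0 && sum / n > threshold) {
                System.out.println("avg " + (sum / n) + " > " + threshold);
            }
        }
    }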
Jonathan Halliday
Core developer, JBoss.
Infinispan - Hadoop integration
by Mircea Markus
Hi,
I had a very good conversation with Jonathan Halliday, Sanne and Emmanuel around the integration between Infinispan and Hadoop. Just to recap, the goal is to be able to run Hadoop M/R tasks on data that is stored in Infinispan in order to gain speed (once we have a prototype in place, one of the first tasks will be to validate this speed assumption).
In previous discussions we explored the idea of providing an HDFS implementation for Infinispan which, whilst doable, might not be the best integration point:
- in order to run M/R jobs, Hadoop interacts with two interfaces: InputFormat [1] and OutputFormat [2]
- it's the specific InputFormat and OutputFormat implementations that work on top of HDFS
- instead of implementing HDFS, we could provide implementations for the InputFormat and OutputFormat interfaces, which would give us more flexibility
- this seems to be the preferred integration point for other systems, such as Cassandra
- also important to note that we will have both a Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks, and Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) to get the data to be processed.
- assumptions that we'll need to validate: this approach doesn't impose any constraints on how data is stored in Infinispan and should allow data to be read through the Map interface. Also, the InputFormat and OutputFormat implementations would only use the get(k) and keySet() methods, and no native Infinispan M/R, which means that C/S access should also be possible (see the sketch after this list).
- very important: Hadoop's HDFS is an append-only file system, and M/R tasks operate on a snapshot of the data. From a task's perspective, the data in storage doesn't change after the task is started; more data can be appended whilst the task runs, but it won't be visible to the task. Infinispan has neither such an append-only structure nor MVCC. The closest thing we have is the snapshot isolation transactions implemented by the Cloud-TM project (not yet integrated). I assume that M/R tasks are built with this snapshot-isolation requirement on the storage - this is something we should investigate as well. It is possible that, in the first stages of this integration, we would require data stored in Infinispan to be read-only.
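To make the integration point concrete, here is a minimal, non-distributed sketch of what an Infinispan-backed InputFormat could look like. Only the Hadoop interfaces are real; everything Infinispan-specific is an assumption, including the static cache holder, which stands in for reading connection details from the Hadoop Configuration. It deliberately uses nothing but keySet() and get(k), per the assumption above:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.infinispan.Cache;

    public class InfinispanInputFormat extends InputFormat<String, String> {

        // Hypothetical plumbing: a real implementation would obtain the
        // cache from the Hadoop Configuration (e.g. a Hot Rod RemoteCache
        // for C/S mode) rather than a static field.
        public static volatile Cache<String, String> cache;

        // Single trivial split; a real implementation would emit one split
        // per Infinispan segment/owner so Hadoop can parallelise and
        // co-locate work with data.
        public static class WholeCacheSplit extends InputSplit implements Writable {
            @Override public long getLength() { return 0; }
            @Override public String[] getLocations() { return new String[0]; }
            @Override public void write(DataOutput out) throws IOException { }
            @Override public void readFields(DataInput in) throws IOException { }
        }

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            splits.add(new WholeCacheSplit());
            return splits;
        }

        @Override
        public RecordReader<String, String> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new RecordReader<String, String>() {
                private Iterator<String> keys;
                private String key;
                private String value;

                @Override
                public void initialize(InputSplit s, TaskAttemptContext ctx) {
                    // keySet() over the whole grid may be expensive - one of
                    // the assumptions to validate.
                    keys = cache.keySet().iterator();
                }

                @Override
                public boolean nextKeyValue() {
                    if (!keys.hasNext()) return false;
                    key = keys.next();
                    value = cache.get(key); // only get(k) and keySet(), as above
                    return true;
                }

                @Override public String getCurrentKey() { return key; }
                @Override public String getCurrentValue() { return value; }
                @Override public float getProgress() { return 0f; }
                @Override public void close() { }
            };
        }
    }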
[1] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Inpu...
[2] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Outp...
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
Learning ISPN7 DataContainer internals ... first steps
by cotton-ben
Hi Mircea and RedHat,
Dmitry and I are now taking initial steps to code the integration of
OpenHFT SHM as an off-heap ISPN7 DataContainer.
We have mused that a possible approach may be to exploit the symmetry
of ISPN7's existing DefaultDataContainer.java impl and copy and/or
extend that existing work into a new
org.infinispan.offheap.OffHeapDefaultDataContainer.java impl.
The key step would be for us to soundly and completely replace

    ConcurrentMap entries =
            CollectionFactory.makeConcurrentParallelMap(128, concurrencyLevel);

with

    ConcurrentMap entries = new net.openhft.collections.SharedHashMapBuilder()
            .generatedValueType(Boolean.TRUE)
            .entrySize(512)
            .create(
                    new File("/dev/shm/offHeapSharedHashMap.DataContainer"),
                    Object.class,
                    InternalCacheEntry.class);
We are of course complete newbies with regard to the ISPN7
DataContainer internals. Before we get into building and testing
compelling exercises and hardening the OffHeapDefaultDataContainer,
would you please comment on whether these seem like the correct first
steps?
https://github.com/Cotton-Ben/infinispan/blob/master/off-heap/src/main/ja...
--
Infinispan Query API module
by Bilgin Ibryam
Hi all,
I was working on extending the camel-infinispan component with remote
query capability and just realized that
org.infinispan/infinispan-query/6.0.1.Final depends on hibernate-hql-parser
and hibernate-hql-lucene, which are still in Alpha.
Am I missing something, or is there a way to not depend on Alpha
versions of these artifacts from a Final version artifact?
Thanks,
--
Bilgin Ibryam
Apache Camel & Apache OFBiz committer
Blog: ofbizian.com
Twitter: @bibryam <https://twitter.com/bibryam>
Author of Instant Apache Camel Message Routing
http://www.amazon.com/dp/1783283475
Re: [infinispan-dev] Design change in Infinispan Query
by Mircea Markus
On Feb 3, 2014, at 9:32 AM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
> Sure, searching across any cache is useful. What I was advocating is that if you can search across more than one cache transparently, then you probably need CRUD across more than one cache transparently as well. And this is not being discussed.
Not sure what you mean by CRUD over multiple caches? ATM one can run a TX over multiple caches, but I think there's something else you have in mind :-)
>
> I have to admit that having to add a cache name to the stored elements of the index documents makes me a bit sad.
sad because of the increased index size?
> I was already unhappy when I had to do it for class names. Renaming a cache will be a heavy operation too.
> Sanne, if we know that we don't share the same index for different caches, can we avoid the need to store the cache name in each document?
>
> BTW, this discussion should be in the open.
+1
>
> On 31 Jan 2014, at 18:04, Adrian Nistor <anistor(a)gmail.com> wrote:
>
>> I think it conceptually makes sense to have one entity type per cache but this should be a good practice rather than an enforced constraint. It would be a bit late and difficult to add such a constraint now.
>>
>> The design change we are talking about is being able to search across caches. That can easily be implemented regardless of this. We can move the SearchManager from Cache scope to CacheManager scope. Indexes are bound to types not to caches anyway, so same-type entities from multiple caches can end up in the same index, we just need to store an extra hidden field: the name of the originating cache. This move would also allow us to share some lucene/hsearch resources.
>>
>> We can easily continue to support Search.getSearchManager(cache) so old api usages continue to work. This would return a delegating/decorating SearchManager that creates queries that are automatically restricted to the scope of the given cache.
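A rough sketch of that restriction at the Lucene level - the hidden field name is invented for illustration, and the real logic would live inside the delegating SearchManager rather than in user code:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public final class CacheScopedQuery {
        // Invented name for the hidden originating-cache field.
        static final String CACHE_NAME_FIELD = "__ispn_cache_name";

        // AND the user's query with a term on the hidden field, so results
        // are restricted to documents indexed from the given cache.
        static Query restrictToCache(Query userQuery, String cacheName) {
            BooleanQuery restricted = new BooleanQuery();
            restricted.add(userQuery, BooleanClause.Occur.MUST);
            restricted.add(new TermQuery(new Term(CACHE_NAME_FIELD, cacheName)),
                    BooleanClause.Occur.MUST);
            return restricted;
        }
    }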
>>
>> Piece of cake? :)
>>
>>
>>
>> On Thu, Jan 30, 2014 at 9:56 PM, Mircea Markus <mmarkus(a)redhat.com> wrote:
>> curious to see your thoughts on this: it is a recurring topic and will affect the way we design things in the future in a significant way.
>> E.g. if we think (recommend) that a distinct cache should be used for each entity, then we'll need querying to work between caches. Also some cache stores can be built along these lines (e.g. for the JPA cache store we only need it to support a single entity type).
>>
>> Begin forwarded message:
>>
>> > On Jan 30, 2014, at 9:42 AM, Galder Zamarreño <galder(a)redhat.com> wrote:
>> >
>> >>
>> >> On Jan 21, 2014, at 11:52 PM, Mircea Markus <mmarkus(a)redhat.com> wrote:
>> >>
>> >>>
>> >>> On Jan 15, 2014, at 1:42 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>> >>>
>> >>>> By the way, people looking for that feature are also asking for a unified Cache API accessing these several caches right? Otherwise I am not fully understanding why they ask for a unified query.
>> >>>> Do you have written detailed use cases somewhere for me to better understand what is really requested?
>> >>>
>> >>> IMO from a user perspective, being able to run queries spanning several caches simplifies the programming model: each cache corresponds to a single entity type, with potentially different configuration.
>> >>
>> >> Not sure if it simplifies things TBH if the configuration is the same. IMO, it just adds clutter.
>> >
>> > Not sure I follow: having a cache that contains both Cars and Persons sounds more cluttered to me. I think it's cumbersome to write any kind of query against a heterogeneous cache, e.g. Map/Reduce tasks that need to count all the green Cars would need to be aware of Persons and ignore them. Not only is it harder to write, but it discourages code reuse and makes it hard to maintain (if you add Pets to the same cache in the future, you need to update the M/R code as well). And of course there are also cache-level configuration options that are not immediately obvious at design time but will be in the future (there are more Persons than Cars, they live longer/expire etc.): mixing everything together in the same cache from the beginning is a design decision that might bite you in the future.
>> >
>> > The way I see it - and I'm very curious to hear your opinion on this - following a database analogy, the CacheManager corresponds to a Database and the Cache to a Table. Hence my thought that queries spanning multiple caches are both useful and needed (same as queries spanning multiple tables).
>> >
>> >
>>
>> Cheers,
>> --
>> Mircea Markus
>> Infinispan lead (www.infinispan.org)
>>
>>
>>
>>
>>
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
Proposed ISPN 7 compilation incompatibilities with ISPN 6
by William Burns
Recently, while working on some ISPN 7 features, I came across some
public API inconsistencies. I wanted to bring these up in case anyone
has concerns.
The first few are pretty trivial, but they can cause compilation
errors between versions if user code implements these interfaces and
specifies the type parameters.
1. The CacheWriter interface currently defines a delete(K key) method.
To be more in line with the JCache and java.util.collections
interfaces, I was hoping to change this to delete(Object key) instead.
2. The CacheLoader interface currently defines load(K key) and
contains(K key) methods. Similar to the above, I was hoping to change
the K type to Object, to be more in line with the JCache and
java.util.collections interfaces. Both changes are sketched below.
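Something like this - method lists trimmed to the ones under discussion, and return types are my reading of the current SPI rather than a spec:

    import org.infinispan.marshall.core.MarshalledEntry;

    interface CacheWriter<K, V> {
        boolean delete(Object key);   // was: boolean delete(K key)
    }

    interface CacheLoader<K, V> {
        MarshalledEntry<K, V> load(Object key);   // was: load(K key)
        boolean contains(Object key);             // was: contains(K key)
    }

This mirrors java.util.Map, where get(Object)/remove(Object) take Object keys even though the map itself is typed.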
This last one is a bit more major: we currently have 2 classes named
KeyFilter, one residing in the org.infinispan.notifications package
and another in the org.infinispan.persistence.spi.AdvancedCacheLoader
interface.
3. My plan is to consolidate these into a single class in a new core
org.infinispan.filter package. I would also move the new
KeyValueFilter class that was added for cluster listeners into this
package, along with the accompanying implementations (rough shape
below).
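Roughly - the accept signatures here are my reading of the existing classes, not a spec:

    import org.infinispan.metadata.Metadata;

    // Consolidated filters in the proposed org.infinispan.filter package.
    interface KeyFilter<K> {
        boolean accept(K key);
    }

    interface KeyValueFilter<K, V> {
        boolean accept(K key, V value, Metadata metadata);
    }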
The first 2 are currently implemented as changes in
https://github.com/infinispan/infinispan/pull/2423. The latter I was
going to add into the changes for
https://issues.jboss.org/browse/ISPN-4068.
Let me know what you guys think.
- Will
Never push with --force
by Sanne Grinovero
Yesterday I pushed a fix from Dan upstream, and this morning the fix
wasn't there anymore. Some unrelated fix was merged in the meantime.
I only realized this because I was updating my personal origin and git
wouldn't allow me to push the non-fast-forward branch, so in a sense I
could detect it because of how our workflow works (good).
I have no idea how it happened, but I guess it won't hurt to remind
everyone that we should never push with --force, at least not without
warning the whole list.
I've now cherry-picked and fixed master by re-pushing the missing
patch, so nothing bad happened :-)
Sanne
Infinispan HotRod C# Client 7.0.0.Alpha1
by Ion Savin
Hi all,
Infinispan HotRod C# Client 7.0.0.Alpha1 is now available.
This new version is a C# wrapper over the native client and brings
support for L2 and L3 client intelligence levels in addition to L1. As
more features are added to the native client they will make their way
into the C# client as well.
You can find the .msi installer on the download page [1] and the
source code on GitHub [2].
Please give it a try and let us know what you think.
[1] http://infinispan.org/hotrod-clients/
[2] https://github.com/infinispan/dotnet-client
Regards,
Ion Savin