Event processing integration
by Jonathan Halliday
Alongside the recent talk of integrating Infinispan with Hadoop batch
processing, there has been some discussion of using the data grid
alongside an event stream processing system.
There are several directions we could consider here. In approximate
order of increasing complexity these are:
- Allow bi-directional flow of events, such that listeners on the cache
can feed events into the processing engine, and events in the
processing engine can update the cache.
- Allow the cache to hold lookup data for reference by user code
running in the processing engine, to speed up joining streamed events
to what would otherwise be data tables on disk (see the first sketch
after this list).
- Integrate with the processing engine itself, such that Infinispan can
be used to store items that would otherwise occupy precious RAM. This
one is probably only viable with the cooperation of the stream
processing system, so I'll base further discussion on Drools Fusion.
The engine uses memory for (a) rules, i.e. processing logic, some of
which is infrequently accessed - think of a decision tree in which some
branches are traversed more often than others, so there are
opportunities to swap parts out to the cache; and (b) state,
particularly sliding windows. Again, some data is infrequently
accessed: many sliding-window calculations (e.g. a running average)
only touch the head and tail of the window, so the events in between
can be swapped out (see the second sketch after this list).
Of course, these integrations require the stream processing engine to
be written to support such operations - careful handling of object
references is needed. Currently the engine doesn't work that way:
everything is focused on speed at the expense of memory.
- Borrow some ideas from the event processing DSLs, such that the data
grid query engine can independently support continuous (standing)
queries rather than just one-off queries. Arguably this is reinventing
the wheel, but for simple use cases it may be preferable to run the
stream processing logic directly in the grid rather than deploying a
dedicated event stream processing system. I think it's probably going
to require supporting lists as a first-class construct alongside maps
though. There are various kludges possible here, including the
brute-force approach of faking a continuous query by re-executing a
one-off query on each mutation (see the third sketch after this list),
but they tend to be inefficient. There is also the thorny problem of
supporting a (potentially distributed) clock, since a lot of use cases
need to reference the passage of time in the query, e.g. 'send event
to listener if avg in last N minutes > x'.
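A few sketches to make these concrete. First, the lookup-data case
from the second bullet: enriching a streamed event from grid-resident
reference data instead of a per-event disk or database join. Trade and
Instrument are invented example types, not any real API:

    import org.infinispan.Cache;

    // Invented event and reference-data types for illustration.
    class Instrument { String id; String description; }
    class Trade { String instrumentId; Instrument instrument; }

    class TradeEnricher {
        private final Cache<String, Instrument> referenceData;

        TradeEnricher(Cache<String, Instrument> referenceData) {
            this.referenceData = referenceData;
        }

        // Called from user code inside the stream processing engine;
        // the join is an in-memory get rather than a disk hit.
        Trade enrich(Trade trade) {
            trade.instrument = referenceData.get(trade.instrumentId);
            return trade;
        }
    }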
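Second, the sliding-window point: a running average only touches the
two ends of the window per event, so the interior entries are
candidates for swapping out to the grid. A minimal count-based sketch
in plain Java - none of this is Drools Fusion API:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class RunningAverageWindow {
        private final Deque<Double> window = new ArrayDeque<Double>();
        private final int capacity;
        private double sum;

        RunningAverageWindow(int capacity) { this.capacity = capacity; }

        double onEvent(double value) {
            window.addLast(value);               // touch the head
            sum += value;
            if (window.size() > capacity) {
                sum -= window.removeFirst();     // touch the tail
            }
            // everything between head and tail is never read again
            // until it expires, so it could live off-heap / in the grid
            return sum / window.size();
        }
    }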
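Third, the brute-force continuous-query emulation from the last
bullet, using the standard Infinispan listener annotations. The
threshold use case is invented; it re-scans the whole cache on every
write (hence 'inefficient') and ignores time windowing entirely, which
is exactly the distributed clock problem:

    import org.infinispan.Cache;
    import org.infinispan.notifications.Listener;
    import org.infinispan.notifications.cachelistener.annotation.CacheEntryCreated;
    import org.infinispan.notifications.cachelistener.annotation.CacheEntryModified;
    import org.infinispan.notifications.cachelistener.event.CacheEntryEvent;

    @Listener
    public class AverageThresholdWatcher {
        private final Cache<String, Double> cache;
        private final double threshold;

        public AverageThresholdWatcher(Cache<String, Double> cache, double threshold) {
            this.cache = cache;
            this.threshold = threshold;
            cache.addListener(this);
        }

        @CacheEntryCreated
        @CacheEntryModified
        public void onChange(CacheEntryEvent<String, Double> event) {
            if (event.isPre()) return; // react once, after the write completes
            double sum = 0;
            int n = 0;
            for (Double v : cache.values()) { // O(n) scan per mutation;
                sum += v;                     // values() may also be local-only
                n++;                          // in older distributed modes
            }
            if (n > 0 && sum / n > threshold) {
                System.out.println("avg " + (sum / n) + " > " + threshold);
            }
        }
    }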
Jonathan Halliday
Core developer, JBoss.
Infinispan - Hadoop integration
by Mircea Markus
Hi,
I had a very good conversation with Jonathan Halliday, Sanne and Emmanuel around the integration between Infinispan and Hadoop. Just to recap, the goal is to be able to run Hadoop M/R tasks on data that is stored in Infinispan in order to gain speed (once we have a prototype in place, one of the first tasks will be to validate this speed assumption).
In previous discussions we explored the idea of providing an HDFS implementation for Infinispan which, whilst doable, might not be the best integration point:
- in order to run M/R jobs, Hadoop interacts with two interfaces: InputFormat [1] and OutputFormat [2]
- it's the specific InputFormat and OutputFormat implementations that work on top of HDFS
- instead of implementing HDFS, we could provide implementations for the InputFormat and OutputFormat interfaces, which would give us more flexibility
- this seems to be the preferred integration point for other systems, such as Cassandra
- also important to note that we will have both a Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks, and Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) to get the data to be processed.
- assumptions that we'll need to validate: this approach doesn't impose any constraints on how data is stored in Infinispan and should allow data to be read through the Map interface. Also, the InputFormat and OutputFormat implementations would only use the get(k) and keySet() methods, and no native Infinispan M/R, which means that C/S access should also be possible (see the sketch after this list).
- very important: Hadoop's HDFS is an append-only file system, and M/R tasks operate on a snapshot of the data. From a task's perspective, the data in storage doesn't change after the task is started; more data can be appended whilst the task runs, but it won't be visible to the task. Infinispan has neither such an append-only structure nor MVCC. The closest thing we have is the snapshot isolation transactions implemented by the Cloud-TM project (not yet integrated). I assume that M/R tasks are built with this snapshot-isolation requirement on the storage - this is something we should investigate as well. It is possible that, in the first stages of this integration, we would require data stored in Infinispan to be read-only.
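To make the integration point concrete, here is a minimal, non-distributed sketch of what an Infinispan-backed InputFormat could look like. Only the Hadoop interfaces are real; everything Infinispan-specific is an assumption, including the static cache holder, which stands in for reading connection details from the Hadoop Configuration. It deliberately uses nothing but keySet() and get(k), per the assumption above:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.infinispan.Cache;

    public class InfinispanInputFormat extends InputFormat<String, String> {

        // Hypothetical plumbing: a real implementation would obtain the
        // cache from the Hadoop Configuration (e.g. a Hot Rod RemoteCache
        // for C/S mode) rather than a static field.
        public static volatile Cache<String, String> cache;

        // Single trivial split; a real implementation would emit one split
        // per Infinispan segment/owner so Hadoop can parallelise and
        // co-locate work with data.
        public static class WholeCacheSplit extends InputSplit implements Writable {
            @Override public long getLength() { return 0; }
            @Override public String[] getLocations() { return new String[0]; }
            @Override public void write(DataOutput out) throws IOException { }
            @Override public void readFields(DataInput in) throws IOException { }
        }

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            splits.add(new WholeCacheSplit());
            return splits;
        }

        @Override
        public RecordReader<String, String> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new RecordReader<String, String>() {
                private Iterator<String> keys;
                private String key;
                private String value;

                @Override
                public void initialize(InputSplit s, TaskAttemptContext ctx) {
                    // keySet() over the whole grid may be expensive - one of
                    // the assumptions to validate.
                    keys = cache.keySet().iterator();
                }

                @Override
                public boolean nextKeyValue() {
                    if (!keys.hasNext()) return false;
                    key = keys.next();
                    value = cache.get(key); // only get(k) and keySet(), as above
                    return true;
                }

                @Override public String getCurrentKey() { return key; }
                @Override public String getCurrentValue() { return value; }
                @Override public float getProgress() { return 0f; }
                @Override public void close() { }
            };
        }
    }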
[1] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Inpu...
[2] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Outp...
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
Learning ISPN7 DataContainer internals ... first steps
by cotton-ben
Hi Mircea and RedHat,
Dmitry and I are now taking initial steps to code the integration of
OpenHFT SHM as an off-heap ISPN7 DataContainer.
We have mused that a possible approach may be to exploit the symmetry
of ISPN7's existing DefaultDataContainer.java impl and copy and/or
extend that existing work into a new
org.infinispan.offheap.OffHeapDefaultDataContainer.java impl.
The key step would be for us to soundly and completely replace

    ConcurrentMap entries =
            CollectionFactory.makeConcurrentParallelMap(128, concurrencyLevel);

with

    ConcurrentMap entries = new net.openhft.collections.SharedHashMapBuilder()
            .generatedValueType(Boolean.TRUE)
            .entrySize(512)
            .create(
                    new File("/dev/shm/offHeapSharedHashMap.DataContainer"),
                    Object.class,
                    InternalCacheEntry.class);
We are of course complete newbies with regard to the ISPN7
DataContainer internals. Before we get into building and testing
compelling exercises and hardening the OffHeapDefaultDataContainer,
would you please comment on whether these seem like the correct first
steps?
https://github.com/Cotton-Ben/infinispan/blob/master/off-heap/src/main/ja...
--
Infinispan Query API module
by Bilgin Ibryam
Hi all,
I was working on extending the camel-infinispan component with remote
query capability and just realized that
org.infinispan/infinispan-query/6.0.1.Final depends on hibernate-hql-parser
and hibernate-hql-lucene, which are still in Alpha.
Am I missing something, or is there a way to not depend on Alpha
versions of these artifacts from a Final version artifact?
Thanks,
--
Bilgin Ibryam
Apache Camel & Apache OFBiz committer
Blog: ofbizian.com
Twitter: @bibryam <https://twitter.com/bibryam>
Author of Instant Apache Camel Message Routing
http://www.amazon.com/dp/1783283475
Re: [infinispan-dev] Design change in Infinispan Query
by Mircea Markus
On Feb 3, 2014, at 9:32 AM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
> Sure, searching across any cache is useful. What I was advocating is that if you can search across more than one cache transparently, then you probably need CRUD across more than one cache transparently as well. And this is not being discussed.
Not sure what you mean by CRUD over multiple caches? ATM one can run a TX over multiple caches, but I think there's something else you have in mind :-)
>
> I have to admit that having to add a cache name to the stored elements of the index documents makes me a bit sad.
sad because of the increased index size?
> I was already unhappy when I had to do it for class names. Renaming a cache will be a heavy operation too.
> Sanne, if we know that we don't share the same index for different caches, can we avoid the need to store the cache name in each document?
>
> BTW, this discussion should be in the open.
+1
>
> On 31 Jan 2014, at 18:04, Adrian Nistor <anistor(a)gmail.com> wrote:
>
>> I think it conceptually makes sense to have one entity type per cache but this should be a good practice rather than an enforced constraint. It would be a bit late and difficult to add such a constraint now.
>>
>> The design change we are talking about is being able to search across caches. That can easily be implemented regardless of this. We can move the SearchManager from Cache scope to CacheManager scope. Indexes are bound to types not to caches anyway, so same-type entities from multiple caches can end up in the same index, we just need to store an extra hidden field: the name of the originating cache. This move would also allow us to share some lucene/hsearch resources.
>>
>> We can easily continue to support Search.getSearchManager(cache) so old api usages continue to work. This would return a delegating/decorating SearchManager that creates queries that are automatically restricted to the scope of the given cache.
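A rough sketch of that restriction at the Lucene level - the hidden field name is invented for illustration, and the real logic would live inside the delegating SearchManager rather than in user code:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public final class CacheScopedQuery {
        // Invented name for the hidden originating-cache field.
        static final String CACHE_NAME_FIELD = "__ispn_cache_name";

        // AND the user's query with a term on the hidden field, so results
        // are restricted to documents indexed from the given cache.
        static Query restrictToCache(Query userQuery, String cacheName) {
            BooleanQuery restricted = new BooleanQuery();
            restricted.add(userQuery, BooleanClause.Occur.MUST);
            restricted.add(new TermQuery(new Term(CACHE_NAME_FIELD, cacheName)),
                    BooleanClause.Occur.MUST);
            return restricted;
        }
    }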
>>
>> Piece of cake? :)
>>
>>
>>
>> On Thu, Jan 30, 2014 at 9:56 PM, Mircea Markus <mmarkus(a)redhat.com> wrote:
>> curious to see your thoughts on this: it is a recurring topic and will affect the way we design things in the future in a significant way.
>> E.g. if we think (recommend) that a distinct cache should be used for each entity, then we'll need querying to work between caches. Also some cache stores can be built along these lines (e.g. for the JPA cache store we only need it to support a single entity type).
>>
>> Begin forwarded message:
>>
>> > On Jan 30, 2014, at 9:42 AM, Galder Zamarreño <galder(a)redhat.com> wrote:
>> >
>> >>
>> >> On Jan 21, 2014, at 11:52 PM, Mircea Markus <mmarkus(a)redhat.com> wrote:
>> >>
>> >>>
>> >>> On Jan 15, 2014, at 1:42 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>> >>>
>> >>>> By the way, people looking for that feature are also asking for a unified Cache API accessing these several caches right? Otherwise I am not fully understanding why they ask for a unified query.
>> >>>> Do you have written detailed use cases somewhere for me to better understand what is really requested?
>> >>>
>> >>> IMO from a user perspective, being able to run queries spanning several caches simplifies the programming model: each cache corresponds to a single entity type, with potentially different configuration.
>> >>
>> >> Not sure if it simplifies things TBH if the configuration is the same. IMO, it just adds clutter.
>> >
>> > Not sure I follow: having a cache that contains both Cars and Persons sounds more cluttered to me. I think it's cumbersome to write any kind of query against a heterogeneous cache, e.g. Map/Reduce tasks that need to count all the green Cars would need to be aware of Persons and ignore them. Not only is it harder to write, but it discourages code reuse and makes it hard to maintain (if you add Pets to the same cache in the future, you need to update the M/R code as well). And of course there are also cache-level configuration options that are not immediately obvious at design time but will be in the future (there are more Persons than Cars, they live longer/expire etc.): mixing everything together in the same cache from the beginning is a design decision that might bite you in the future.
>> >
>> > The way I see it - and I'm very curious to hear your opinion on this - following a database analogy, the CacheManager corresponds to a Database and the Cache to a Table. Hence my thought that queries spanning multiple caches are both useful and needed (same as queries spanning multiple tables).
>> >
>> >
>>
>> Cheers,
>> --
>> Mircea Markus
>> Infinispan lead (www.infinispan.org)
>>
>>
>>
>>
>>
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
Proposed ISPN 7 compilation incompatibilities with ISPN 6
by William Burns
Recently, while working on some ISPN 7 features, I came across some
public API inconsistencies. I wanted to bring these up in case anyone
has concerns.
The first few are pretty trivial, but they can cause compilation
errors between versions if user code implements these interfaces and
specifies the type parameters.
1. The CacheWriter interface currently defines a delete(K key) method.
To be more in line with the JCache and java.util.collections
interfaces, I was hoping to change this to delete(Object key) instead.
2. The CacheLoader interface currently defines load(K key) and
contains(K key) methods. Similar to the above, I was hoping to change
the K type to Object, to be more in line with the JCache and
java.util.collections interfaces. Both changes are sketched below.
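Something like this - method lists trimmed to the ones under discussion, and return types are my reading of the current SPI rather than a spec:

    import org.infinispan.marshall.core.MarshalledEntry;

    interface CacheWriter<K, V> {
        boolean delete(Object key);   // was: boolean delete(K key)
    }

    interface CacheLoader<K, V> {
        MarshalledEntry<K, V> load(Object key);   // was: load(K key)
        boolean contains(Object key);             // was: contains(K key)
    }

This mirrors java.util.Map, where get(Object)/remove(Object) take Object keys even though the map itself is typed.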
This last one is a bit more major: we currently have 2 classes named
KeyFilter, one residing in the org.infinispan.notifications package
and another in the org.infinispan.persistence.spi.AdvancedCacheLoader
interface.
3. My plan is to consolidate these into a single class in a new core
org.infinispan.filter package. I would also move the new
KeyValueFilter class that was added for cluster listeners into this
package, along with the accompanying implementations (rough shape
below).
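Roughly - the accept signatures here are my reading of the existing classes, not a spec:

    import org.infinispan.metadata.Metadata;

    // Consolidated filters in the proposed org.infinispan.filter package.
    interface KeyFilter<K> {
        boolean accept(K key);
    }

    interface KeyValueFilter<K, V> {
        boolean accept(K key, V value, Metadata metadata);
    }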
The first 2 are currently implemented as changes in
https://github.com/infinispan/infinispan/pull/2423. The latter I was
going to add into the changes for
https://issues.jboss.org/browse/ISPN-4068.
Let me know what you guys think.
- Will
Never push with --force
by Sanne Grinovero
Yesterday I pushed a fix from Dan upstream, and this morning the fix
wasn't there anymore. Some unrelated fix was merged in the meantime.
I only realized this because I was updating my personal origin and git
wouldn't allow me to push the non-fast-forward branch, so in a sense I
could detect it because of how our workflow works (good).
I have no idea how it happened, but I guess it won't hurt to remind
everyone that we should never push with --force, at least not without
warning the whole list.
I've now cherry-picked and fixed master by re-pushing the missing
patch, so nothing bad happened :-)
Sanne
Infinispan HotRod C# Client 7.0.0.Alpha1
by Ion Savin
Hi all,
Infinispan HotRod C# Client 7.0.0.Alpha1 is now available.
This new version is a C# wrapper over the native client and brings
support for L2 and L3 client intelligence levels in addition to L1. As
more features are added to the native client they will make their way
into the C# client as well.
You can find the .msi installer on the download page [1] and the
source code on GitHub [2].
Please give it a try and let us know what you think.
[1] http://infinispan.org/hotrod-clients/
[2] https://github.com/infinispan/dotnet-client
Regards,
Ion Savin