Alongside recent talk of integrating infinispan with hadoop batch
processing, there has been some discussion of using the data grid
alongside an event stream processing system.
There are several directions we could consider here. In approximate
order of increasing complexity these are:
- Allow bi-directional flow of events, such that listeners on the cache
can be used to cause events in the processing engine, or events in the
processing engine can update the cache.
- Allow the cache to be used to hold lookup data for reference from user
code running the processing engine, to speed up joining streamed events
to what would otherwise be data tables on disk.
- Integrate with the processing engine itself, such that infinispan can
be used to store items that would otherwise occupy precious RAM. This
one is probably only viable with the cooperation of the stream
processing system, so I'll base further discussion on Drools Fusion.
The engine uses memory for a) rules, i.e. processing logic. Some of this
is infrequently accessed. Think of a decision tree in which some
branches are traversed more than others. So, opportunities to swap bits
out to cache perhaps. b) state, particularly sliding windows. Again
some data is infrequently accessed. For many sliding window calculations
in particular (e.g. running average), only the head and tail of the
window are actually used. The events in-between can be swapped out.
Of course these integrations require the stream processing engine to be
written to support such operations - careful handling of object
references is needed. Currently the engine doesn't work that way -
everything is focussed on speed at the expense of memory.
- Borrow some ideas from the event processing DSLs, such that the data
grid query engine can independently support continuous (standing)
queries rather than just one-off queries. Arguably this is reinventing
the wheel, but for simple use cases it may be preferable to run the
stream processing logic directly in the grid rather than deploying a
dedicated event stream processing system. I think it's probably going to
require supporting lists as a first class construct alongside maps
though. There are various cludges possible here, including the brute
force approach of faking continuous query by re-executing a one-off
query on each mutation, but they tend to be inefficient. There is also
the thorny problem of supporting a (potentially distributed) clock,
since a lot of use cases need to reference the passage of time in the
query e.g. 'send event to listener if avg in last N minutes > x'.
Jonathan Halliday
Core developer, JBoss.
--
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham (USA), Paul Hickey (Ireland), Matt Parson
(USA), Charlie Peters (USA)