[infinispan-dev] Infinispan - Hadoop integration

Mircea Markus mmarkus at redhat.com
Thu Mar 13 15:01:35 EDT 2014


Hi,

I had a very good conversation with Jonathan Halliday, Sanne and Emmanuel around the integration between Infinispan and Hadoop. Just to recap, the goal is to be able to run Hadoop M/R tasks on data that is stored in Infinispan, in order to gain speed (once we have a prototype in place, one of the first tasks will be to validate this speed assumption).

In previous discussions we explored the idea of providing an HDFS implementation for Infinispan, which, whilst doable, might not be the best integration point:
- in order to run M/R jobs, Hadoop interacts with two interfaces: InputFormat [1] and OutputFormat [2]
- it's the specific InputFormat and OutputFormat implementations that work on top of HDFS
- instead of implementing HDFS, we could provide implementations for the InputFormat and OutputFormat interfaces, which would give us more flexibility
- this seems to be the preferred integration point for other systems, such as Cassandra
- also important to note: we will have both a Hadoop and an Infinispan cluster running in parallel. The user will interact with the former in order to run M/R tasks; Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) in order to get the data to be processed.
- assumptions that we'll need to validate: this approach doesn't impose any constraint on how data is stored in Infinispan and should allow data to be read through the Map interface. Also, the InputFormat and OutputFormat implementations would only use the get(k) and keySet() methods, and no native Infinispan M/R, which means that client/server access should also be possible.
- very important: Hadoop's HDFS is an append-only file system, and the M/R tasks operate on a snapshot of the data. From a task's perspective, none of the data in the storage changes after the task is started. More data can be appended whilst the task runs, but this won't be visible to the task. Infinispan has neither such an append-only structure nor MVCC. The closest thing we have is the snapshot-isolation transactions implemented by the CloudTM project (not yet integrated). I assume that M/R tasks are built with this snapshot-isolation requirement on the storage - this is something that we should investigate as well. It is possible that, in the first stages of this integration, we would require data stored in Infinispan to be read-only.
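To make the proposed integration point a bit more concrete, here is a rough sketch of the split/read logic an Infinispan-backed InputFormat could follow. It deliberately uses neither the real Hadoop InputFormat/RecordReader APIs nor the Infinispan Cache API - the class and method shapes below are hypothetical stand-ins - but it relies only on the keySet() and get(k) access described above: partition the cache's key set into splits (analogous to InputFormat.getSplits()), then materialize each split's records via get(k) (analogous to what a RecordReader would feed to the map tasks).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of how an Infinispan-backed InputFormat could derive
// splits and records using only keySet() and get(k). The real Hadoop and
// Infinispan interfaces are intentionally not used here.
public class CacheInputSketch {

    // Partition the cache's key set into splits of at most splitSize keys,
    // mirroring the role of InputFormat.getSplits(). A split is just a key list.
    public static <K> List<List<K>> getSplits(Map<K, ?> cache, int splitSize) {
        List<List<K>> splits = new ArrayList<>();
        List<K> current = new ArrayList<>();
        for (K key : cache.keySet()) {
            current.add(key);
            if (current.size() == splitSize) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    // Read one split's records via get(k), mirroring the role of a
    // RecordReader handing (key, value) pairs to a map task.
    public static <K, V> List<Map.Entry<K, V>> readSplit(Map<K, V> cache,
                                                         List<K> split) {
        List<Map.Entry<K, V>> records = new ArrayList<>();
        for (K key : split) {
            records.add(Map.entry(key, cache.get(key)));
        }
        return records;
    }

    public static void main(String[] args) {
        // Stand-in for an Infinispan cache: 10 entries k0..k9.
        Map<String, Integer> cache = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            cache.put("k" + i, i);
        }

        List<List<String>> splits = getSplits(cache, 4);
        // 10 keys with splitSize 4 -> 3 splits (4 + 4 + 2 keys)
        System.out.println(splits.size());                          // 3
        System.out.println(readSplit(cache, splits.get(0)).size()); // 4
    }
}
```

Note that this sketch sidesteps the snapshot question entirely: keySet() is iterated once up front, but get(k) still reads live data, which is exactly why the read-only (or snapshot-isolation) constraint discussed above matters.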

[1] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/InputFormat.html
[2] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/OutputFormat.html


Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)






