Hi,
I had a very good conversation with Jonathan Halliday, Sanne and Emmanuel around the
integration between Infinispan and Hadoop. Just to recap, the goal is to be able to run
Hadoop M/R tasks on data that is stored in Infinispan in order to gain speed. (Once we
have a prototype in place, one of the first tasks will be to validate these speed
assumptions.)
In previous discussions we explored the idea of providing an HDFS implementation for
Infinispan, which whilst doable, might not be the best integration point:
- in order to run M/R jobs, hadoop interacts with two interfaces: InputFormat[1] and
OutputFormat[2]
- it's the specific InputFormat and OutputFormat implementations that work on top of
HDFS
- instead of implementing HDFS, we could provide implementations for the InputFormat and
OutputFormat interfaces, which would give us more flexibility
- this seems to be the preferred integration point for other systems, such as Cassandra
- also important to notice that we will have both a Hadoop and an Infinispan cluster
running in parallel: the user will interact with the former in order to run M/R tasks.
Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) in
order to get the data to be processed.
- assumptions that we'll need to validate: this approach doesn't impose any
constraint on how data is stored in Infinispan and should allow data to be read through
the Map interface. Also, the InputFormat and OutputFormat implementations would only use
the get(k) and keySet() methods, and no native Infinispan M/R, which means that C/S
access should also be possible.
- very important: Hadoop's HDFS is an append-only file system, and M/R tasks operate on
a snapshot of the data. From a task's perspective, the data in storage
doesn't change after the task is started. More data can be appended whilst the task
runs, but this won't be visible to the task. Infinispan doesn't have such an
append-only structure, nor MVCC. The closest thing we have is the snapshot-isolation
transactions implemented by the CloudTM project (this is not integrated yet). I assume
that the M/R tasks are built with this snapshot-isolation requirement on the storage -
this is something that we should investigate as well. It is possible that, in the first
stages of this integration, we would require data stored in Infinispan to be read-only.
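To make the proposed integration point concrete, here is a minimal, self-contained sketch of the idea behind an Infinispan-backed InputFormat: split the cache's keySet() into per-mapper partitions, then have each "record reader" fetch values with get(k). All names here are hypothetical simplifications for illustration (a real implementation would extend Hadoop's org.apache.hadoop.mapreduce.InputFormat and RecordReader, and read from an Infinispan Cache rather than a plain Map):

```java
import java.util.*;

// Hypothetical sketch of the proposed InputFormat-style integration.
// A plain Map stands in for the Infinispan cache; only keySet() and
// get(k) are used, matching the assumption stated above.
public class CacheInputSketch {

    // A "split" is a subset of the cache's keys assigned to one mapper
    // (round-robin here; a real implementation would split by data locality).
    static List<List<String>> computeSplits(Map<String, String> cache, int numSplits) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) splits.add(new ArrayList<>());
        int i = 0;
        for (String key : cache.keySet()) {            // only keySet() is used
            splits.get(i++ % numSplits).add(key);
        }
        return splits;
    }

    // The "record reader" for one split: fetch each value with get(k).
    static List<Map.Entry<String, String>> readSplit(Map<String, String> cache,
                                                     List<String> split) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String key : split) {
            records.add(new AbstractMap.SimpleEntry<>(key, cache.get(key)));
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("k1", "v1");
        cache.put("k2", "v2");
        cache.put("k3", "v3");

        List<List<String>> splits = computeSplits(cache, 2);
        int total = 0;
        for (List<String> split : splits) {
            total += readSplit(cache, split).size();
        }
        System.out.println(total); // prints 3: every entry is covered exactly once
    }
}
```

Note that this sketch implicitly relies on the snapshot assumption discussed above: if the cache changes between computeSplits() and readSplit(), the task may see missing or inconsistent entries, which is exactly the isolation question we need to investigate.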
[1]
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Inpu...
[2]
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Outp...
Cheers,
--
Mircea Markus
Infinispan lead (
www.infinispan.org)