[infinispan-dev] Hadoop and ISPN first and next steps

Monday, 23 June 2014

Hi all,

Last week Pedro, Mircea and I met in London to start prototyping the integration
between Hadoop and ISPN.

We discussed several scenarios where Hadoop and ISPN could work together, and
decided to start with ISPN server as the source and/or sink for a Hadoop Map Reduce job.

After creating an InputFormat and OutputFormat for ISPN [1], we generated some data [2]
and ran a sample job [3] using Hadoop v1.x, both in Docker [4] and on a 4-node physical
cluster (installed with the help of Puppet [5]).

We also ran the same job on the same cluster with the same data, but using HDFS as data
source and sink, so that we could verify correctness.

In this setup, each Hadoop slave runs the TaskTracker, DataNode and ISPN server, and the
idea was to generate a split [6] based on segments and redirect the map task to be
executed on the nodes associated with those segments. The routing and the filtering of the
data are still work in progress, carried out by Pedro.
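
To make the segment-based split idea concrete, here is a minimal sketch in plain Python. It
is not the code from [1]: the names build_splits, entries_for_split and segment_of, the
segment count and the CRC-based key-to-segment mapping are all assumptions made for
illustration, and ISPN's real hashing and ownership logic differ.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 8  # hypothetical segment count, chosen only for this sketch


def segment_of(key, num_segments=NUM_SEGMENTS):
    """Map a key to a segment (stand-in for the real consistent hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def build_splits(segment_owners):
    """Group segments by owning node: one split per node, with that node as its location."""
    by_node = defaultdict(list)
    for segment in sorted(segment_owners):
        by_node[segment_owners[segment]].append(segment)
    return [{"location": node, "segments": by_node[node]} for node in sorted(by_node)]


def entries_for_split(split, cache):
    """Filter the data so a map task only sees entries in its split's segments."""
    wanted = set(split["segments"])
    return {k: v for k, v in cache.items() if segment_of(k) in wanted}
```

With every segment assigned to exactly one node, the splits partition the data: each entry
is read by exactly one map task, and the scheduler can use the split location to place that
task on the node holding the data.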

Next steps? 

- Optimise the current Input/OutputFormat so that it can read and write data efficiently.
This will allow ISPN to become part of the Hadoop ecosystem and make it easier to integrate
with tools like Apache Hive [7] or Pig [8].
- Investigate closer integration for Map Reduce, potentially usable in library mode. As
you might know, YARN (the overhaul of Hadoop architecture) is not only about Map Reduce,
and it offers more extension points than Hadoop Map Reduce v1.
- I read with great interest the Spark paper [9]. Spark provides a DSL with functional
language constructs like map, flatMap and filter to process distributed data in memory. In
this scenario, Map Reduce is just a special case achieved by chaining functions [10]. As
Spark is much more than Map Reduce, and can run many machine learning algorithms
efficiently, I was wondering if we should shift attention to Spark rather than focusing
too much on Map Reduce. Thoughts?
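
To illustrate the point about Map Reduce being a special case of chained functions, here is
a minimal word-count sketch in plain Python. It does not use Spark: flat_map and
reduce_by_key are hypothetical local helpers that only mimic the shape of the operations
named above, and everything runs in a single process.

```python
from itertools import chain


def flat_map(fn, items):
    """Apply fn to each item and flatten the resulting sequences."""
    return chain.from_iterable(fn(item) for item in items)


def reduce_by_key(fn, pairs):
    """Combine the values of pairs that share a key using fn."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


def word_count(lines):
    """Word count as a chain: flat_map (split) -> map (pair) -> reduce_by_key (sum)."""
    words = flat_map(str.split, lines)           # the "map" phase emits words
    pairs = map(lambda word: (word, 1), words)   # each word becomes a (key, 1) pair
    return reduce_by_key(lambda a, b: a + b, pairs)  # the "reduce" phase sums per key
```

For example, word_count(["a b a", "b c"]) returns {"a": 2, "b": 2, "c": 1}.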

[1]
https://github.com/pruivo/infinispan-hadoop-integration/tree/master/src/m...
[2]
http://www.skorks.com/2010/03/how-to-quickly-generate-a-large-file-on-the...
[3]
https://github.com/pruivo/hadoop-wordcount-example/tree/master/src/main/j...
[4] https://github.com/gustavonalle/docker/tree/master/hadoop
[5] https://gist.github.com/gustavonalle/95dfdd771f31e1e2bf9d
[6] https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/Input...
[7] https://hive.apache.org/
[8] http://pig.apache.org/
[9] http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
[10] https://spark.apache.org/docs/0.9.0/quick-start.html#more-on-rdd-operations

Cheers,
Gustavo
