[infinispan-dev] Infinispan - Hadoop integration

Alan Field afield at redhat.com
Fri Mar 14 14:22:04 EDT 2014


Hey,

First off, I think integrating Infinispan with Hadoop through the InputFormat and OutputFormat interfaces, rather than through the file system interfaces, is a really good idea. The file system route is the approach Gluster uses, but I don't think it's ideal for Infinispan.
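
To make the idea concrete, here is a rough, untested sketch of what an Infinispan-backed InputFormat might look like. The class names (InfinispanInputFormat, InfinispanSplit) and the cache name are just placeholders I made up; a real implementation would return one split per Infinispan segment/owner rather than a single split, so that each Map task only reads local data.

// Rough sketch only: InfinispanInputFormat, InfinispanSplit and the
// "vehicle-cache" cache name are hypothetical, not existing Infinispan code.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

public class InfinispanInputFormat extends InputFormat<Text, Text> {

   // Single split for the whole cache to keep the sketch short; a real
   // implementation would return one split per segment/owner so that each
   // Map task only reads data that is local to its node.
   @Override
   public List<InputSplit> getSplits(JobContext context) {
      return Collections.<InputSplit>singletonList(new InfinispanSplit());
   }

   @Override
   public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
      return new RecordReader<Text, Text>() {
         private Cache<String, String> cache;
         private Iterator<Map.Entry<String, String>> it;
         private Map.Entry<String, String> current;

         @Override
         public void initialize(InputSplit s, TaskAttemptContext c) {
            // Embedded cache started inside the task JVM; in C/S mode a
            // Hot Rod client would be used here instead. The cache is
            // assumed to be defined in the default configuration.
            cache = new DefaultCacheManager().getCache("vehicle-cache");
            it = cache.entrySet().iterator();
         }

         @Override
         public boolean nextKeyValue() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
         }

         @Override
         public Text getCurrentKey() { return new Text(current.getKey()); }

         @Override
         public Text getCurrentValue() { return new Text(current.getValue()); }

         @Override
         public float getProgress() { return 0f; }

         @Override
         public void close() { cache.getCacheManager().stop(); }
      };
   }

   // Trivial split: no preferred locations and unknown length.
   public static class InfinispanSplit extends InputSplit implements Writable {
      @Override public long getLength() { return 0; }
      @Override public String[] getLocations() { return new String[0]; }
      @Override public void write(DataOutput out) throws IOException { }
      @Override public void readFields(DataInput in) throws IOException { }
   }
}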

> So to clarify, we will have a cluster of nodes where each node contains two
> JVMs, one running a Hadoop process and one running an Infinispan process. The
> Hadoop process would only read the data from the Infinispan process on the
> same node during a normal M/R execution.

A regular non-master Hadoop node runs a DataNode daemon and a TaskTracker daemon in separate JVMs. Each Map/Reduce Task is also executed in a separate JVM. In Hadoop 2.0, the TaskTracker daemon is replaced by a NodeManager daemon. This is the minimal number of JVM processes on a node, and there will be more if other services are running in the cluster. The first point I am trying to make is that there are many different JVM processes running on a Hadoop node. The second is that I am trying to understand which of those JVM processes you mean by the "Hadoop process". I *think* it refers to the actual Map/Reduce Task, but I'm not sure.
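
Either way, the piece that would actually touch Infinispan is the code running inside the Map/Reduce Task JVMs, since that is where the InputFormat gets instantiated. Roughly, a job driver would just wire it in like this (again a hedged sketch; InfinispanInputFormat is the hypothetical class from the previous sketch, and with no Mapper/Reducer set Hadoop's identity classes simply copy the cache entries to the output):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class InfinispanJobDriver {
   public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "infinispan-dump");
      job.setJarByClass(InfinispanJobDriver.class);

      // Input comes from Infinispan instead of HDFS; this class is loaded
      // and executed inside the Map Task JVMs, not in the DataNode or
      // TaskTracker/NodeManager daemons.
      job.setInputFormatClass(InfinispanInputFormat.class);

      // Plain text output to HDFS; an Infinispan-backed OutputFormat could
      // be plugged in here instead.
      job.setOutputFormatClass(TextOutputFormat.class);
      FileOutputFormat.setOutputPath(job, new Path(args[0]));

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}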

I would also say that figuring out how to integrate Infinispan with Apache Spark [1] would be interesting. I don't know as much about it, but a lot of processing currently done in Map/Reduce will be migrated to Spark. It's certainly being billed as the replacement for, or at least the future of, Map/Reduce.
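
One nice side effect of going the InputFormat route is that Spark can consume any org.apache.hadoop.mapreduce.InputFormat through its generic Hadoop bridge, so the same hypothetical class from above could in principle be reused there without a separate connector. A rough sketch (SparkConf, JavaSparkContext and newAPIHadoopRDD are real Spark APIs; the InputFormat is still the made-up one):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOverInfinispan {
   public static void main(String[] args) {
      SparkConf sparkConf = new SparkConf().setAppName("infinispan-spark");
      JavaSparkContext sc = new JavaSparkContext(sparkConf);

      // Spark can drive any Hadoop "new API" InputFormat directly, so an
      // InputFormat-based integration would cover Spark as well.
      JavaPairRDD<Text, Text> entries = sc.newAPIHadoopRDD(
            new Configuration(), InfinispanInputFormat.class, Text.class, Text.class);

      System.out.println("Entries read from Infinispan: " + entries.count());
      sc.stop();
   }
}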

Thanks,
Alan

[1] https://spark.apache.org/

----- Original Message -----
> From: "Mircea Markus" <mmarkus at redhat.com>
> To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
> Sent: Friday, March 14, 2014 11:26:03 AM
> Subject: Re: [infinispan-dev] Infinispan - Hadoop integration
> 
> 
> On Mar 14, 2014, at 9:06, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> 
> > 
> > 
> >> On 13 mars 2014, at 23:39, Sanne Grinovero <sanne at infinispan.org> wrote:
> >> 
> >>> On 13 March 2014 22:19, Mircea Markus <mmarkus at redhat.com> wrote:
> >>> 
> >>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne at infinispan.org> wrote:
> >>>> 
> >>>>> On 13 March 2014 22:05, Mircea Markus <mmarkus at redhat.com> wrote:
> >>>>> 
> >>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.justin at gmail.com> wrote:
> >>>>> 
> >>>>>>> - also important to notice that we will have both a Hadoop and an
> >>>>>>> Infinispan cluster running in parallel: the user will interact with
> >>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan
> >>>>>>> (integration achieved through InputFormat and OutputFormat ) in
> >>>>>>> order to get the data to be processed.
> >>>>>> 
> >>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as
> >>>>>> well -- hence 1JVM?
> >>>>> 
> >>>>> good point, ideally it should be a single VM: reduced serialization
> >>>>> cost (in-VM access) and simpler architecture. That's if you're not
> >>>>> using C/S mode, of course.
> >>>> 
> >>>> ?
> >>>> Don't try confusing us again on that :-)
> >>>> I think we agreed that the job would *always* run in strict locality
> >>>> with the data container (i.e. in the same JVM). Sure, a Hadoop client
> >>>> would be connecting from somewhere else but that's unrelated.
> >>> 
> >>> we did discuss the possibility of running it over hotrod though, do you
> >>> see a problem with that?
> >> 
> >> No of course not, we discussed that. I just mean I think that needs to
> >> be clarified on the list that the Hadoop engine will always run in the
> >> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
> >> native clients, or Hadoop clients over Hot Rod) can indeed connect
> >> remotely, but it's important to clarify that the processing itself
> >> will take advantage of locality in all configurations. In other words,
> >> to clarify that the serialization cost you mention for clients is just
> >> to transfer the job definition and optionally the final processing
> >> result.
> >> 
> > 
> > Not quite. The serialization cost Mircea mentions is, I think, between the
> > Hadoop JVM and the Infinispan JVM on a single node. The serialization does
> > not require network traffic but is still basically shuffling data between
> > two processes. We could eliminate it by starting both Hadoop and Infinispan
> > from the same JVM, but that requires more work than necessary for a
> > prototype.
> 
> thanks for the clarification, indeed this is the serialization overhead I had
> in mind.
> 
> > 
> > So to clarify, we will have a cluster of nodes where each node contains two
> > JVMs, one running a Hadoop process and one running an Infinispan process. The
> > Hadoop process would only read the data from the Infinispan process on the
> > same node during a normal M/R execution.
> 
> Cheers,
> --
> Mircea Markus
> Infinispan lead (www.infinispan.org)
> 
> 
> 
> 
> 
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> 
