[infinispan-dev] Infinispan - Hadoop integration

Mon Mar 17 12:06:40 EDT 2014

For Map/Reduce v1 this is definitely the case: 

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Task+Execution+%26+Environment 

" The TaskTracker executes the Mapper / Reducer task as a child process in a separate jvm. " 

I believe this is also the case for Map/Reduce v2, but I haven't found a definitive reference in the docs yet. YARN is architected to split resource management and job scheduling/monitoring into different pieces, but I think task execution is the same as MRv1. 

Thanks, 
Alan 

----- Original Message -----

> From: "Emmanuel Bernard" <emmanuel at hibernate.org>
> To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
> Sent: Monday, March 17, 2014 11:31:34 AM
> Subject: Re: [infinispan-dev] Infinispan - Hadoop integration

> Got it now.
> That being said, if Alan is correct (one JVM per M/R task run per node), we
> will need to implement C/S local key and keyset lookup.

> Emmanuel

> On 14 Mar 2014, at 12:34, Sanne Grinovero < sanne at infinispan.org > wrote:

> > On 14 March 2014 09:06, Emmanuel Bernard < emmanuel at hibernate.org > wrote:
> 

> > > > On 13 mars 2014, at 23:39, Sanne Grinovero < sanne at infinispan.org >
> > > > wrote:
> > > 
> > 
> 

> > > > > On 13 March 2014 22:19, Mircea Markus < mmarkus at redhat.com > wrote:
> > > > 
> > > 
> > 
> 

> > > > > > On Mar 13, 2014, at 22:17, Sanne Grinovero < sanne at infinispan.org >
> > > > > > wrote:
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > On 13 March 2014 22:05, Mircea Markus < mmarkus at redhat.com >
> > > > > > > wrote:
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > On Mar 13, 2014, at 20:59, Ales Justin < ales.justin at gmail.com >
> > > > > > > wrote:
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > > - also important to notice that we will have both an Hadoop
> > > > > > > > > and
> > > > > > > > > an
> > > > > > > > > Infinispan
> > > > > > > > > cluster running in parallel: the user will interact with the
> > > > > > > > > former
> > > > > > > > > in
> > > > > > > > > order
> > > > > > > > > to run M/R tasks. Hadoop will use Infinispan (integration
> > > > > > > > > achieved
> > > > > > > > > through
> > > > > > > > > InputFormat and OutputFormat ) in order to get the data to be
> > > > > > > > > processed.
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > Would this be 2 JVMs, or you can trick Hadoop to start
> > > > > > > > Infinispan
> > > > > > > > as
> > > > > > > > well
> > > > > > > > --
> > > > > > > > hence 1JVM?
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > good point, ideally it should be a single VM: reduced
> > > > > > > serialization
> > > > > > > cost
> > > > > > > (in
> > > > > > > vm access) and simpler architecture. That's if you're not using
> > > > > > > C/S
> > > > > > > mode,
> > > > > > > of
> > > > > > > course.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > ?
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Don't try confusing us again on that :-)
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > I think we agreed that the job would *always* run in strict
> > > > > > locality
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > with the datacontainer (i.e. in the same JVM). Sure, an Hadoop
> > > > > > client
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > would be connecting from somewhere else but that's unrelated.
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > we did discuss the possibility of running it over hotrod though, do
> > > > > you
> > > > > see
> > > > > a
> > > > > problem with that?
> > > > 
> > > 
> > 
> 

> > > > No of course not, we discussed that. I just mean I think that needs to
> > > 
> > 
> 
> > > > be clarified on the list that the Hadoop engine will always run in the
> > > 
> > 
> 
> > > > same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
> > > 
> > 
> 
> > > > native clients, or Hadoop clients over Hot Rod) can indeed connect
> > > 
> > 
> 
> > > > remotely, but it's important to clarify that the processing itself
> > > 
> > 
> 
> > > > will take advantage of locality in all configurations. In other words,
> > > 
> > 
> 
> > > > to clarify that the serialization cost you mention for clients is just
> > > 
> > 
> 
> > > > to transfer the job definition and optionally the final processing
> > > 
> > 
> 
> > > > result.
> > > 
> > 
> 

> > > Not quite. The serialization cost Mircea mentions I think is between the
> > > Hadoop vm and the Infinispan vm on a single node. The serialization does
> > > not
> > > require network traffic but is still shuffling data between two processes
> > > basically. We could eliminate this by starting both Hadoop and Infinispan
> > > from the same VM but that requires more work than necessary for a
> > > prototype.
> > 
> 

> > Ok so there was indeed confusion on terminology: I don't agree with that
> > design.
> 

> > > From an implementor's effort perspective having to setup an Hot Rod
> > 
> 

> > client rather than embedding an Infinispan node is approximately the
> 
> > same work, or slightly more as you have to start both. Also to test
> 
> > it, embedded mode it easier.
> 

> > Hot Rod is not meant to be used on the same node, especially not if
> 
> > you only want to access data in strict locality; for example it
> 
> > wouldn't be able to iterated on all keys of the current server node
> 
> > (and limiting to those keys only). I might be wrong as I'm not too
> 
> > familiar with Hot Rod, but I think it might not even be able to
> 
> > iterate on keys at all; maybe today it can actually via some trick,
> 
> > but the point is this is a conceptual mismatch for it.
> 

> > Where you say this doesn't require nework traffic you need to consider
> 
> > that while it's true this might not be using the physical network wire
> 
> > being localhost, it would still be transferred over a costly network
> 
> > stream, as we don't do off-heap buffer sharing yet.
> 

> > > So to clarify, we will have a cluster of nodes where each node contains
> > > two
> > > JVM, one running an Hadoop process, one running an Infinispan process.
> > > The
> > > Hadoop process would only read the data from the Infinispan process in
> > > the
> > > same node during a normal M/R execution.
> > 
> 

> > So we discussed two use cases:
> 
> > - engage Infinispan to accelerate an existing Hadoop deployment
> 
> > - engage Hadoop to run an Hadoop job on existing data in Infinispan
> 
> > In neither case I see why I'd run them in separate JVMs: seems less
> 
> > effective and more work to get done, and no benefit unless you're
> 
> > thinking about independent JVM tuning? That might be something to
> 
> > consider, but I doubt tuning independence would ever offset the cost
> 
> > of serialized transfer of each entry.
> 

> > The second use case could be used via Hot Rod too, but that's a
> 
> > different discussion, actually just a nice side effect of Hadoop being
> 
> > language agnostic that we would take advantage of.
> 

> > Sanne
> 

> > > _______________________________________________
> > 
> 
> > > infinispan-dev mailing list
> > 
> 
> > > infinispan-dev at lists.jboss.org
> > 
> 
> > > https://lists.jboss.org/mailman/listinfo/infinispan-dev
> > 
> 

> > _______________________________________________
> 
> > infinispan-dev mailing list
> 
> > infinispan-dev at lists.jboss.org
> 
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
> 

> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.jboss.org/pipermail/infinispan-dev/attachments/20140317/9e280355/attachment-0001.html