[infinispan-dev] Infinispan - Hadoop integration
Alan Field
afield at redhat.com
Mon Mar 17 12:06:40 EDT 2014
For Map/Reduce v1 this is definitely the case:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Task+Execution+%26+Environment
" The TaskTracker executes the Mapper / Reducer task as a child process in a separate jvm. "
I believe this is also the case for Map/Reduce v2, but I haven't found a definitive reference in the docs yet. YARN is architected to split resource management and job scheduling/monitoring into different pieces, but I think task execution is the same as MRv1.
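The "child process in a separate jvm" behavior can be seen in miniature with a stdlib-only sketch. This is purely illustrative, not Hadoop code, and it assumes a `java` launcher is available on the PATH:

```java
// Illustrative only: not Hadoop code. Mimics, in miniature, what the
// TaskTracker does when it runs a Mapper/Reducer task "as a child process
// in a separate jvm". Assumes `java` is on the PATH.
public class SeparateJvmDemo {
    public static void main(String[] args) throws Exception {
        long parentPid = ProcessHandle.current().pid();
        // Fork a second JVM, as the TaskTracker forks one per task attempt.
        Process child = new ProcessBuilder("java", "-version")
                .redirectErrorStream(true)
                .start();
        long childPid = child.pid();
        child.waitFor();
        // Two distinct OS processes, hence two distinct heaps and classpaths.
        System.out.println("separate=" + (parentPid != childPid));
    }
}
```

The consequence for the integration discussion below is that anything the task needs from Infinispan either crosses a process boundary or Infinispan must live inside that child JVM.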
Thanks,
Alan
----- Original Message -----
> From: "Emmanuel Bernard" <emmanuel at hibernate.org>
> To: "infinispan-dev List" <infinispan-dev at lists.jboss.org>
> Sent: Monday, March 17, 2014 11:31:34 AM
> Subject: Re: [infinispan-dev] Infinispan - Hadoop integration
> Got it now.
> That being said, if Alan is correct (one JVM per M/R task run per node), we
> will need to implement C/S local key and keyset lookup.
> Emmanuel
> On 14 Mar 2014, at 12:34, Sanne Grinovero <sanne at infinispan.org> wrote:
> > On 14 March 2014 09:06, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> > > On 13 March 2014, at 23:39, Sanne Grinovero <sanne at infinispan.org> wrote:
> > > > On 13 March 2014 22:19, Mircea Markus <mmarkus at redhat.com> wrote:
> > > > > On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne at infinispan.org> wrote:
> > > > > > On 13 March 2014 22:05, Mircea Markus <mmarkus at redhat.com> wrote:
> > > > > > > On Mar 13, 2014, at 20:59, Ales Justin <ales.justin at gmail.com> wrote:
> > > > > > > > > - also important to notice that we will have both a Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks. Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) in order to get the data to be processed.
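A toy model may help picture the InputFormat side of that integration: a split per Infinispan node, each carrying the cache segments that node owns, so each Mapper reads only local data. Everything below (the `NodeSplit` type, the segment-to-owner view) is an illustrative assumption, not existing Infinispan or Hadoop API:

```java
import java.util.*;

public class LocalitySplitsSketch {
    // Illustrative stand-in for a Hadoop InputSplit: one per Infinispan node,
    // carrying the cache segments that node owns locally. Hypothetical type.
    record NodeSplit(String node, List<Integer> segments) {}

    // Derive one split per owning node from a segment->owner view of the cache.
    static List<NodeSplit> getSplits(Map<Integer, String> segmentOwners) {
        Map<String, List<Integer>> byNode = new TreeMap<>();
        segmentOwners.forEach((segment, node) ->
                byNode.computeIfAbsent(node, n -> new ArrayList<>()).add(segment));
        List<NodeSplit> splits = new ArrayList<>();
        byNode.forEach((node, segments) -> {
            Collections.sort(segments);
            splits.add(new NodeSplit(node, segments));
        });
        return splits;
    }

    public static void main(String[] args) {
        // Assume a toy cache with 4 segments distributed over two nodes.
        Map<Integer, String> owners =
                Map.of(0, "nodeA", 1, "nodeB", 2, "nodeA", 3, "nodeB");
        for (NodeSplit s : getSplits(owners)) {
            System.out.println(s.node() + " -> " + s.segments());
        }
    }
}
```

With splits shaped like this, Hadoop's scheduler can place each task on (or next to) the node that already holds the data, which is the locality property debated below.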
> > > > > > > > Would this be 2 JVMs, or can you trick Hadoop to start Infinispan as well -- hence 1 JVM?
> > > > > > > good point, ideally it should be a single VM: reduced serialization cost (in-VM access) and simpler architecture. That's if you're not using C/S mode, of course.
> > > > > > ?
> > > > > > Don't try confusing us again on that :-)
> > > > > > I think we agreed that the job would *always* run in strict locality with the data container (i.e. in the same JVM). Sure, a Hadoop client would be connecting from somewhere else, but that's unrelated.
> > > > > we did discuss the possibility of running it over Hot Rod though, do you see a problem with that?
> > > > No, of course not, we discussed that. I just mean I think it needs to be clarified on the list that the Hadoop engine will always run in the same JVM. Clients (be it Hot Rod via new custom commands, Hadoop native clients, or Hadoop clients over Hot Rod) can indeed connect remotely, but it's important to clarify that the processing itself will take advantage of locality in all configurations. In other words, the serialization cost you mention for clients is just to transfer the job definition and optionally the final processing result.
> > > Not quite. The serialization cost Mircea mentions is, I think, between the Hadoop VM and the Infinispan VM on a single node. The serialization does not require network traffic but is still shuffling data between two processes, basically. We could eliminate this by starting both Hadoop and Infinispan from the same VM, but that requires more work than necessary for a prototype.
> > Ok, so there was indeed confusion on terminology: I don't agree with that design.
> > From an implementor's effort perspective, having to set up a Hot Rod client rather than embedding an Infinispan node is approximately the same work, or slightly more as you have to start both. Also, to test it, embedded mode is easier.
> > Hot Rod is not meant to be used on the same node, especially not if you only want to access data in strict locality; for example, it wouldn't be able to iterate over all keys of the current server node (limiting to those keys only). I might be wrong as I'm not too familiar with Hot Rod, but I think it might not even be able to iterate over keys at all; maybe today it can actually via some trick, but the point is that this is a conceptual mismatch for it.
> > Where you say this doesn't require network traffic, you need to consider that while it's true this might not be using the physical network wire, being localhost, it would still be transferred over a costly network stream, as we don't do off-heap buffer sharing yet.
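The per-entry overhead being discussed can be made concrete: even on localhost, each entry must be serialized before crossing the process boundary, whereas a shared JVM hands over a plain reference. A stdlib-only sketch (the `Entry` type is an illustrative assumption, not an Infinispan class):

```java
import java.io.*;

public class SerializationCostSketch {
    // Hypothetical cache entry, stand-in for whatever the M/R task consumes.
    record Entry(String key, String value) implements Serializable {}

    // What a two-process design pays per entry: a full Java serialization
    // round through a byte stream, before any kernel/loopback copying.
    static byte[] serialize(Entry e) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Entry e = new Entry("user:42", "some cached value");
        byte[] wire = serialize(e);
        // In-VM handoff moves only a reference; cross-process handoff moves
        // every one of these bytes, per entry, through a stream.
        System.out.println("serialized bytes per entry: " + wire.length);
        System.out.println("larger than payload: "
                + (wire.length > (e.key().length() + e.value().length())));
    }
}
```

The stream framing alone exceeds the payload for small entries, which is why the cost scales badly when every key/value of a large cache is shuffled between two local processes.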
> > > So to clarify, we will have a cluster of nodes where each node contains two JVMs, one running a Hadoop process and one running an Infinispan process. The Hadoop process would only read the data from the Infinispan process on the same node during a normal M/R execution.
> > So we discussed two use cases:
> > - engage Infinispan to accelerate an existing Hadoop deployment
> > - engage Hadoop to run a Hadoop job on existing data in Infinispan
> > In neither case do I see why I'd run them in separate JVMs: it seems less effective and more work to get done, with no benefit unless you're thinking about independent JVM tuning? That might be something to consider, but I doubt tuning independence would ever offset the cost of serialized transfer of each entry.
> > The second use case could be used via Hot Rod too, but that's a different discussion, actually just a nice side effect of Hadoop being language agnostic that we would take advantage of.
> > Sanne
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev