For Map/Reduce v1 this is definitely the case:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Task+Execution...
"The TaskTracker executes the Mapper/Reducer task as a child process in a separate jvm."
I believe this is also the case for Map/Reduce v2, but I haven't found a definitive
reference in the docs yet. YARN is architected to split resource management and job
scheduling/monitoring into different pieces, but I think task execution is the same as
MRv1.
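As a rough illustration of why the separate task JVM matters for this thread: an object living in one JVM's heap cannot be handed to another JVM by reference, so each entry has to be turned into bytes and rebuilt on the other side. The sketch below is plain JDK code, not Hadoop or Infinispan API. The class name CrossJvmCopySketch and the helper copyThroughBytes are made up for illustration, and an in-memory byte array stands in for the pipe or socket that would sit between the two processes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class CrossJvmCopySketch {

    // Hypothetical helper: simulates handing an entry to another JVM by
    // serializing it to bytes and deserializing it again.
    static Map.Entry<String, String> copyThroughBytes(Map.Entry<String, String> entry)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(entry);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            Map.Entry<String, String> copy = (Map.Entry<String, String>) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        Map.Entry<String, String> entry = new SimpleEntry<>("key", "value");

        // Same JVM: the task can use the very same object, no copy involved.
        Map.Entry<String, String> inVm = entry;

        // Separate JVM (simulated): the task only ever sees a rebuilt copy.
        Map.Entry<String, String> crossVm = copyThroughBytes(entry);

        System.out.println("in-VM same object: " + (inVm == entry));
        System.out.println("cross-VM same object: " + (crossVm == entry));
        System.out.println("cross-VM equal content: " + crossVm.equals(entry));
    }
}
```

The copy is equal in content but is a distinct object, and that per-entry encode/decode step is the cost being debated further down the thread.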
Thanks,
Alan
----- Original Message -----
From: "Emmanuel Bernard" <emmanuel(a)hibernate.org>
To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
Sent: Monday, March 17, 2014 11:31:34 AM
Subject: Re: [infinispan-dev] Infinispan - Hadoop integration
Got it now.
That being said, if Alan is correct (one JVM per M/R task run per node), we
will need to implement C/S local key and keyset lookup.
Emmanuel
On 14 Mar 2014, at 12:34, Sanne Grinovero <sanne(a)infinispan.org> wrote:
> On 14 March 2014 09:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
> > > On 13 Mar 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote:
> > > > On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
> > > > > On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote:
> > > > > > On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote:
> > > > > > > On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:
> > > > > > > > > - also important to notice that we will have both a Hadoop
> > > > > > > > > and an Infinispan cluster running in parallel: the user will
> > > > > > > > > interact with the former in order to run M/R tasks. Hadoop will
> > > > > > > > > use Infinispan (integration achieved through InputFormat and
> > > > > > > > > OutputFormat) in order to get the data to be processed.
> > > > > > > > Would this be 2 JVMs, or can you trick Hadoop to start Infinispan
> > > > > > > > as well -- hence 1 JVM?
> > > > > > > good point, ideally it should be a single VM: reduced serialization
> > > > > > > cost (in-VM access) and simpler architecture. That's if you're not
> > > > > > > using C/S mode, of course.
> > > > > ?
> > > >
> > >
> >
>
> > > > > Don't try confusing us again on that :-)
> > > >
> > >
> >
>
> > > > > I think we agreed that the job would *always* run in strict
> > > > > locality
> > > >
> > >
> >
>
> > > > > with the datacontainer (i.e. in the same JVM). Sure, an Hadoop
> > > > > client
> > > >
> > >
> >
>
> > > > > would be connecting from somewhere else but that's
unrelated.
> > > >
> > >
> >
>
> > > > > we did discuss the possibility of running it over hotrod though, do
> > > > > you see a problem with that?
> > > > No of course not, we discussed that. I just mean I think it needs to
> > > > be clarified on the list that the Hadoop engine will always run in the
> > > > same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
> > > > native clients, or Hadoop clients over Hot Rod) can indeed connect
> > > > remotely, but it's important to clarify that the processing itself
> > > > will take advantage of locality in all configurations. In other words,
> > > > to clarify that the serialization cost you mention for clients is just
> > > > to transfer the job definition and optionally the final processing
> > > > result.
> > > Not quite. The serialization cost Mircea mentions is, I think, between the
> > > Hadoop VM and the Infinispan VM on a single node. The serialization does
> > > not require network traffic but is still basically shuffling data between
> > > two processes. We could eliminate this by starting both Hadoop and
> > > Infinispan from the same VM, but that requires more work than necessary
> > > for a prototype.
>
> Ok so there was indeed confusion on terminology: I don't agree with that
> design.
>
> From an implementor's effort perspective, having to set up a Hot Rod
> client rather than embedding an Infinispan node is approximately the
> same work, or slightly more as you have to start both. Also, embedded
> mode is easier to test.
>
> Hot Rod is not meant to be used on the same node, especially not if
> you only want to access data in strict locality; for example it
> wouldn't be able to iterate on all keys of the current server node
> (limiting itself to those keys only). I might be wrong as I'm not too
> familiar with Hot Rod, but I think it might not even be able to
> iterate on keys at all; maybe today it actually can via some trick,
> but the point is this is a conceptual mismatch for it.
>
> Where you say this doesn't require network traffic you need to consider
> that while it's true this might not be using the physical network wire,
> being localhost, it would still be transferred over a costly network
> stream, as we don't do off-heap buffer sharing yet.
>
> > So to clarify, we will have a cluster of nodes where each node contains
> > two JVMs, one running a Hadoop process, one running an Infinispan process.
> > The Hadoop process would only read the data from the Infinispan process in
> > the same node during a normal M/R execution.
>
> So we discussed two use cases:
> - engage Infinispan to accelerate an existing Hadoop deployment
> - engage Hadoop to run a Hadoop job on existing data in Infinispan
>
> In neither case do I see why I'd run them in separate JVMs: it seems less
> effective and more work to get done, with no benefit unless you're
> thinking about independent JVM tuning? That might be something to
> consider, but I doubt tuning independence would ever offset the cost
> of serialized transfer of each entry.
>
> The second use case could be used via Hot Rod too, but that's a
> different discussion, actually just a nice side effect of Hadoop being
> language agnostic that we would take advantage of.
>
> Sanne
> > _______________________________________________
> > infinispan-dev mailing list
> > infinispan-dev(a)lists.jboss.org
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev