Yes, MapReduce v1 by default launches one *new* JVM per task, so during the
execution of a given job, at any moment a node could have dozens of JVMs
running in parallel, each destroyed when its task (map or reduce)
finishes. It is possible to instruct the MapReduce framework to reuse the
same JVM for several map or reduce tasks: this is interesting when map
tasks execute in a matter of seconds and the overhead of creating,
warming up and destroying a JVM becomes significant. But even in this
case there will be 'n' JVMs running, where 'n' is the task capacity of
the node. The difference is that they are recycled.
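For reference, in MRv1 that reuse is controlled by a single job-level property; a minimal configuration sketch (values are illustrative -- 1 is the default, a value of -1 means a JVM may be recycled for an unlimited number of tasks of the same job, never across jobs):

```xml
<!-- MRv1 JVM reuse: how many tasks a spawned JVM may run before being
     destroyed. 1 (default) = one fresh JVM per task; N > 1 = up to N
     tasks per JVM; -1 = no limit within the job. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```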
In YARN the behaviour is similar: the YarnChild runs in a separate JVM,
and you can get a form of reuse for small jobs by setting the property
"mapreduce.job.ubertask.enable", which runs all of the job's tasks inside
the ApplicationMaster JVM instead of spawning one JVM per task.
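A hedged sketch of that YARN "uber mode" configuration (the threshold values shown are the usual defaults; a job only qualifies if it is small enough):

```xml
<!-- Run small jobs "uberized": all tasks execute sequentially inside the
     MRAppMaster JVM instead of one YarnChild JVM per task. -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<!-- Size thresholds a job must stay under to be uberized. -->
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```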
Apart from all those transient task JVMs, there will be more long-running
JVMs on each node, such as the TaskTracker (which accepts tasks and sends
status to the global JobTracker), and if HDFS is used there will be one
extra JVM per node (the DataNode) plus one or two global NameNode processes.
Hadoop is very fond of JVMs.
Cheers,
Gustavo
On Mon, Mar 17, 2014 at 4:06 PM, Alan Field <afield(a)redhat.com> wrote:
For Map/Reduce v1 this is definitely the case:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Task+Execution...
"The TaskTracker executes the Mapper/ Reducer *task* as a child process
in a separate jvm."
I believe this is also the case for Map/Reduce v2, but I haven't found a
definitive reference in the docs yet. YARN is architected to split resource
management and job scheduling/monitoring into different pieces, but I think
task execution is the same as MRv1.
Thanks,
Alan
------------------------------
*From: *"Emmanuel Bernard" <emmanuel(a)hibernate.org>
*To: *"infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
*Sent: *Monday, March 17, 2014 11:31:34 AM
*Subject: *Re: [infinispan-dev] Infinispan - Hadoop integration
Got it now.
That being said, if Alan is correct (one JVM per M/R task run per node),
we will need to implement C/S local key and keyset lookup.
Emmanuel
On 14 Mar 2014, at 12:34, Sanne Grinovero <sanne(a)infinispan.org> wrote:
On 14 March 2014 09:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
On 13 mars 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote:
On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote:
On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote:
On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:
- also important to notice that we will have both a Hadoop and an
Infinispan cluster running in parallel: the user will interact with the
former in order to run M/R tasks. Hadoop will use Infinispan (integration
achieved through InputFormat and OutputFormat) in order to get the data to
be processed.
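As a sketch of what that wiring could look like on the Hadoop side, the custom formats could be plugged into an otherwise ordinary MRv2 job through the standard job configuration; note the org.infinispan.hadoop.* class names below are assumptions about what the integration might provide, not an existing API:

```xml
<!-- Point a plain MapReduce job at Infinispan-backed formats instead of
     HDFS. The InfinispanInputFormat/InfinispanOutputFormat class names
     are hypothetical placeholders for the planned integration. -->
<property>
  <name>mapreduce.job.inputformat.class</name>
  <value>org.infinispan.hadoop.InfinispanInputFormat</value>
</property>
<property>
  <name>mapreduce.job.outputformat.class</name>
  <value>org.infinispan.hadoop.InfinispanOutputFormat</value>
</property>
```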
Would this be 2 JVMs, or can you trick Hadoop into starting Infinispan as
well -- hence 1 JVM?
Good point, ideally it should be a single VM: reduced serialization cost
(in-VM access) and simpler architecture. That's if you're not using C/S
mode, of course.
?
Don't try confusing us again on that :-)
I think we agreed that the job would *always* run in strict locality
with the data container (i.e. in the same JVM). Sure, a Hadoop client
would be connecting from somewhere else, but that's unrelated.
we did discuss the possibility of running it over Hot Rod though, do you
see a problem with that?
No of course not, we discussed that. I just mean I think that needs to
be clarified on the list that the Hadoop engine will always run in the
same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
native clients, or Hadoop clients over Hot Rod) can indeed connect
remotely, but it's important to clarify that the processing itself
will take advantage of locality in all configurations. In other words,
to clarify that the serialization cost you mention for clients is just
to transfer the job definition and optionally the final processing
result.
Not quite. The serialization cost Mircea mentions is, I think, between the
Hadoop VM and the Infinispan VM on a single node. The serialization does
not require network traffic but is still shuffling data between two
processes, basically. We could eliminate this by starting both Hadoop and
Infinispan from the same VM, but that requires more work than necessary for
a prototype.
Ok, so there was indeed confusion on terminology: I don't agree with that
design.
From an implementor's effort perspective, having to set up a Hot Rod
client rather than embedding an Infinispan node is approximately the
same work, or slightly more as you have to start both. Also, for testing,
embedded mode is easier.
Hot Rod is not meant to be used on the same node, especially not if
you only want to access data in strict locality; for example it
wouldn't be able to iterate over all keys of the current server node
(limiting itself to those keys only). I might be wrong as I'm not too
familiar with Hot Rod, but I think it might not even be able to
iterate over keys at all; maybe today it can via some trick, but the
point is that this is a conceptual mismatch for it.
Where you say this doesn't require network traffic, you need to consider
that while it's true this might not use the physical network wire, being
localhost, the data would still be transferred over a costly network
stream, as we don't do off-heap buffer sharing yet.
So to clarify, we will have a cluster of nodes where each node contains
two JVMs, one running a Hadoop process, one running an Infinispan process.
The Hadoop process would only read data from the Infinispan process on
the same node during a normal M/R execution.
So we discussed two use cases:
- engage Infinispan to accelerate an existing Hadoop deployment
- engage Hadoop to run a Hadoop job on existing data in Infinispan
In neither case do I see why I'd run them in separate JVMs: it seems less
effective and more work to get done, with no benefit, unless you're
thinking about independent JVM tuning? That might be something to
consider, but I doubt tuning independence would ever offset the cost
of serialized transfer of each entry.
The second use case could be served via Hot Rod too, but that's a
different discussion, actually just a nice side effect of Hadoop being
language agnostic that we would take advantage of.
Sanne
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev