Infinispan - Hadoop integration - infinispan-dev

Infinispan - Hadoop integration


Mircea Markus

Thursday, 13 March 2014

2:01 p.m.

Hi,

I had a very good conversation with Jonathan Halliday, Sanne and Emmanuel around the integration between Infinispan and Hadoop. Just to recap, the goal is to be able to run Hadoop M/R tasks on data that is stored in Infinispan in order to gain speed. (Once we have a prototype in place, one of the first tasks will be to validate this speed assumption.)

In previous discussions we explored the idea of providing an HDFS implementation for Infinispan, which, whilst doable, might not be the best integration point:

- In order to run M/R jobs, Hadoop interacts with two interfaces: InputFormat[1] and OutputFormat[2].
- It's the specific InputFormat and OutputFormat implementations that work on top of HDFS.
- Instead of implementing HDFS, we could provide implementations for the InputFormat and OutputFormat interfaces, which would give us more flexibility.
- This seems to be the preferred integration point for other systems, such as Cassandra.
- Also important to note: we will have both a Hadoop and an Infinispan cluster running in parallel. The user will interact with the former in order to run M/R tasks. Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) in order to get the data to be processed.
- Assumptions that we'll need to validate: this approach doesn't impose any constraint on how data is stored in Infinispan and should allow data to be read through the Map interface. Also, InputFormat and OutputFormat implementations would only use the get(k) and keySet() methods, and no native Infinispan M/R, which means that C/S access should also be possible.
- Very important: Hadoop HDFS is an append-only file system, and the M/R tasks operate on a snapshot of the data. From a task's perspective, the data in the storage doesn't change after the task is started. More data can be appended whilst the task runs, but this won't be visible to the task. Infinispan doesn't have such an append structure, nor MVCC. The closest thing we have is the snapshot isolation transactions implemented by the cloudTM project (this is not integrated yet). I assume that the M/R tasks are built with this snapshot-isolation requirement from the storage; this is something that we should investigate as well. It is possible that, in the first stages of this integration, we would require data stored in ISPN to be read-only.

[1] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Inpu...
[2] http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Outp...

Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
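To make the proposed integration point a little more concrete, here is a minimal sketch of how an input side could be driven using only keySet() and get(k) on a plain java.util.Map. It is an illustration under assumptions, not the Hadoop or Infinispan API: the class MapBackedInputSketch and its methods splitKeys and readSplit are hypothetical names, and a real implementation would extend Hadoop's InputFormat (getSplits / createRecordReader) rather than use these stand-ins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the InputFormat idea; NOT the Hadoop or Infinispan API.
final class MapBackedInputSketch {

    // Plays roughly the role of getSplits(): carve the key set into at most numSplits chunks.
    static <K> List<List<K>> splitKeys(Map<K, ?> cache, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<K> keys = new ArrayList<>(cache.keySet());
        int chunk = (keys.size() + numSplits - 1) / numSplits; // ceiling division
        List<List<K>> splits = new ArrayList<>();
        for (int from = 0; from < keys.size(); from += chunk) {
            int to = Math.min(from + chunk, keys.size());
            splits.add(new ArrayList<>(keys.subList(from, to)));
        }
        return splits;
    }

    // Plays roughly the role of a record reader: fetch each key of one split via get(k).
    static <K, V> Map<K, V> readSplit(Map<K, V> cache, List<K> split) {
        Map<K, V> records = new LinkedHashMap<>();
        for (K key : split) {
            V value = cache.get(key);
            if (value != null) {
                // Without snapshot isolation an entry may vanish between keySet() and get(k).
                records.put(key, value);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            cache.put("k" + i, i);
        }
        for (List<String> split : splitKeys(cache, 3)) {
            System.out.println(split + " -> " + readSplit(cache, split));
        }
    }
}
```

The null check in readSplit is where the missing snapshot semantics show up: the key list is computed once, but the values are read later, so the sketch is only safe if the data is treated as read-only while a job runs.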


Ales Justin

Thursday, 13 March

3:59 p.m.

> - also important to notice that we will have both an Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks. Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat) in order to get the data to be processed.

Would this be two JVMs, or can you trick Hadoop into starting Infinispan as well, hence one JVM?

-Ales

Mircea Markus

5:05 p.m.

On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:

> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as well -- hence 1JVM?

Good point. Ideally it should be a single VM: reduced serialization cost (in-VM access) and a simpler architecture. That's if you're not using C/S mode, of course.

Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)

Sanne Grinovero

5:17 p.m.

On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote:

> good point, ideally it should be a single VM: reduced serialization cost (in vm access) and simpler architecture. That's if you're not using C/S mode, of course.

? Don't try confusing us again on that :-)

I think we agreed that the job would *always* run in strict locality with the datacontainer (i.e. in the same JVM). Sure, a Hadoop client would be connecting from somewhere else, but that's unrelated.

_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Mircea Markus

5:19 p.m.

On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote:

> ? Don't try confusing us again on that :-)
> I think we agreed that the job would *always* run in strict locality with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client would be connecting from somewhere else but that's unrelated.

We did discuss the possibility of running it over Hot Rod though. Do you see a problem with that?

Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)

Sanne Grinovero

6:39 p.m.

On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote:

> we did discuss the possibility of running it over hotrod though, do you see a problem with that?

No, of course not; we discussed that. I just mean that it needs to be clarified on the list that the Hadoop engine will always run in the same JVM. Clients (be it Hot Rod via new custom commands, Hadoop native clients, or Hadoop clients over Hot Rod) can indeed connect remotely, but it's important to clarify that the processing itself will take advantage of locality in all configurations. In other words, the serialization cost you mention for clients is just the cost of transferring the job definition and, optionally, the final processing result.

Sanne

Emmanuel Bernard

Friday, 14 March

4:06 a.m.

On 13 March 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote:

> In other words, to clarify that the serialization cost you mention for clients is just to transfer the job definition and optionally the final processing result.

Not quite. The serialization cost Mircea mentions is, I think, between the Hadoop VM and the Infinispan VM on a single node. The serialization does not require network traffic, but it is still shuffling data between two processes. We could eliminate this by starting both Hadoop and Infinispan from the same VM, but that requires more work than necessary for a prototype.

So to clarify, we will have a cluster of nodes where each node contains two JVMs, one running a Hadoop process and one running an Infinispan process. The Hadoop process would only read the data from the Infinispan process on the same node during a normal M/R execution.

Sanne Grinovero

6:34 a.m.

On 14 March 2014 09:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

> Not quite. The serialization cost Mircea mentions I think is between the Hadoop vm and the Infinispan vm on a single node. The serialization does not require network traffic but is still shuffling data between two processes basically. We could eliminate this by starting both Hadoop and Infinispan from the same VM but that requires more work than necessary for a prototype.

OK, so there was indeed confusion on terminology: I don't agree with that design.

From an implementor's effort perspective, having to set up a Hot Rod client rather than embedding an Infinispan node is approximately the same work, or slightly more, as you have to start both. Embedded mode is also easier to test.

Hot Rod is not meant to be used on the same node, especially not if you only want to access data in strict locality; for example, it wouldn't be able to iterate over all keys of the current server node (and limit itself to those keys only). I might be wrong as I'm not too familiar with Hot Rod, but I think it might not even be able to iterate over keys at all; maybe today it actually can via some trick, but the point is that this is a conceptual mismatch for it.

Where you say this doesn't require network traffic, you need to consider that, while it's true this might not use the physical network wire because it is localhost, the data would still be transferred over a costly network stream, as we don't do off-heap buffer sharing yet.

> So to clarify, we will have a cluster of nodes where each node contains two JVM, one running an Hadoop process, one running an Infinispan process. The Hadoop process would only read the data from the Infinispan process in the same node during a normal M/R execution.

So we discussed two use cases:

- engage Infinispan to accelerate an existing Hadoop deployment
- engage Hadoop to run a Hadoop job on existing data in Infinispan

In neither case do I see why I'd run them in separate JVMs: it seems less effective and more work to get done, with no benefit unless you're thinking about independent JVM tuning. That might be something to consider, but I doubt tuning independence would ever offset the cost of serialized transfer of each entry.

The second use case could be used via Hot Rod too, but that's a different discussion; it's actually just a nice side effect of Hadoop being language agnostic that we would take advantage of.

Sanne
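The per-entry cost under discussion can be illustrated with a rough sketch. It assumes plain Java serialization as a stand-in for whatever marshalling a two-JVM setup would actually use (Infinispan and Hot Rod have their own marshallers, which are not shown here), and the class TransferCostSketch with its marshal and unmarshal helpers is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; Java serialization stands in for the real wire format.
final class TransferCostSketch {

    // What leaving the data-owning JVM implies: the value becomes bytes.
    static byte[] marshal(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // What the receiving JVM has to do before a task can use the value.
    static Object unmarshal(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("k1", "some value");

        // Same JVM: get(k) hands back the stored reference, nothing is copied.
        String inVm = cache.get("k1");
        System.out.println("in-VM read returns stored instance: " + (inVm == cache.get("k1")));

        // Separate JVMs, even on localhost: every entry is marshalled, streamed and unmarshalled.
        byte[] wire = marshal(inVm);
        Object copy = unmarshal(wire);
        System.out.println("bytes moved: " + wire.length + ", equal after round trip: " + copy.equals(inVm));
    }
}
```

The sketch leaves out the socket itself, so it understates the two-JVM cost; it only shows that a copy and a byte stream are unavoidable once the reader is outside the JVM that holds the data.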


Emmanuel Bernard

Monday, 17 March

10:31 a.m.

Got it now. That being said, if Alan is correct (one JVM per M/R task run per node), we will need to implement C/S local key and keyset lookup.

Emmanuel
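A local key lookup of the kind mentioned above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the class LocalKeySketch, the ownerOf function, and the modulo hashing are hypothetical stand-ins, whereas a real implementation would have to ask Infinispan's own consistent hash which node owns a key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "local keys only"; not an Infinispan or Hot Rod API.
final class LocalKeySketch {

    // Stand-in owner function: modulo hashing instead of real consistent hashing.
    static int ownerOf(Object key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    // Returns only the keys that the given node owns, so a task never reads remote entries.
    static <K> List<K> localKeys(Map<K, ?> cache, int nodeIndex, int numNodes) {
        List<K> local = new ArrayList<>();
        for (K key : cache.keySet()) {
            if (ownerOf(key, numNodes) == nodeIndex) {
                local.add(key);
            }
        }
        return local;
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = new LinkedHashMap<>();
        for (int i = 0; i < 9; i++) {
            cache.put(i, "v" + i);
        }
        for (int node = 0; node < 3; node++) {
            System.out.println("node " + node + " owns " + localKeys(cache, node, 3));
        }
    }
}
```

Filtering on the client side like this still walks the whole key set; the point of a server-side lookup would be to perform the same filtering where the data lives and return only the local keys.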

Alan Field

11:06 a.m.

From: "Emmanuel Bernard" <emmanuel(a)hibernate.org>
To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
Sent: Monday, March 17, 2014 11:31:34 AM
Subject: Re: [infinispan-dev] Infinispan - Hadoop integration

> Got it now. That being said, if Alan is correct (one JVM per M/R task run per node), we will need to implement C/S local key and keyset lookup.

Gustavo Fernandes

11:49 a.m.

...

For Map/Reduce v1 this is definitely the case: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Task+Execution... "The TaskTracker executes the Mapper/ Reducer *task* as a child process in a separate jvm." I believe this is also the case for Map/Reduce v2, but I haven't found a definitive reference in the docs yet. YARN is architected to split resource management and job scheduling/monitoring into different pieces, but I think task execution is the same as MRv1. Thanks, Alan ------------------------------ *From: *"Emmanuel Bernard" <emmanuel(a)hibernate.org> *To: *"infinispan -Dev List" <infinispan-dev(a)lists.jboss.org> *Sent: *Monday, March 17, 2014 11:31:34 AM *Subject: *Re: [infinispan-dev] Infinispan - Hadoop integration Got it now. That being said, if Alan is correct (one JVM per M/R task run per node), we will need to implement C/S local key and keyset lookup. Emmanuel On 14 Mar 2014, at 12:34, Sanne Grinovero <sanne(a)infinispan.org> wrote: On 14 March 2014 09:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote: On 13 mars 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote: On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote: On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote: On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote: On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote: - also important to notice that we will have both an Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks. Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat ) in order to get the data to be processed. Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as well -- hence 1JVM? good point, ideally it should be a single VM: reduced serialization cost (in vm access) and simpler architecture. That's if you're not using C/S mode, of course. ? 
Don't try confusing us again on that :-) I think we agreed that the job would *always* run in strict locality with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client would be connecting from somewhere else but that's unrelated. we did discuss the possibility of running it over hotrod though, do you see a problem with that? No of course not, we discussed that. I just mean I think that needs to be clarified on the list that the Hadoop engine will always run in the same JVM. Clients (be it Hot Rod via new custom commands or Hadoop native clients, or Hadoop clients over Hot Rod) can indeed connect remotely, but it's important to clarify that the processing itself will take advantage of locality in all configurations. In other words, to clarify that the serialization cost you mention for clients is just to transfer the job definition and optionally the final processing result. Not quite. The serialization cost Mircea mentions I think is between the Hadoop vm and the Infinispan vm on a single node. The serialization does not require network traffic but is still shuffling data between two processes basically. We could eliminate this by starting both Hadoop and Infinispan from the same VM but that requires more work than necessary for a prototype. Ok so there was indeed confusion on terminology: I don't agree with that design. From an implementor's effort perspective having to setup an Hot Rod client rather than embedding an Infinispan node is approximately the same work, or slightly more as you have to start both. Also to test it, embedded mode it easier. Hot Rod is not meant to be used on the same node, especially not if you only want to access data in strict locality; for example it wouldn't be able to iterated on all keys of the current server node (and limiting to those keys only). 
I might be wrong as I'm not too familiar with Hot Rod, but I think it might not even be able to iterate on keys at all; maybe today it can actually via some trick, but the point is this is a conceptual mismatch for it. Where you say this doesn't require nework traffic you need to consider that while it's true this might not be using the physical network wire being localhost, it would still be transferred over a costly network stream, as we don't do off-heap buffer sharing yet. So to clarify, we will have a cluster of nodes where each node contains two JVM, one running an Hadoop process, one running an Infinispan process. The Hadoop process would only read the data from the Infinispan process in the same node during a normal M/R execution. So we discussed two use cases: - engage Infinispan to accelerate an existing Hadoop deployment - engage Hadoop to run an Hadoop job on existing data in Infinispan In neither case I see why I'd run them in separate JVMs: seems less effective and more work to get done, and no benefit unless you're thinking about independent JVM tuning? That might be something to consider, but I doubt tuning independence would ever offset the cost of serialized transfer of each entry. The second use case could be used via Hot Rod too, but that's a different discussion, actually just a nice side effect of Hadoop being language agnostic that we would take advantage of. 
Sanne
_______________________________________________
infinispan-dev mailing list infinispan-dev(a)lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
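The point about iterating only over the keys held by the current node can be illustrated with a minimal sketch. It assumes the data is reachable through a plain `java.util.Map` (only `keySet()` and `get(k)` are used) and that some ownership test is available; `LocalKeyScan`, `localEntries` and the `ownedLocally` predicate are hypothetical names for illustration, not Infinispan or Hadoop APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are Infinispan or Hadoop APIs.
final class LocalKeyScan {

    // Returns only the entries whose key is owned by the current node.
    // 'ownedLocally' stands in for whatever ownership test the grid exposes.
    static <K, V> Map<K, V> localEntries(Map<K, V> cache, Predicate<K> ownedLocally) {
        Map<K, V> result = new LinkedHashMap<>();
        for (K key : cache.keySet()) {        // keySet() enumerates candidate keys
            if (ownedLocally.test(key)) {
                V value = cache.get(key);     // get(k) fetches the value
                if (value != null) {          // the entry may have been removed meanwhile
                    result.put(key, value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new LinkedHashMap<>();
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3");
        // Toy ownership rule for the demo only: this node owns every key except "b".
        Map<String, String> local = localEntries(cache, k -> !k.equals("b"));
        System.out.println(local);
    }
}
```

In embedded mode the map and the ownership test live in the same JVM as the scan; over a client/server protocol both the key enumeration and each `get(k)` would cross a process boundary, which is the mismatch raised in the message above.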

Emmanuel Bernard

11:54 a.m.

Well, that's only a guesstimate, but if I had to put a number on it, this approach is going to be shit slow +- 10% compared to what Infinispan M/R does (despite all of its limitations). We can do the prototype, but at some stage we might want to take over and replace some of that JVM spawn logic, especially for the use case where the grid is used in parallel to the Hadoop M/R.

On 17 Mar 2014, at 17:49, Gustavo Fernandes <gustavonalle(a)gmail.com> wrote:

...

Yes, M/R v1 by default launches one *new* JVM per task, so during the execution of a job there could be dozens of JVMs running in parallel on a node at any given moment, each of which is destroyed when its task (map or reduce) finishes.

It is possible to instruct the map reduce system to reuse the same JVM for several map or reduce tasks: this is interesting when map tasks execute in a matter of seconds and the overhead of creating, warming up and destroying a JVM becomes significant. But even in this case, there will be 'n' JVMs running, where 'n' is the task capacity of the node. The difference is that they are recycled.

In YARN the behaviour is similar: the YarnChild runs in a separate JVM, and it's possible to get some reuse by setting the property "mapreduce.job.ubertask.enable".

Apart from all those transient task JVMs, there will be more long-running JVMs on each node: the TaskTracker (which accepts tasks and sends status to the global JobTracker) and, if HDFS is used, one extra JVM per node (DataNode), plus one or two global NameNode processes. Hadoop is very fond of JVMs.

Cheers, Gustavo

On Mon, Mar 17, 2014 at 4:06 PM, Alan Field <afield(a)redhat.com> wrote:

For Map/Reduce v1 this is definitely the case: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Task+Execution... "The TaskTracker executes the Mapper/ Reducer task as a child process in a separate jvm." I believe this is also the case for Map/Reduce v2, but I haven't found a definitive reference in the docs yet. YARN is architected to split resource management and job scheduling/monitoring into different pieces, but I think task execution is the same as MRv1. Thanks, Alan

From: "Emmanuel Bernard" <emmanuel(a)hibernate.org> To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org> Sent: Monday, March 17, 2014 11:31:34 AM Subject: Re: [infinispan-dev] Infinispan - Hadoop integration

Got it now.
That being said, if Alan is correct (one JVM per M/R task run per node), we will need to implement C/S local key and keyset lookup. Emmanuel

On 14 Mar 2014, at 12:34, Sanne Grinovero <sanne(a)infinispan.org> wrote: On 14 March 2014 09:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote: On 13 mars 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote: On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote: On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote: On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote: On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:

- also important to notice that we will have both an Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks. Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat ) in order to get the data to be processed.

Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as well -- hence 1JVM?

good point, ideally it should be a single VM: reduced serialization cost (in vm access) and simpler architecture. That's if you're not using C/S mode, of course.
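The effect of JVM reuse described in Gustavo's message can be sketched as simple arithmetic. This is a deliberately simplified model, not Hadoop code: it assumes the tasks of one job are spread evenly over the node's task slots, and it borrows the convention (as far as I recall, that of the MRv1 property `mapred.job.reuse.jvm.num.tasks`) that a limit of 1 means no reuse and -1 means unlimited reuse. `JvmLaunchModel` and `jvmLaunches` are hypothetical names.

```java
// Simplified model, not Hadoop code: assumes tasks are spread evenly over slots.
final class JvmLaunchModel {

    // tasks:      tasks of one job scheduled on this node
    // slots:      task capacity of the node ('n' in the message above)
    // reuseLimit: tasks one child JVM may run; 1 = no reuse, -1 = unlimited
    static int jvmLaunches(int tasks, int slots, int reuseLimit) {
        if (slots <= 0 || reuseLimit == 0 || reuseLimit < -1) {
            throw new IllegalArgumentException("invalid slots or reuse limit");
        }
        if (tasks <= 0) {
            return 0;
        }
        // At least one JVM per busy slot is needed, reuse or not.
        int concurrent = Math.min(tasks, slots);
        if (reuseLimit == -1) {
            return concurrent;                 // JVMs are recycled indefinitely
        }
        // Each JVM retires after 'reuseLimit' tasks: ceil(tasks / reuseLimit).
        int byLimit = (tasks + reuseLimit - 1) / reuseLimit;
        return Math.max(concurrent, byLimit);
    }

    public static void main(String[] args) {
        System.out.println("no reuse:        " + jvmLaunches(12, 4, 1));
        System.out.println("unlimited reuse: " + jvmLaunches(12, 4, -1));
    }
}
```

Under this model, 12 tasks on a node with 4 slots cost 12 JVM launches without reuse and 4 with unlimited reuse; in both cases up to 4 JVMs run concurrently, which matches the remark that the JVMs are recycled rather than reduced in number.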

Mircea Markus

Friday, 14 March

10:26 a.m.

On Mar 14, 2014, at 9:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

thanks for the clarification, indeed this is the serialization overhead I had in mind.

...

Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)
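The serialization overhead discussed here can be sketched as follows. The example assumes plain Java serialization as a stand-in for whatever marshalling a real two-JVM setup would use (Infinispan has its own marshalling, which is not shown), and `TransferCost` and `copyViaSerialization` are hypothetical names. It only demonstrates the structural difference: an in-VM read hands back the same object, while a cross-process read has to encode the entry to bytes and rebuild a copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Java serialization stands in for the real marshalling.
final class TransferCost {

    // Simulates handing a value to another process: encode to bytes, then decode.
    static Object copyViaSerialization(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, ArrayList<String>> cache = new HashMap<>();
        cache.put("k", new ArrayList<>(Arrays.asList("x", "y")));

        ArrayList<String> stored = cache.get("k");
        ArrayList<String> inVm = cache.get("k");          // same JVM: just a reference
        Object crossProcess = copyViaSerialization(stored); // two JVMs: a rebuilt copy

        System.out.println("in-VM read is the stored object: " + (inVm == stored));
        System.out.println("cross-process read is a copy:    " + (crossProcess != stored));
        System.out.println("copy has equal content:          " + crossProcess.equals(stored));
    }
}
```

The round trip is paid once per entry read by a task, which is why the thread treats it as the main cost of keeping Hadoop and Infinispan in separate JVMs on the same node.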

Alan Field

1:22 p.m.

Hey, First off, I think integrating Infinispan with Hadoop using the InputFormat and OutputFormat interfaces, instead of the file system interfaces, is a really good idea. The file system route is the approach that Gluster is using, but I don't think it's ideal for Infinispan.

...

A regular non-master Hadoop node runs a DataNode daemon and a TaskTracker daemon in separate JVMs. Each Map/Reduce task is also executed in a separate JVM. In Hadoop 2.0, the TaskTracker daemon is replaced by a NodeManager daemon. This is the minimal number of JVM processes needed by a node, and there will be more if other services are running in the cluster.

The first point I am trying to make is that there are many different JVM processes running on a Hadoop node. So I am trying to understand which JVM process you are talking about when you say the "Hadoop process". I *think* this refers to the actual Map/Reduce task, but I'm not sure.

I would also say that figuring out how to integrate Infinispan with Apache Spark [1] would be interesting. I don't know as much about it, but a lot of processing that is currently being performed in Map/Reduce will be migrated to Spark. It's certainly being billed as the replacement, or at least the future, of Map/Reduce.

Thanks, Alan

[1] https://spark.apache.org/

----- Original Message -----

...

From: "Mircea Markus" <mmarkus(a)redhat.com>
To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
Sent: Friday, March 14, 2014 11:26:03 AM
Subject: Re: [infinispan-dev] Infinispan - Hadoop integration

On Mar 14, 2014, at 9:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

>
>
>> On 13 mars 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote:
>>
>>> On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
>>>
>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote:
>>>>
>>>>> On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote:
>>>>>
>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:
>>>>>
>>>>>>> - also important to notice that we will have both an Hadoop and an
>>>>>>> Infinispan cluster running in parallel: the user will interact with
>>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan
>>>>>>> (integration achieved through InputFormat and OutputFormat ) in
>>>>>>> order to get the data to be processed.
>>>>>>
>>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as
>>>>>> well -- hence 1JVM?
>>>>>
>>>>> good point, ideally it should be a single VM: reduced serialization
>>>>> cost (in vm access) and simpler architecture. That's if you're not
>>>>> using C/S mode, of course.
>>>>
>>>> ?
>>>> Don't try confusing us again on that :-)
>>>> I think we agreed that the job would *always* run in strict locality
>>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
>>>> would be connecting from somewhere else but that's unrelated.
>>>
>>> we did discuss the possibility of running it over hotrod though, do you
>>> see a problem with that?
>>
>> No of course not, we discussed that. I just mean I think that needs to
>> be clarified on the list that the Hadoop engine will always run in the
>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
>> native clients, or Hadoop clients over Hot Rod) can indeed connect
>> remotely, but it's important to clarify that the processing itself
>> will take advantage of locality in all configurations. In other words,
>> to clarify that the serialization cost you mention for clients is just
>> to transfer the job definition and optionally the final processing
>> result.
>>
>
> Not quite. The serialization cost Mircea mentions I think is between the
> Hadoop vm and the Infinispan vm on a single node. The serialization does
> not require network traffic but is still shuffling data between two
> processes basically. We could eliminate this by starting both Hadoop and
> Infinispan from the same VM but that requires more work than necessary for
> a prototype.

thanks for the clarification, indeed this is the serialization overhead I had in mind.

>
> So to clarify, we will have a cluster of nodes where each node contains two
> JVM, one running an Hadoop process, one running an Infinispan process. The
> Hadoop process would only read the data from the Infinispan process in the
> same node during a normal M/R execution.

Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)

participants (6)

Alan Field
Ales Justin
Emmanuel Bernard
Gustavo Fernandes
Mircea Markus
Sanne Grinovero
