Hey,
First off, I think integrating Infinispan with Hadoop through the InputFormat and
OutputFormat interfaces, rather than through the file system interfaces, is a
really good idea. The file-system route is the approach Gluster takes, but I
don't think it's ideal for Infinispan.
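To make the idea concrete, here is a rough, dependency-free sketch of how an Infinispan-backed InputFormat could compute locality-aware splits. In real code this would implement org.apache.hadoop.mapreduce.InputFormat and get segment ownership from Infinispan's consistent hash; every class and method name below is hypothetical, just illustrating the shape of the integration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Dependency-free sketch. A real implementation would extend
// org.apache.hadoop.mapreduce.InputFormat and query Infinispan for
// segment ownership; all names here are hypothetical.
public class InfinispanSplitSketch {

    // One split per Infinispan node: Hadoop schedules the Map Task for a
    // split near its location hint, so cache reads stay node-local.
    static final class Split {
        final String location;          // node that owns the segments
        final List<Integer> segments;   // cache segments this split scans

        Split(String location, List<Integer> segments) {
            this.location = location;
            this.segments = segments;
        }
    }

    // Turn a (node -> owned segments) map into one split per node.
    static List<Split> computeSplits(Map<String, List<Integer>> segmentsByOwner) {
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : segmentsByOwner.entrySet()) {
            splits.add(new Split(e.getKey(), e.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical 4-segment cache distributed over two nodes.
        Map<String, List<Integer>> owners = new LinkedHashMap<>();
        owners.put("node-a", List.of(0, 1));
        owners.put("node-b", List.of(2, 3));
        for (Split s : computeSplits(owners)) {
            System.out.println(s.location + " -> " + s.segments);
        }
    }
}
```

The matching RecordReader for a split would then iterate only the keys in that split's segments against the co-located Infinispan node, which is what keeps the data transfer in-process (or at worst over loopback) rather than across the network.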
So to clarify, we will have a cluster of nodes where each node contains two
JVMs: one running a Hadoop process and one running an Infinispan process. The
Hadoop process would only read data from the Infinispan process on the same
node during a normal M/R execution.
A regular non-master Hadoop node runs a DataNode daemon and a TaskTracker daemon in
separate JVMs, and each Map/Reduce Task is also executed in its own JVM. In Hadoop 2.0,
the TaskTracker daemon is replaced by a NodeManager daemon. That is the minimal number of
JVM processes needed by a node, and there will be more if other services are running in
the cluster. The first point I am trying to make is that there are many different JVM
processes running on a Hadoop node. The second is that I am trying to understand which
JVM process you mean when you say the "Hadoop process". I *think* this refers
to the actual Map/Reduce Task, but I'm not sure.
I would also say that figuring out how to integrate Infinispan with Apache Spark
[1] would be interesting. I don't know as much about it, but a lot of processing
that is currently performed in Map/Reduce will be migrated to Spark. It's
certainly being billed as the replacement for, or at least the future of, Map/Reduce.
Thanks,
Alan
[1] https://spark.apache.org/
----- Original Message -----
From: "Mircea Markus" <mmarkus(a)redhat.com>
To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
Sent: Friday, March 14, 2014 11:26:03 AM
Subject: Re: [infinispan-dev] Infinispan - Hadoop integration
On Mar 14, 2014, at 9:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>
>
>> On 13 mars 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote:
>>
>>> On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
>>>
>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote:
>>>>
>>>>> On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote:
>>>>>
>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:
>>>>>
>>>>>>> - also important to notice that we will have both a Hadoop and an
>>>>>>> Infinispan cluster running in parallel: the user will interact with
>>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan
>>>>>>> (integration achieved through InputFormat and OutputFormat) in
>>>>>>> order to get the data to be processed.
>>>>>>
>>>>>> Would this be 2 JVMs, or you can trick Hadoop to start Infinispan as
>>>>>> well -- hence 1 JVM?
>>>>>
>>>>> good point, ideally it should be a single VM: reduced serialization
>>>>> cost (in-VM access) and simpler architecture. That's if you're not
>>>>> using C/S mode, of course.
>>>>
>>>> ?
>>>> Don't try confusing us again on that :-)
>>>> I think we agreed that the job would *always* run in strict locality
>>>> with the data container (i.e. in the same JVM). Sure, a Hadoop client
>>>> would be connecting from somewhere else but that's unrelated.
>>>
>>> we did discuss the possibility of running it over hotrod though, do you
>>> see a problem with that?
>>
>> No of course not, we discussed that. I just mean I think that needs to
>> be clarified on the list that the Hadoop engine will always run in the
>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
>> native clients, or Hadoop clients over Hot Rod) can indeed connect
>> remotely, but it's important to clarify that the processing itself
>> will take advantage of locality in all configurations. In other words,
>> to clarify that the serialization cost you mention for clients is just
>> to transfer the job definition and optionally the final processing
>> result.
>>
>
> Not quite. The serialization cost Mircea mentions I think is between the
> Hadoop vm and the Infinispan vm on a single node. The serialization does
> not require network traffic but is still shuffling data between two
> processes basically. We could eliminate this by starting both Hadoop and
> Infinispan from the same VM but that requires more work than necessary for
> a prototype.
thanks for the clarification, indeed this is the serialization overhead I had
in mind.
>
> So to clarify, we will have a cluster of nodes where each node contains two
> JVMs, one running a Hadoop process, one running an Infinispan process. The
> Hadoop process would only read the data from the Infinispan process on the
> same node during a normal M/R execution.
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev