On Mar 14, 2014, at 9:06, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
> On 13 March 2014, at 23:39, Sanne Grinovero <sanne(a)infinispan.org> wrote:
>
>> On 13 March 2014 22:19, Mircea Markus <mmarkus(a)redhat.com> wrote:
>>
>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne(a)infinispan.org> wrote:
>>>
>>>> On 13 March 2014 22:05, Mircea Markus <mmarkus(a)redhat.com> wrote:
>>>>
>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.justin(a)gmail.com> wrote:
>>>>
>>>>>> - It is also important to note that we will have both a Hadoop and
>>>>>> an Infinispan cluster running in parallel: the user will interact with
>>>>>> the former in order to run M/R tasks. Hadoop will use Infinispan
>>>>>> (integration achieved through InputFormat and OutputFormat) in order
>>>>>> to get the data to be processed.
>>>>>
>>>>> Would this be 2 JVMs, or can you trick Hadoop into starting Infinispan
>>>>> as well, hence 1 JVM?
>>>>
>>>> Good point. Ideally it should be a single VM: reduced serialization cost
>>>> (in-VM access) and a simpler architecture. That's if you're not using
>>>> C/S mode, of course.
>>>
>>> ?
>>> Don't try confusing us again on that :-)
>>> I think we agreed that the job would *always* run in strict locality
>>> with the datacontainer (i.e. in the same JVM). Sure, a Hadoop client
>>> would be connecting from somewhere else but that's unrelated.
>>
>> We did discuss the possibility of running it over Hot Rod though; do you
>> see a problem with that?
>
> No, of course not, we discussed that. I just mean I think that needs to
> be clarified on the list that the Hadoop engine will always run in the
> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
> native clients, or Hadoop clients over Hot Rod) can indeed connect
> remotely, but it's important to clarify that the processing itself
> will take advantage of locality in all configurations. In other words,
> to clarify that the serialization cost you mention for clients is just
> to transfer the job definition and optionally the final processing
> result.
>
> Not quite. The serialization cost Mircea mentions, I think, is between the Hadoop VM
> and the Infinispan VM on a single node. That serialization does not require network
> traffic, but it still shuffles data between two processes. We could eliminate this by
> starting both Hadoop and Infinispan in the same VM, but that requires more work than
> necessary for a prototype.

Thanks for the clarification; indeed, this is the serialization overhead I had in mind.
So to clarify, we will have a cluster of nodes where each node contains two JVMs, one
running a Hadoop process and one running an Infinispan process. During a normal M/R
execution, the Hadoop process would only read data from the Infinispan process on the
same node.
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)