[infinispan-dev] Infinispan - Hadoop integration

Fri Mar 14 07:34:37 EDT 2014

On 14 March 2014 09:06, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>
>
>> On 13 mars 2014, at 23:39, Sanne Grinovero <sanne at infinispan.org> wrote:
>>
>>> On 13 March 2014 22:19, Mircea Markus <mmarkus at redhat.com> wrote:
>>>
>>>> On Mar 13, 2014, at 22:17, Sanne Grinovero <sanne at infinispan.org> wrote:
>>>>
>>>>> On 13 March 2014 22:05, Mircea Markus <mmarkus at redhat.com> wrote:
>>>>>
>>>>> On Mar 13, 2014, at 20:59, Ales Justin <ales.justin at gmail.com> wrote:
>>>>>
>>>>>>> - also important to notice that we will have both an Hadoop and an Infinispan cluster running in parallel: the user will interact with the former in order to run M/R tasks. Hadoop will use Infinispan (integration achieved through InputFormat and OutputFormat ) in order to get the data to be processed.
>>>>>>
>>>>>> Would this be 2 JVMs, or can you trick Hadoop to start Infinispan as well -- hence 1 JVM?
>>>>>
>>>>> good point, ideally it should be a single VM: reduced serialization cost (in vm access) and simpler architecture. That's if you're not using C/S mode, of course.
>>>>
>>>> ?
>>>> Don't try confusing us again on that :-)
>>>> I think we agreed that the job would *always* run in strict locality
>>>> with the datacontainer (i.e. in the same JVM). Sure, an Hadoop client
>>>> would be connecting from somewhere else but that's unrelated.
>>>
>>> we did discuss the possibility of running it over hotrod though, do you see a problem with that?
>>
>> No of course not, we discussed that. I just mean I think that needs to
>> be clarified on the list that the Hadoop engine will always run in the
>> same JVM. Clients (be it Hot Rod via new custom commands or Hadoop
>> native clients, or Hadoop clients over Hot Rod) can indeed connect
>> remotely, but it's important to clarify that the processing itself
>> will take advantage of locality in all configurations. In other words,
>> to clarify that the serialization cost you mention for clients is just
>> to transfer the job definition and optionally the final processing
>> result.
>>
>
> Not quite. The serialization cost Mircea mentions I think is between the Hadoop vm and the Infinispan vm on a single node. The serialization does not require network traffic but is still shuffling data between two processes basically. We could eliminate this by starting both Hadoop and Infinispan from the same VM but that requires more work than necessary for a prototype.

Ok, so there was indeed confusion on terminology: I don't agree with that design.
From an implementor's effort perspective, having to set up a Hot Rod
client rather than embedding an Infinispan node is approximately the
same work, or slightly more as you have to start both. Also, embedded
mode is easier to test.

Hot Rod is not meant to be used on the same node, especially not if
you only want to access data in strict locality; for example, it
wouldn't be able to iterate over all keys of the current server node
(while limiting itself to those keys only). I might be wrong as I'm not
too familiar with Hot Rod, but I think it might not even be able to
iterate over keys at all; maybe today it actually can via some trick,
but the point is that this is a conceptual mismatch for it.

Where you say this doesn't require network traffic, you need to consider
that while it's true that, being localhost, this might not use the
physical network wire, the data would still be transferred over a costly
network stream, as we don't do off-heap buffer sharing yet.
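
To make the cost above concrete, here is a minimal sketch using only the
JDK. It is not Infinispan, Hot Rod or Hadoop code: the class and method
names are made up for illustration, a plain HashMap stands in for the
data container, and Java serialization into a byte buffer stands in for
whatever marshaller and localhost socket a two-JVM setup would really
use. It only shows that the in-VM read hands over the stored instance,
while the stream read has to marshal and rebuild every entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not Infinispan, Hot Rod or Hadoop code.
final class LocalityCostSketch {

    // Same JVM: the job receives a reference to the stored value; nothing is copied.
    static byte[] readInVm(Map<String, byte[]> container, String key) {
        return container.get(key);
    }

    // Two JVMs on one node, simulated: the value is marshalled to bytes and
    // unmarshalled again, as it would be when crossing a localhost stream.
    static byte[] readViaStream(Map<String, byte[]> container, String key) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(wire)) {
                out.writeObject(container.get(key));
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(wire.toByteArray()))) {
                return (byte[]) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> container = new HashMap<>();
        byte[] value = new byte[1024];
        container.put("k1", value);

        byte[] local = readInVm(container, "k1");
        byte[] copied = readViaStream(container, "k1");

        System.out.println("in-VM read shares the instance:  " + (local == value));
        System.out.println("stream read shares the instance: " + (copied == value));
        System.out.println("stream read has equal contents:  " + Arrays.equals(copied, value));
    }
}
```

In an M/R run this copy would be paid once per entry read, which is the
cost I'm referring to.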

> So to clarify, we will have a cluster of nodes where each node contains two JVMs, one running a Hadoop process, one running an Infinispan process. The Hadoop process would only read the data from the Infinispan process in the same node during a normal M/R execution.

So we discussed two use cases:
- engage Infinispan to accelerate an existing Hadoop deployment
- engage Hadoop to run a Hadoop job on existing data in Infinispan
In neither case do I see why I'd run them in separate JVMs: it seems
less effective and more work to get done, with no benefit unless you're
thinking about independent JVM tuning? That might be something to
consider, but I doubt tuning independence would ever offset the cost
of serialized transfer of each entry.

The second use case could be served via Hot Rod too, but that's a
different discussion; it's actually just a nice side effect of Hadoop
being language agnostic that we would take advantage of.

Sanne
