[hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Mon Mar 4 13:26:22 EST 2013

We finished this discussion on IRC; here is the log in case someone else is interested:

<sanne> hum I forgot the first step.. transformation from entry into entity
<sanne> updated
<sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
head against, but let's assume he finds a workaround and we focus on
the pattern as first step?
<emmanuel> https://gist.github.com/emmanuelbernard/5084039
<emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
<emmanuel> assuming pushToExecutor pushes to whatever concurrent work
mechanism you planned to use on consumes
<emmanuel> Plus I am not following exactly how you plan consumes(Entry)
to be executed concurrently
<emmanuel> is that the GridDialect responsibility?
<emmanuel> That looks like a lot of work on the dialect's side
<sanne> emmanuel, imagine the backend is Infinispan and has some large
amount of data per node, plus that each node has its own backend
IndexManager (like an ideal sharding)
<emmanuel> i.e. pool management and cap + queuing
<sanne> then with your approach the iterator needs to fetch data from
all remote nodes, and then enqueue in a local blocking queue which is
returning the data to the original owners
<sanne> but if you skip that step, you can just forward the stateless
consumer to each node and have it run on data locality
<emmanuel> I was thinking that if you had the Lucene index locally on
each node you would have a different impl of the MassIndexer anyways
<emmanuel> that would simply send a command to each local node
<sanne> To answer your question: that would be an optional GridDialect
responsibility. I would endorse a trivial first draft doing a
single-threaded loop.
<emmanuel> and have GridDialect.getDataFor() return local data
<sanne> The "consumes" implementation can be implemented with a
simple iterator - as in your design - so I don't think it pushes much
complexity to the GridDialect implementor?
<sanne> The benefit of the consumer is that *optionally* it can be
mapped on the Map phase, and that's trivial if your backend supports
Map/Reduce
<emmanuel> sanne: I don't follow that, sorry
<emmanuel> how does that make it mappable to the Map phase?
<sanne> "public void consume(Entry e) " is a degenerate (simplified)
form of map.
<sanne> mm infinispan IDE crashes at the right moment.
<emmanuel> I thought Map was about *filtering*
<emmanuel> not processing
<sanne> you can decide to accept 100% of values (without filtering),
but actually you might want to filter on the specified tables only.
<sanne> also, the return type doesn't have to match the input type:
hence you define a transformation function, which is inherently
applied in parallel on all matching entries.
<emmanuel> sanne: but then you require the OGM code to be everywhere
(i.e. on each node of the target NoSQL)
<emmanuel> to be able to do tuple -> entity
<emmanuel> that's not realistic
<emmanuel> assuming your transform phase is about tuple -> entity and
some HSearch ops
<sanne> yes right
<sanne> but isn't it worth it? it's optional and much more efficient,
as you avoid transferring any data.
<sanne> btw we often assume all nodes in the grid are equally
configured, so having same apps & libraries deployed.
<emmanuel> sanne: let me try and summarize what I understand
<emmanuel> it's more efficient if you store the Lucene index locally
with the data, and if the grid is written in Java or at least can run
code in Java including libraries and if you distribute the OGM
configuration across the whole grid
<emmanuel> Otherwise, it does not make any difference
<emmanuel> Also the GridDialect implementation needs to know if you are
doing this trick to only return local data
<sanne> no there are other drawbacks which get defeated, but minor so
I didn't mention them
<emmanuel> am I right?
<sanne> mainly, you skip the need for the contention point as there
is no push to a shared blocking queue
<sanne> no the GridDialect doesn't need to know.
<emmanuel> sanne: sure if you can process the code on each node you
avoid the shared blocking queue, at least until you reach the
IndexManager
<sanne> you'll just forward a simple (standard) M/R task, and it will
need to execute it as always.
<sanne> the IndexManager is parallel ;)
<emmanuel> sanne: parallel on a single node
<sanne> yes, but no contention points other than the internal
structure of the IW
<emmanuel> I mean updating the index for a given table is better done
on a single node
<sanne> IndexWriter
<emmanuel> sorry I meant IndexWriter
<emmanuel> ah but you mention perfect sharding
<emmanuel> you need cosmological alignment for this shit to happen
<sanne> not if we plan for it :)
<sanne> you might remember the changes to Segments in the ISPN code,
to accommodate index storage consistent with the data locality
<sanne> that's expected in 6.0
<emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
<emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
<emmanuel> as you need to send the Consumer impl
<emmanuel> not simply use it
<sanne> hu, it needs a reference to the current SearchFactory at very least
<emmanuel> sanne: but you're telling me you send the M/R task
<emmanuel> so you need to send the M/R code as well
<sanne> yes but here we enter Infinispan-specific implementation
<sanne> I would register the needed components in Infinispan and use
the ServiceRegistry to look them up remotely
<sanne> not to mention Infinispan could accommodate a custom command for it
<emmanuel> What I am saying is that you don't pass the Consumer
*instance* to the grid dialect but rather the impl, no?
<sanne> the impl class definition?
<emmanuel> sanne: you tell me. How do I send M/R code today?
<emmanuel> certainly not an impl instance
<sanne> yes you do
<sanne> JBMar will take care of it, including state.
<sanne> but in this case that would be wrong of course as I don't want
to serialize the whole SearchFactory so I'd use injection and lookup,
but that's a detail of Infinispan.
<sanne> But this shouldn't be MassIndexer specific right? it's good to
expose a general "execute on all" method, and I think accepting
instances would make life easier for most - even though we might need
to document some limitations.
<emmanuel> alright, I guess I'll have to live with a visitor pattern
for a feature that has 5% chance of happening :)
<sanne> I'm going to punch Davide
<sanne> as he's yelling "it's not a visitor" but doesn't have the guts
to write it down :)
<emmanuel> sanne: DavideD would have nothing to do with it, that
requires a lot of config and Infinispan machinery I'm not sure is here
today
<DavideD> :)
<emmanuel> ah
<emmanuel> I don't care how it's called, it's one of those patterns
that make the code harder to follow
<DavideD> I was actually trying to remember the name of the pattern
<sanne> ok now we agree :)
<emmanuel> Obfuscator pattern family
<sanne> very popular among consultants, I don't understand why you complain :P
<sanne> Anyway, let's wrap up and broaden the horizon:
<emmanuel> ok so we are left with finding how to load an entity from a tuple
<sanne> you don't think it's useful as a general purpose method?
<emmanuel> sanne: will be for queries
<emmanuel> It's just that it's non obvious
<sanne> Exactly. Also I think lambda methods are getting much better known.
<emmanuel> syntactically yes
<emmanuel> VM wise, perf improvements will come later
<sanne> what I mean is that by defining the SPI this way, I don't
expect it to be more complex for the GridDialect implementors, while
we can reuse it for a wider scope of needs.
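
To make the SPI shape discussed above concrete, here is a minimal sketch of a consumer-based scan method with the "trivial first draft" single-threaded loop. Everything in it is an assumption for illustration: SketchTuple, ScanningDialect, forEachTuple and InMemoryDialect are hypothetical names, not the actual OGM GridDialect contract, and the in-memory list stands in for a real datastore.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for an OGM tuple: a table name plus column values.
final class SketchTuple {
    final String table;
    final Map<String, Object> columns;

    SketchTuple(String table, Map<String, Object> columns) {
        this.table = table;
        this.columns = columns;
    }
}

// Hypothetical SPI method; the real GridDialect contract is not settled in this thread.
interface ScanningDialect {
    void forEachTuple(Consumer<SketchTuple> consumer, String... tables);
}

// Trivial first draft: a single-threaded loop over in-memory data.
final class InMemoryDialect implements ScanningDialect {
    private final List<SketchTuple> store = new ArrayList<>();

    void put(String table, String column, Object value) {
        Map<String, Object> columns = new LinkedHashMap<>();
        columns.put(column, value);
        store.add(new SketchTuple(table, columns));
    }

    @Override
    public void forEachTuple(Consumer<SketchTuple> consumer, String... tables) {
        List<String> wanted = Arrays.asList(tables);
        for (SketchTuple tuple : store) {
            if (wanted.contains(tuple.table)) { // filter on the requested tables only
                consumer.accept(tuple);
            }
        }
    }
}

class ConsumerSpiSketch {
    public static void main(String[] args) {
        InMemoryDialect dialect = new InMemoryDialect();
        dialect.put("Book", "title", "Dune");
        dialect.put("Author", "name", "Herbert");
        dialect.put("Book", "title", "Emma");

        List<Object> indexed = new ArrayList<>();
        // The caller supplies the processing step; the dialect only decides how to iterate.
        dialect.forEachTuple(t -> indexed.add(t.columns.get("title")), "Book");
        System.out.println(indexed); // prints [Dune, Emma]
    }
}
```

The dialect implementor writes nothing Hibernate Search specific here; a backend with Map/Reduce support could replace the loop with a distributed execution without changing the caller.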

 --Sanne

On 4 March 2013 17:02, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>
>
> On 4 March 2013, at 17:39, Sanne Grinovero <sanne at hibernate.org> wrote:
>
>> On 4 March 2013 16:20, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>>> I already gave what I knew on how to load an entity from a tuple (which
>>> isn't much) but we can try and dig together. Something I thought about
>>> is that ORM probably has a mechanism to load an entity from a resultset
>>> via the query parser. And that probably looks also like the second half
>>> of OgmLoader.load. We could look at this part and see if we can make an
>>> OGM version of it. We never had the need before as we never had query
>>> support (the way SQL does it).
>>
>> I would also need to study the ORM code, but to add a high level observation,
>> the methods currently defined by the GridDialect are focusing on
>> loading from well known key instances,
>> there is nothing that makes us able to scan/inspect for all values.
>>
>> In other words: even if we wanted to load keys first, we don't have definitions
>> of functions from raw->primary key instances either.
>
> I understand that. I'm not denying the need for the method.
>
>>
>>
>>> On the visitor vs Iterator approach, I still don't see how implementing
>>> an Iterator on a map / reduce backend would be harder than the visitor
>>> but maybe I'm missing something.
>>>
>>>    class IteratorAsStream {
>>>        final Query someMapReduceQuery = ...;
>>>
>>>        public Object next() {
>>>            if (!someMapReduceQuery.started()) {
>>>                // execute and collect results in parallel
>>>                someMapReduceQuery.execute();
>>>            }
>>>            Object result = someMapReduceQuery.getNextOrBlock();
>>>            return result;
>>>        }
>>>    }
>>
>> That could work to *load* all entities in parallel, but I'd like to
>> process the entities in parallel as well.
>> And I'd rather not force the GridDialect implementors to write some
>> Hibernate Search specific code,
>> so to break out we need some form of "Execute X on each": a closure or a lambda.
>>
>
> I can't see how the visitor model helps in your processing of entities in parallel. To me both approaches are strictly equivalent. Care to show some pseudo-code?
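
As an illustration of the "Execute X on each" idea debated above (not taken from the thread, and not the participants' actual design), the following minimal sketch runs a caller-supplied consumer once per partition in parallel. PartitionedStore and forEachEntry are hypothetical names, and local threads stand in for grid nodes; a real backend would ship the task to each node instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical: each inner list stands for the entries owned by one grid node.
final class PartitionedStore {
    private final List<List<String>> partitions;

    PartitionedStore(List<List<String>> partitions) {
        this.partitions = partitions;
    }

    // One task per partition: the consumer runs next to its data, and no entry
    // is funneled through a shared blocking queue. The consumer must be thread-safe.
    void forEachEntry(Consumer<String> consumer) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<?>> running = new ArrayList<>();
            for (List<String> partition : partitions) {
                running.add(pool.submit(() -> partition.forEach(consumer)));
            }
            for (Future<?> task : running) {
                task.get(); // wait for completion and surface consumer failures
            }
        } finally {
            pool.shutdown();
        }
    }
}

class ParallelConsumerSketch {
    public static void main(String[] args) throws Exception {
        PartitionedStore store = new PartitionedStore(Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f")));
        AtomicInteger processed = new AtomicInteger();
        store.forEachEntry(entry -> processed.incrementAndGet());
        System.out.println(processed.get()); // prints 6
    }
}
```

With an Iterator, the same entries would have to be pulled back to the single thread calling next() before any processing could start; with the consumer, the backend is free to choose where and how concurrently the processing runs.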