[hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Mon Mar 4 13:30:10 EST 2013

Found an example; this is all the code needed to have a MassIndexer working
on top of Infinispan's Map/Reduce:

https://github.com/infinispan/infinispan/blob/master/query/src/main/java/org/infinispan/query/impl/massindex/IndexingMapper.java#L40

Note its initialize method, which injects the needed components; the
implementation is serialized across nodes.
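
To illustrate the "serialize the task, inject the heavy components on arrival"
idea, here is a minimal sketch. It does not use the Infinispan API: the names
NodeLocalIndexer, IndexingTask and TaskShipping are hypothetical, and plain
Java serialization stands in for whatever marshalling the grid really uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node-local component (think SearchFactory).
final class NodeLocalIndexer {
    final String nodeName;
    final List<String> indexed = new ArrayList<>();

    NodeLocalIndexer(String nodeName) { this.nodeName = nodeName; }

    void index(String key, String value) {
        indexed.add(nodeName + ":" + key + "=" + value);
    }
}

// Hypothetical task: only its small state travels; the indexer is transient.
final class IndexingTask implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String table;                  // serialized with the task
    private transient NodeLocalIndexer indexer;  // never serialized

    IndexingTask(String table) { this.table = table; }

    // Called on the receiving node to inject the local component.
    void initialize(NodeLocalIndexer local) { this.indexer = local; }

    boolean isInitialized() { return indexer != null; }

    // Filters on the table prefix, then hands the entry to the local indexer.
    void map(String key, String value) {
        if (indexer == null) {
            throw new IllegalStateException("initialize() was not called");
        }
        if (key.startsWith(table + "/")) {
            indexer.index(key, value);
        }
    }
}

final class TaskShipping {
    // Simulates sending the task to another node via a serialization round trip.
    static IndexingTask shipToNode(IndexingTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (IndexingTask) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not ship task", e);
        }
    }
}
```

The copy that arrives has no indexer until initialize() is called on the
receiving side, which is the point: the task stays cheap to ship.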

Sanne

On 4 March 2013 18:26, Sanne Grinovero <sanne at hibernate.org> wrote:
> We finished this discussion on IRC, in case someone else was interested:
>
> <sanne> hum I forgot the first step.. transformation from entry into entity
> <sanne> updated
> <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
> head against, but let's assume he finds a workaround and we focus on
> the pattern as first step?
> <emmanuel> https://gist.github.com/emmanuelbernard/5084039
> <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
> <emmanuel> assuming pushToExecutor pushes to whatever concurrent work
> mechanism you planned to use on consumes
> <emmanuel> Plus I am not following exactly how you plan consumes(Entry)
> to be executed concurrently
> <emmanuel> is that the GridDialect responsibility?
> <emmanuel> That looks like a lot of work on the dialect's side
> <sanne> emmanuel, imagine the backend is Infinispan and has some large
> amount of data per node, plus that each node has its own backend
> IndexManager (like an ideal sharding)
> <emmanuel> ie pool mgt and cap +  queuing
> <sanne> then with your approach the iterator needs to fetch data from
> all remote nodes, and then enqueue in a local blocking queue which is
> returning the data to the original owners
> <sanne> but if you skip that step, you can just forward the stateless
> consumer to each node and have it run on data locality
> <emmanuel> I was thinking that if you had the Lucene index locally on
> each node you would have a different impl of the MassIndexer anyway
> <emmanuel> that would simply send a command to each local node
> <sanne> To answer your question: that would be an optional GridDialect
> responsibility. I would endorse a trivial first draft doing a
> single-threaded loop.
> <emmanuel> and have GridDialect.getDataFor() return local data
> <sanne> The "consumes" implementation can simply be implemented with a
> plain iterator - as in your design - so I don't think it pushes much
> complexity to the GridDialect implementor?
> <sanne> The benefit of the consumer is that *optionally* it can be
> mapped on the Map phase, and that's trivial if your backend supports
> Map/Reduce
> <emmanuel> sanne: I don't follow that, sorry
> <emmanuel> how does that make it mappable to the Map phase?
> <sanne> "public void consume(Entry e) " is a degenerate (simplified)
> form of map.
> <sanne> mm infinispan IDE crashes at the right moment.
> <emmanuel> I thought Map was about *filtering*
> <emmanuel> not processing
> <sanne> you can decide to accept 100% of values (without filtering),
> but actually you might want to filter on the specified tables only.
> <sanne> also, the return type doesn't have to match the input type:
> hence you define a transformation function, which is inherently
> applied in parallel on all matching entries.
> <emmanuel> sanne: but then you require the OGM code to be everywhere
> (ie on each node of the target NoSQL)
> <emmanuel> to be able to do tuple -> entity
> <emmanuel> that's not realistic
> <emmanuel> assuming your transform phase is about tuple -> entity and
> some HSearch ops
> <sanne> yes right
> <sanne> but isn't it worth it? it's optional and much more efficient,
> as you avoid transferring any data.
> <sanne> btw we often assume all nodes in the grid are equally
> configured, so having same apps & libraries deployed.
> <emmanuel> sanne: let me try and summarize what I understand
> <emmanuel> it's more efficient if you store the Lucene index locally
> with the data, and if the grid is written in Java or at least can run
> code in Java including libraries and if you distribute the OGM
> configuration across the whole grid
> <emmanuel> Otherwise, it does not make any difference
> <emmanuel> Also the GridDialect implementation needs to know if you are
> doing this trick to only return local data
> <sanne> no there are other drawbacks which get defeated, but minor so
> I didn't mention them
> <emmanuel> am I right?
> <sanne> mainly, you skip the need for the contention point as there
> is no push to a shared blocking queue
> <sanne> no the GridDialect doesn't need to know.
> <emmanuel> sanne: sure if you can process the code on each node you
> avoid the shared blocking queue, at least until you reach the
> IndexManager
> <sanne> you'll just forward a simple (standard) M/R task, and it will
> need to execute it as always.
> <sanne> the IndexManager is parallel ;)
> <emmanuel> sanne: parallel on a single node
> <sanne> yes, but no contention points other than the internal
> structure of the IW
> <emmanuel> I mean updating the index for a given table is better done
> on a single node
> <sanne> IndexWriter
> <emmanuel> sorry I meant IndexWriter
> <emmanuel> ah but you mention perfect sharding
> <emmanuel> you need cosmological alignment for this shit to happen
> <sanne> not if we plan for it :)
> <sanne> you might remember the changes to Segments in the ISPN code,
> to accommodate index storage consistent with the data locality
> <sanne> that's expected in 6.0
> <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
> <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
> <emmanuel> as you need to send the Consumer impl
> <emmanuel> not simply use it
> <sanne> hu, it needs a reference to the current SearchFactory at very least
> <emmanuel> sanne: but you're telling me you send the M/R task
> <emmanuel> so you need to send the M/R code as well
> <sanne> yes but here we enter Infinispan-specific implementation
> <sanne> I would register the needed components in Infinispan and use
> the ServiceRegistry to look them up remotely
> <sanne> not to mention Infinispan could accommodate a custom command for it
> <emmanuel> What I am saying is that you don't pass the Consumer
> *instance* to the grid dialect but rather the impl, no?
> <sanne> the impl class definition?
> <emmanuel> sanne: you tell me. How do I send M/R code today?
> <emmanuel> certainly not an impl instance
> <sanne> yes you do
> <sanne> JBMar will take care of it, including state.
> <sanne> but in this case that would be wrong of course as I don't want
> to serialize the whole SearchFactory so I'd use injection and lookup,
> but that's a detail of Infinispan.
> <sanne> But this shouldn't be MassIndexer specific right? it's good to
> expose a general "execute on all" method, and I think accepting
> instances would make life easier for most - even though we might need
> to document some limitations.
> <emmanuel> alright, I guess I'll have to live with a visitor pattern
> for a feature that has 5% chance of happening :)
> <sanne> I'm going to punch Davide
> <sanne> as he's yelling "it's not a visitor" but doesn't have the guts
> to write it down :)
> <emmanuel> sanne: DavideD would have nothing to do with it; that
> requires a lot of config and Infinispan machinery I'm not sure is here
> today
> <DavideD> :)
> <emmanuel> ah
> <emmanuel> I don't care how it's called, it's one of those patterns
> that make the code harder to follow
> <DavideD> I was actually trying to remember the name of the pattern
> <sanne> ok now we agree :)
> <emmanuel> Obfuscator pattern family
> <sanne> very popular among consultants, I don't understand why you complain :P
> <sanne> Anyway, let's wrap up and broaden the horizon:
> <emmanuel> ok so we are left with finding how to load an entity from a tuple
> <sanne> you don't think it's useful as a general purpose method?
> <emmanuel> sanne: will be for queries
> <emmanuel> It's just that it's non obvious
> <sanne> Exactly. Also I think lambda methods are getting widely better known.
> <emmanuel> syntactically yes
> <emmanuel> VM wise, perf improvements will come later
> <sanne> what I mean is that by defining the SPI this way, I don't
> expect it to be more complex for the GridDialect implementors, while
> we can reuse it for a wider scope of needs.
>
>  --Sanne
>
> On 4 March 2013 17:02, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>>
>>
>> On 4 mars 2013, at 17:39, Sanne Grinovero <sanne at hibernate.org> wrote:
>>
>>> On 4 March 2013 16:20, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
>>>> I already gave what I knew on how to load an entity from a tuple (which
>>>> isn't much) but we can try and dig together. Something I thought about
>>>> is that ORM probably has a mechanism to load an entity from a resultset
>>>> via the query parser. And that probably looks also like the second half
>>>> of OgmLoader.load. We could look at this part and see if we can make an
>>>> OGM version of it. We never had the need before as we never had query
>>>> support (the way SQL does it).
>>>
>>> I would also need to study the ORM code, but to add a high level observation,
>>> the methods currently defined by the GridDialect are focusing on
>>> loading from well known key instances,
>>> there is nothing that enables us to scan/inspect all values.
>>>
>>> In other words: even if we wanted to load keys first, we don't have definitions
>>> of functions from raw->primary key instances either.
>>
>> I understand that. I'm not denying the need for the method.
>>
>>>
>>>
>>>> On the visitor vs Iterator approach, I still don't see how implementing
>>>> an Iterator on a map / reduce backend would be harder than the visitor
>>>> but maybe I'm missing something.
>>>>
>>>>    class IteratorAsStream {
>>>>        final Query someMapReduceQuery = ...;
>>>>
>>>>        public Object next() {
>>>>            if (!someMapReduceQuery.started()) {
>>>>                // execute and collect results in parallel
>>>>                someMapReduceQuery.execute();
>>>>            }
>>>>            Object result = someMapReduceQuery.getNextOrBlock();
>>>>            return result;
>>>>        }
>>>>    }
>>>
>>> That could work to *load* all entities in parallel, but I'd like to
>>> process the entities in parallel as well.
>>> And I'd rather not force the GridDialect implementors to write some
>>> Hibernate Search specific code,
>>> so to break out we need some form of "Execute X on each": a closure or a lambda.
>>>
>>
>> I can't see how the visitor model helps in your processing of entities in parallel. To me both approaches are strictly equivalent. Care to show some pseudo-code?
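
For readers following the "Execute X on each" argument, here is a rough
sketch of the consumer-style SPI discussed in this thread. It is not OGM's
GridDialect API: EntryConsumer, EntryScanner, SimpleDialect and
PartitionedDialect are hypothetical names, and in-memory maps stand in for the
datastore. The first implementation is the trivial single-threaded loop; the
second runs the same consumer once per partition, the way a grid could run it
next to the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical callback: the "X" in "execute X on each".
interface EntryConsumer {
    void consume(String key, String tuple);
}

// Hypothetical SPI: the caller never sees an iterator, only hands in a consumer.
interface EntryScanner {
    void forEachEntry(EntryConsumer consumer);
}

// Trivial first draft: a single-threaded loop over all entries.
final class SimpleDialect implements EntryScanner {
    private final Map<String, String> data;

    SimpleDialect(Map<String, String> data) { this.data = data; }

    @Override
    public void forEachEntry(EntryConsumer consumer) {
        for (Map.Entry<String, String> e : data.entrySet()) {
            consumer.consume(e.getKey(), e.getValue());
        }
    }
}

// Optional optimisation: one worker per partition, so the consumer must be
// thread-safe. Each partition plays the role of one node's local data.
final class PartitionedDialect implements EntryScanner {
    private final List<Map<String, String>> partitions;

    PartitionedDialect(List<Map<String, String>> partitions) {
        this.partitions = partitions;
    }

    @Override
    public void forEachEntry(EntryConsumer consumer) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map<String, String> partition : partitions) {
                pending.add(pool.submit(() -> partition.forEach(consumer::consume)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface worker failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The calling code is identical for both implementations, which is the case
being made for the consumer form: the processing step moves to wherever the
dialect chooses to run it, with no shared queue in between.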