Re: [hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Tuesday, 5 March 2013

We might hope for a stable enough contract on Hibernate Search and
hope that we won't break serializability between micro or minor
versions. That will need to be taken into account in the test suite and
design.
On the OGM side though, we are not at that level of maturity and we will
enforce a homogeneous Hibernate OGM version across the whole cluster. The grid
will have to go down for upgrades, or enforce that no map reduce job
using OGM runs while the version rollout is in progress.

Emmanuel

On Mon 2013-03-04 18:30, Sanne Grinovero wrote:
...
 Found an example, this is all the code it needs to have a MassIndexer working
 on top of Infinispan's Map/Reduce:

https://github.com/infinispan/infinispan/blob/master/query/src/main/java/...

 Note its initialize method, which injects the needed components; the
 implementation is serialized across nodes.
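
 A minimal sketch of that "serialize the task, inject on arrival" idea, in
 plain Java. This is not the Infinispan code linked above: every name here
 (LocalIndexer, NodeRegistry, IndexingTask, shipTo) is hypothetical, and plain
 Java serialization stands in for whatever marshalling the grid really uses.
 The heavy component is a transient field, so it never travels with the task
 and is looked up again on the receiving node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

final class InjectionSketch {

    /** Hypothetical stand-in for a heavy, node-local component (e.g. a search factory). */
    static final class LocalIndexer {
        int indexed;

        void index(String value) {
            indexed++;
        }
    }

    /** Hypothetical per-node lookup, playing the role of a service registry. */
    static final class NodeRegistry {
        private final Map<Class<?>, Object> components = new HashMap<>();

        <T> void register(Class<T> type, T instance) {
            components.put(type, instance);
        }

        <T> T lookup(Class<T> type) {
            return type.cast(components.get(type));
        }
    }

    /** The task that travels between nodes: small state only, no heavy references. */
    static final class IndexingTask implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String tableName;          // serialized with the task
        private transient LocalIndexer indexer;  // skipped by serialization, injected on arrival

        IndexingTask(String tableName) {
            this.tableName = tableName;
        }

        /** Called on the receiving node before any entry is consumed. */
        void initialize(NodeRegistry registry) {
            this.indexer = registry.lookup(LocalIndexer.class);
        }

        void consume(String value) {
            if (indexer == null) {
                throw new IllegalStateException("initialize() was not called on this node");
            }
            indexer.index(tableName + ":" + value);
        }
    }

    /** Simulates shipping the task to another node: serialize, deserialize, then inject. */
    static IndexingTask shipTo(NodeRegistry remote, IndexingTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                IndexingTask copy = (IndexingTask) in.readObject();
                copy.initialize(remote);
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not ship task", e);
        }
    }
}
```

 The copy returned by shipTo works against the registry of the node it landed
 on, which is the property that lets the same task run next to the data.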

 Sanne

 On 4 March 2013 18:26, Sanne Grinovero <sanne(a)hibernate.org> wrote:
 > We finished this discussion on IRC, in case someone else was interested:
 >
 > <sanne> hum I forgot the first step.. transformation from entry into entity
 > <sanne> updated
 > <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
 > head against, but let's assume he finds a workaround and we focus on
 > the pattern as first step?
 > <emmanuel> https://gist.github.com/emmanuelbernard/5084039
 > <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the
 > tuple
 > <emmanuel> assuming pushToExecutor pushes to whatever concurrent work
 > mechanism you planned to use on consumes
 > <emmanuel> Plus I am not following exactly how you plan consumes(Entry)
 > to be executed concurrently
 > <emmanuel> is that the GridDialect responsibility?
 > <emmanuel> That looks like a lot of work on the dialect's side
 > <sanne> emmanuel, imagine the backend is Infinispan and has some large
 > amount of data per node, plus that each node has its own backend
 > IndexManager (like an ideal sharding)
 > <emmanuel> ie pool mgt and cap +  queuing
 > <sanne> then with your approach the iterator needs to fetch data from
 > all remote nodes, and then enqueue in a local blocking queue which is
 > returning the data to the original owners
 > <sanne> but if you skip that step, you can just forward the stateless
 > consumer to each node and have it run on data locality
 > <emmanuel> I was thinking that if you had the Lucene index locally on
 > each node you would have a different impl of the MassIndexer anyways
 > <emmanuel> that would simply send a command to each local node
 > <sanne> To answer your question: that would be an optional GridDialect
 > responsibility. I would endorse a trivial first draft doing a
 > single-threaded loop.
 > <emmanuel> and have GridDialect.getDataFor() return local data
 > <sanne> The "consumes" implementation can be either implemented with a
 > simple iterator - as in your design - so I don't think it pushes much
 > complexity to the GridDialect implementor?
 > <sanne> The benefit of the consumer is that *optionally* it can be
 > mapped on the Map phase, and that's trivial if your backend supports
 > Map/Reduce
 > <emmanuel> sanne: I don't follow that sorry
 > <emmanuel> how does that make it mappable to the Map phase?
 > <sanne> "public void consume(Entry e) " is a degenerate (simplified)
 > form of map.
 > <sanne> mm infinispan IDE crashes at the right moment.
 > <emmanuel> I thought Map was about *filtering*
 > <emmanuel> not processing
 > <sanne> you can decide to accept 100% of values (without filtering),
 > but actually you might want to filter on the specified tables only.
 > <sanne> also, the return type doesn't have to match the input type:
 > hence you define a transformation function, which is inherently
 > applied in parallel on all matching entries.
 > <emmanuel> sanne: but then you require the OGM code to be everywhere
 > (ie on each node of the target NoSQL)
 > <emmanuel> to be able to do tuple -> entity
 > <emmanuel> that's not realistic
 > <emmanuel> assuming your transform phase is about tuple -> entity and
 > some HSearch ops
 > <sanne> yes right
 > <sanne> but isn't it worth it? it's optional and much more efficient,
 > as you avoid transferring any data.
 > <sanne> btw we often assume all nodes in the grid are equally
 > configured, so having same apps & libraries deployed.
 > <emmanuel> sanne: let me try and summarize what I understand
 > <emmanuel> it's more efficient if you store the Lucene index locally
 > with the data, and if the grid is written in Java or at least can run
 > code in Java including libraries and if you distribute the OGM
 > configuration across the whole grid
 > <emmanuel> Otherwise, it does not make any difference
 > <emmanuel> Also the GridDialect implementation needs to know if you are
 > doing this trick to only return local data
 > <sanne> no there are other drawbacks which get defeated, but minor so
 > I didn't mention them
 > <emmanuel> am I right?
 > <sanne> mainly, you skip the need for the contention point as there
 > is no push to a shared blocking queue
 > <sanne> no the GridDialect doesn't need to know.
 > <emmanuel> sanne: sure if you can process the code on each node you
 > avoid the shared blocking queue, at least until you reach the
 > IndexManager
 > <sanne> you'll just forward a simple (standard) M/R task, and it will
 > need to execute it as always.
 > <sanne> the IndexManager is parallel ;)
 > <emmanuel> sanne: parallel on a single node
 > <sanne> yes, but no contention points other than the internal
 > structure of the IW
 > <emmanuel> I mean updating the index for a given table is better done
 > on a single node
 > <sanne> IndexWriter
 > <emmanuel> sorry I meant IndexWriter
 > <emmanuel> ah but you mention perfect sharding
 > <emmanuel> you need cosmological alignment for this shit to happen
 > <sanne> not if we plan for it :)
 > <sanne> you might remember the changes to Segments in the ISPN code,
 > to accommodate index storage consistent with the data locality
 > <sanne> that's expected in 6.0
 > <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
 > <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
 > <emmanuel> as you need to send the Consumer impl
 > <emmanuel> not simply use it
 > <sanne> hu, it needs a reference to the current SearchFactory at very least
 > <emmanuel> sanne: but you're telling me you send the M/R task
 > <emmanuel> so you need to send the M/R code as well
 > <sanne> yes but here we enter Infinispan specific implementation
 > <sanne> I would register the needed components in Infinispan and use
 > the ServiceRegistry to look them up remotely
 > <sanne> not to mention Infinispan could accommodate a custom command for it
 > <emmanuel> What I am saying is that you don't pass the Consumer
 > *instance* to the grid dialect but rather the impl, no?
 > <sanne> the impl class definition?
 > <emmanuel> sanne: you tell me. How do I send M/R code today?
 > <emmanuel> certainly not an impl instance
 > <sanne> yes you do
 > <sanne> JBMar will take care of it, including state.
 > <sanne> but in this case that would be wrong of course as I don't want
 > to serialize the whole SearchFactory so I'd use injection and lookup,
 > but that's a detail of Infinispan.
 > <sanne> But this shouldn't be MassIndexer specific right? it's good to
 > expose a general "execute on all" method, and I think accepting
 > instances would make life easier for most - even though we might need
 > to document some limitations.
 > <emmanuel> alright, I guess I'll have to live with a visitor pattern
 > for a feature that has 5% chance of happening :)
 > <sanne> I'm going to punch Davide
 > <sanne> as he's yelling "it's not a visitor" but doesn't have the guts
 > to write it down :)
 > <emmanuel> sanne: DavideD's would have nothing to do about it, that
 > requires a lot of config and Infinispan machinery I'm not sure is here
 > today
 > <DavideD> :)
 > <emmanuel> ah
 > <emmanuel> I don't care how it's called, it's one of those patterns
 > that make the code harder to follow
 > <DavideD> I was actually trying to remember the name of the pattern
 > <sanne> ok now we agree :)
 > <emmanuel> Obfuscator pattern family
 > <sanne> very popular among consultants, I don't understand why you complain :P
 > <sanne> Anyway, let's wrap up and broaden the horizon:
 > <emmanuel> ok so we are left with finding how to load an entity from a tuple
 > <sanne> you don't think it's useful as a general purpose method?
 > <emmanuel> sanne: will be for queries
 > <emmanuel> It's just that it's non obvious
 > <sanne> Exactly. Also I think lambda methods are getting widely better known.
 > <emmanuel> syntactically yes
 > <emmanuel> VM wise, perf improvements will come later
 > <sanne> what I mean is that by defining the SPI this way, I don't
 > expect it to be more complex for the GridDialect implementors, while
 > we can reuse it for a wider scope of needs.
 >
 >  --Sanne
 >
 > On 4 March 2013 17:02, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
 >>
 >>
 >> On 4 March 2013, at 17:39, Sanne Grinovero <sanne(a)hibernate.org> wrote:
 >>
 >>> On 4 March 2013 16:20, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
 >>>> I already gave what I knew on how to load an entity from a tuple (which
 >>>> isn't much) but we can try and dig together. Something I thought about
 >>>> is that ORM probably has a mechanism to load an entity from a resultset
 >>>> via the query parser. And that probably looks also like the second half
 >>>> of OgmLoader.load. We could look at this part and see if we can make an
 >>>> OGM version of it. We never had the need before as we never had query
 >>>> support (the way SQL does it).
 >>>
 >>> I would also need to study the ORM code, but to add a high level observation,
 >>> the methods currently defined by the GridDialect are focusing on
 >>> loading from well known key instances,
 >>> there is nothing that lets us scan/inspect all values.
 >>>
 >>> In other words: even if we wanted to load keys first, we don't have definitions
 >>> of functions from raw->primary key instances either.
 >>
 >> I understand that. I'm not denying the need for the method.
 >>
 >>>
 >>>
 >>>> On the visitor vs Iterator approach, I still don't see how implementing
 >>>> an Iterator on a map / reduce backend would be harder than the visitor
 >>>> but maybe I'm missing something.
 >>>>
 >>>>    class IteratorAsStream {
 >>>>        final Query someMapReduceQuery = ...;
 >>>>
 >>>>        public Object next() {
 >>>>            if (!someMapReduceQuery.started()) {
 >>>>                // execute and collect results in parallel
 >>>>                someMapReduceQuery.execute();
 >>>>            }
 >>>>            // blocks until the next result is available
 >>>>            Object result = someMapReduceQuery.getNextOrBlock();
 >>>>            return result;
 >>>>        }
 >>>>    }
 >>>
 >>> That could work to *load* all entities in parallel, but I'd like to
 >>> process the entities in parallel as well.
 >>> And I'd rather not force the GridDialect implementors to write some
 >>> Hibernate Search specific code,
 >>> so to break out we need some form of "Execute X on each": a closure or a lambda.
 >>>
 >>
 >> I can't see how the visitor model helps in your processing of entities in
 >> parallel. To me both approaches are strictly equivalent. Care to show some pseudo-code?
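
The two styles debated in this thread can be sketched side by side in plain
Java. This is an illustration only, not OGM or Infinispan code: ScanSketch,
TupleConsumer, FakeGridDialect, iterate and forEachTuple are all hypothetical
names, and each inner list merely plays the role of one node's local data.
With the iterator, every tuple is first gathered for a single caller; with the
consumer, the dialect is free to run the callback wherever the data lives
(here, one thread per partition).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ScanSketch {

    /** Hypothetical callback, standing in for the "consume(Entry e)" idea from the chat. */
    interface TupleConsumer {
        void consume(String tuple);
    }

    /** Hypothetical dialect: each inner list plays the role of one node's local data. */
    static final class FakeGridDialect {
        private final List<List<String>> partitions;

        FakeGridDialect(List<List<String>> partitions) {
            this.partitions = partitions;
        }

        /** Iterator style: every tuple is first gathered for the single caller. */
        Iterator<String> iterate() {
            List<String> all = new ArrayList<>();
            for (List<String> partition : partitions) {
                all.addAll(partition);
            }
            return all.iterator();
        }

        /** Consumer style: the dialect chooses where the callback runs (one thread per partition). */
        void forEachTuple(TupleConsumer consumer) {
            ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
            for (List<String> partition : partitions) {
                pool.execute(() -> {
                    for (String tuple : partition) {
                        consumer.consume(tuple);
                    }
                });
            }
            pool.shutdown();
            try {
                if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("scan did not finish in time");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }
}
```

A caller of forEachTuple has to pass a thread-safe consumer, since the
callback may run concurrently. A trivial dialect could still implement
forEachTuple as a single-threaded loop over iterate().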
