[hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Wed Mar 6 13:18:21 EST 2013

I've successfully implemented OGM-151 for EntityKey which is the one we
need to move OGM-273 forward for now.
I am trying to implement it for AssociationKey but caching here is
significantly harder as data is cross-referenced across associations.

Sanne, when you worked on the profiling of OGM, do you remember
AssociationKey putting pressure on build time or memory? Because
caching them per persister means some rather complex race conditions and
more memory used permanently (as opposed to on demand).

So I'm wondering if that's worth it. As an intermediary step, I could
introduce AssociationKeyMetadata but build it on-demand - that one is
easier to achieve.
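
A minimal sketch of the on-demand option, assuming the key is split into a part shared by every row of a table (the metadata) and a per-row part (the values). The names KeyMetadataSketch, KeySketch and OnDemandKeys are hypothetical and are not the OGM classes; the point is only that building the metadata at the point of use leaves no shared state to guard, at the cost of one extra allocation per key.

```java
import java.util.Arrays;

// Hypothetical: the part of a key that is the same for every row of one table.
final class KeyMetadataSketch {
    final String table;
    final String[] columnNames;

    KeyMetadataSketch(String table, String[] columnNames) {
        this.table = table;
        this.columnNames = columnNames.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeyMetadataSketch)) return false;
        KeyMetadataSketch other = (KeyMetadataSketch) o;
        return table.equals(other.table) && Arrays.equals(columnNames, other.columnNames);
    }

    @Override
    public int hashCode() {
        return 31 * table.hashCode() + Arrays.hashCode(columnNames);
    }
}

// Hypothetical: the per-row part, which only adds the column values.
final class KeySketch {
    final KeyMetadataSketch metadata;
    final Object[] columnValues;

    KeySketch(KeyMetadataSketch metadata, Object[] columnValues) {
        this.metadata = metadata;
        this.columnValues = columnValues.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeySketch)) return false;
        KeySketch other = (KeySketch) o;
        return metadata.equals(other.metadata) && Arrays.equals(columnValues, other.columnValues);
    }

    @Override
    public int hashCode() {
        return 31 * metadata.hashCode() + Arrays.hashCode(columnValues);
    }
}

final class OnDemandKeys {
    // Metadata is rebuilt on every call: nothing is retained between calls,
    // so there is no cache to populate or synchronize.
    static KeySketch associationKey(String table, String[] columnNames, Object[] columnValues) {
        KeyMetadataSketch metadata = new KeyMetadataSketch(table, columnNames);
        return new KeySketch(metadata, columnValues);
    }
}
```

Caching per persister would instead keep one metadata instance alive for the lifetime of the persister and share it across threads, which is where the race conditions and permanent memory use come from.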

Emmanuel

On Wed 2013-03-06 15:32, Davide D'Alto wrote:
> it's ok for me
> 
> Davide
> 
> On Wed, Mar 6, 2013 at 3:28 PM, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> > I'm planning on working on OGM-151. Fine with everyone?
> > That will likely be my last before I move back to BVAL and close the
> > final issues there.
> >
> > Emmanuel
> >
> > On Tue 2013-03-05 19:04, Sanne Grinovero wrote:
> >> Nice!
> >> n+1 is something Hibernate Search has to deal with too, that's why I
> >> was interested in the fetch profiles and graph loading in JPA 2.1
> >>
> >> On 5 March 2013 17:44, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> >> > I have implemented a solution that gives an entity based on a tuple.
> >> > https://hibernate.onjira.com/browse/OGM-273#comment-50082
> >> >
> >> > Note that it does not currently work for MongoDB, but that's waiting
> >> > for the dedicated GridDialect method as well as OGM-151.
> >> > Also note that I have no idea how that will work for associations. I
> >> > suspect some nasty n+1 is happening at best. Worst case, an exception :)
> >> >
> >> > Emmanuel
> >> >
> >> > On Tue 2013-03-05 10:30, Emmanuel Bernard wrote:
> >> >> We might hope for a stable enough contract on Hibernate Search and
> >> >> hope that we won't break serializability between micro or minor
> >> >> versions. That will need to be taken into account in the test suite and
> >> >> design.
> >> >> On the OGM side though, we are not at that level of maturity and we will
> >> >> force a homogeneous Hibernate OGM version across the whole cluster. The grid
> >> >> will have to go down for upgrades or enforce that no map/reduce job
> >> >> using OGM runs while the version rollout is in progress.
> >> >>
> >> >> Emmanuel
> >> >>
> >> >> On Mon 2013-03-04 18:30, Sanne Grinovero wrote:
> >> >> > Found an example, this is all the code it needs to have a MassIndexer working
> >> >> > on top of Infinispan's Map/Reduce:
> >> >> >
> >> >> > https://github.com/infinispan/infinispan/blob/master/query/src/main/java/org/infinispan/query/impl/massindex/IndexingMapper.java#L40
> >> >> >
> >> >> > Note its initialize method, which injects the needed components; the
> >> >> > implementation is serialized across nodes.
> >> >> >
> >> >> > Sanne
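
A minimal sketch of the pattern described above, under stated assumptions: a task object is serialized to another node, and the heavy node-local components are injected there through an initialize method instead of travelling with the task. IndexingTaskSketch and LocalIndexer are hypothetical names, not the Infinispan classes, and plain java.io serialization stands in for whatever marshalling the grid really uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical node-local service; stands in for something too heavy to serialize.
final class LocalIndexer {
    final List<String> indexed = new ArrayList<>();

    void index(String value) {
        indexed.add(value);
    }
}

// Hypothetical task shipped to each node. Only its serializable state travels.
final class IndexingTaskSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String table;              // small state, serialized with the task
    private transient LocalIndexer indexer;  // node-local, never serialized

    IndexingTaskSketch(String table) {
        this.table = table;
    }

    // Called on the receiving node before any entry is processed.
    void initialize(LocalIndexer localIndexer) {
        this.indexer = localIndexer;
    }

    void map(String key, String value) {
        if (indexer == null) {
            throw new IllegalStateException("initialize() was not called on this node");
        }
        indexer.index(table + ":" + key + "=" + value);
    }

    // Simulates sending the task to another node with plain Java serialization.
    static IndexingTaskSketch shipToOtherNode(IndexingTaskSketch task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (IndexingTaskSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not copy the task", e);
        }
    }
}
```

After the round trip the transient field is null, so the receiving node has to call initialize with its own indexer before map can run.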
> >> >> >
> >> >> > On 4 March 2013 18:26, Sanne Grinovero <sanne at hibernate.org> wrote:
> >> >> > > We finished this discussion on IRC, in case someone else was interested:
> >> >> > >
> >> >> > > <sanne> hum I forgot the first step.. transformation from entry into entity
> >> >> > > <sanne> updated
> >> >> > > <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
> >> >> > > head against, but let's assume he finds a workaround and we focus on
> >> >> > > the pattern as first step?
> >> >> > > <emmanuel> https://gist.github.com/emmanuelbernard/5084039
> >> >> > > <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
> >> >> > > <emmanuel> assuming pushToExecutor pushes to whatever concurrent work
> >> >> > > mechanism you planned to use on consumes
> >> >> > > <emmanuel> Plus I am not following exactly how you plan consumes(Entry)
> >> >> > > to be executed concurrently
> >> >> > > <emmanuel> is that the GridDialect responsibility?
> >> >> > > <emmanuel> That looks like a lot of work on the dialect's side
> >> >> > > <sanne> emmanuel, imagine the backend is Infinispan and has some large
> >> >> > > amount of data per node, plus that each node has its own backend
> >> >> > > IndexManager (like an ideal sharding)
> >> >> > > <emmanuel> ie pool mgt and cap +  queuing
> >> >> > > <sanne> then with your approach the iterator needs to fetch data from
> >> >> > > all remote nodes, and then enqueue in a local blocking queue which is
> >> >> > > returning the data to the original owners
> >> >> > > <sanne> but if you skip that step, you can just forward the stateless
> >> >> > > consumer to each node and have it run on data locality
> >> >> > > <emmanuel> I was thinking that if you had the Lucene index locally on
> >> >> > > each node you would have a different impl of the MassIndexer anyways
> >> >> > > <emmanuel> that would simply send a command to each local node
> >> >> > > <sanne> To answer your question: that would be an optional GridDialect
> >> >> > > responsibility. I would endorse a trivial first draft doing a
> >> >> > > single-threaded loop.
> >> >> > > <emmanuel> and have GridDialect.getDataFor() return local data
> >> >> > > <sanne> The "consumes" implementation can be either implemented with a
> >> >> > > simple iterator - as in your design - so I don't think it pushes much
> >> >> > > complexity to the GridDialect implementor?
> >> >> > > <sanne> The benefit of the consumer is that *optionally* it can be
> >> >> > > mapped on the Map phase, and that's trivial if your backend supports
> >> >> > > Map/Reduce
> >> >> > > <emmanuel> sanne: I don't follow that sorry
> >> >> > > <emmanuel> how does that make it mappable to the Map phase?
> >> >> > > <sanne> "public void consume(Entry e) " is a degenerate (simplified)
> >> >> > > form of map.
> >> >> > > <sanne> mm infinispan IDE crashes at the right moment.
> >> >> > > <emmanuel> I thought Map was about *filtering*
> >> >> > > <emmanuel> not processing
> >> >> > > <sanne> you can decide to accept 100% of values (without filtering),
> >> >> > > but actually you might want to filter on the specified tables only.
> >> >> > > <sanne> also, the return type doesn't have to match the input type:
> >> >> > > hence you define a transformation function, which is inherently
> >> >> > > applied in parallel on all matching entries.
> >> >> > > <emmanuel> sanne: but then you require the OGM code to be everywhere
> >> >> > > (ie on each node of the target NoSQL)
> >> >> > > <emmanuel> to be able to do tuple -> entity
> >> >> > > <emmanuel> that's not realistic
> >> >> > > <emmanuel> assuming your transform phase is about tuple -> entity and
> >> >> > > some HSearch ops
> >> >> > > <sanne> yes right
> >> >> > > <sanne> but isn't it worth it? it's optional and much more efficient,
> >> >> > > as you avoid transferring any data.
> >> >> > > <sanne> btw we often assume all nodes in the grid are equally
> >> >> > > configured, so having same apps & libraries deployed.
> >> >> > > <emmanuel> sanne: let me try and summarize what I understand
> >> >> > > <emmanuel> it's more efficient if you store the Lucene index locally
> >> >> > > with the data, and if the grid is written in Java or at least can run
> >> >> > > code in Java including libraries and if you distribute the OGM
> >> >> > > configuration across the whole grid
> >> >> > > <emmanuel> Otherwise, it does not make any difference
> >> >> > > <emmanuel> Also the GridDialect implementation needs to know if you are
> >> >> > > doing this trick to only return local data
> >> >> > > <sanne> no there are other drawbacks which get defeated, but minor so
> >> >> > > I didn't mention them
> >> >> > > <emmanuel> am I right?
> >> >> > > <sanne> mainly, you skip the need for the contention point as there
> >> >> > > is no push to a shared blocking queue
> >> >> > > <sanne> no the GridDialect doesn't need to know.
> >> >> > > <emmanuel> sanne: sure if you can process the code on each node you
> >> >> > > avoid the shared blocking queue, at least until you reach the
> >> >> > > IndexManager
> >> >> > > <sanne> you'll just forward a simple (standard) M/R task, and it will
> >> >> > > need to execute it as always.
> >> >> > > <sanne> the IndexManager is parallel ;)
> >> >> > > <emmanuel> sanne: parallel on a single node
> >> >> > > <sanne> yes, but no contention points other than the internal
> >> >> > > structure of the IW
> >> >> > > <emmanuel> I mean updating the index for a given table is better done
> >> >> > > on a single node
> >> >> > > <sanne> IndexWriter
> >> >> > > <emmanuel> sorry I meant IndexWriter
> >> >> > > <emmanuel> ah but you mention perfect sharding
> >> >> > > <emmanuel> you need cosmological alignment for this shit to happen
> >> >> > > <sanne> not if we plan for it :)
> >> >> > > <sanne> you might remember the changes to Segments in the ISPN code,
> >> >> > > to accommodate index storage consistent with the data locality
> >> >> > > <sanne> that's expected in 6.0
> >> >> > > <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
> >> >> > > <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
> >> >> > > <emmanuel> as you need to send the Consumer impl
> >> >> > > <emmanuel> not simply use it
> >> >> > > <sanne> hu, it needs a reference to the current SearchFactory at very least
> >> >> > > <emmanuel> sanne: but you're telling me you send the M/R task
> >> >> > > <emmanuel> so you need to send the M/R code as well
> >> >> > > <sanne> yes but here we enter Infinispan specific implementation
> >> >> > > <sanne> I would register the needed components in Infinispan and use
> >> >> > > the ServiceRegistry to look them up remotely
> >> >> > > <sanne> not to mention Infinispan could accommodate a custom command for it
> >> >> > > <emmanuel> What I am saying is that you don't pass the Consumer
> >> >> > > *instance* to the grid dialect but rather the impl, no?
> >> >> > > <sanne> the impl class definition?
> >> >> > > <emmanuel> sanne: you tell me. How do I send M/R code today?
> >> >> > > <emmanuel> certainly not an impl instance
> >> >> > > <sanne> yes you do
> >> >> > > <sanne> JBMar will take care of it, including state.
> >> >> > > <sanne> but in this case that would be wrong of course as I don't want
> >> >> > > to serialize the whole SearchFactory so I'd use injection and lookup,
> >> >> > > but that's a detail of Infinispan.
> >> >> > > <sanne> But this shouldn't be MassIndexer specific right? it's good to
> >> >> > > expose a general "execute on all" method, and I think accepting
> >> >> > > instances would make life easier for most - even though we might need
> >> >> > > to document some limitations.
> >> >> > > <emmanuel> alright, I guess I'll have to live with a visitor pattern
> >> >> > > for a feature that has 5% chance of happening :)
> >> >> > > <sanne> I'm going to punch Davide
> >> >> > > <sanne> as he's yelling "it's not a visitor" but doesn't have the guts
> >> >> > > to write it down :)
> >> >> > > <emmanuel> sanne: DavideD would have nothing to do with it, that
> >> >> > > requires a lot of config and Infinispan machinery I'm not sure is here
> >> >> > > today
> >> >> > > <DavideD> :)
> >> >> > > <emmanuel> ah
> >> >> > > <emmanuel> I don't care how it's called, it's one of those patterns
> >> >> > > that make the code harder to follow
> >> >> > > <DavideD> I was actually trying to remember the name of the pattern
> >> >> > > <sanne> ok now we agree :)
> >> >> > > <emmanuel> Obfuscator pattern family
> >> >> > > <sanne> very popular among consultants, I don't understand why you complain :P
> >> >> > > <sanne> Anyway, let's wrap up and broaden the horizon:
> >> >> > > <emmanuel> ok so we are left with finding how to load an entity from a tuple
> >> >> > > <sanne> you don't think it's useful as a general purpose method?
> >> >> > > <emmanuel> sanne: will be for queries
> >> >> > > <emmanuel> It's just that it's non obvious
> >> >> > > <sanne> Exactly. Also I think lambda methods are getting widely better known.
> >> >> > > <emmanuel> syntactically yes
> >> >> > > <emmanuel> VM wise, perf improvements will come later
> >> >> > > <sanne> what I mean is that by defining the SPI this way, I don't
> >> >> > > expect it to be more complex for the GridDialect implementors, while
> >> >> > > we can reuse it for a wider scope of needs.
> >> >> > >
> >> >> > >  --Sanne
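
A minimal sketch of the consumer-style SPI the chat converges on, assuming a callback with a single consume method and a dialect-side "run this on every tuple of these tables" entry point. TupleConsumer, TupleScanner, forEachTuple and InMemoryScanner are hypothetical names, not the GridDialect API. The implementation is the trivial single-threaded loop endorsed as a first draft; a backend that supports Map/Reduce could implement the same method by shipping the consumer to the data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback; "consume" mirrors the method name used in the chat.
interface TupleConsumer {
    void consume(Map<String, Object> tuple);
}

// Hypothetical slice of a dialect SPI: run the consumer on every tuple of the given tables.
interface TupleScanner {
    void forEachTuple(TupleConsumer consumer, String... tables);
}

// Trivial first draft: a single-threaded loop over in-memory data.
final class InMemoryScanner implements TupleScanner {
    private final Map<String, List<Map<String, Object>>> tuplesByTable = new LinkedHashMap<>();

    void put(String table, Map<String, Object> tuple) {
        tuplesByTable.computeIfAbsent(table, t -> new ArrayList<>()).add(tuple);
    }

    @Override
    public void forEachTuple(TupleConsumer consumer, String... tables) {
        for (String table : tables) {
            List<Map<String, Object>> tuples = tuplesByTable.get(table);
            if (tuples == null) {
                continue; // unknown table: nothing to visit
            }
            for (Map<String, Object> tuple : tuples) {
                consumer.consume(tuple);
            }
        }
    }
}
```

The caller never sees an iterator, so the implementor stays free to choose between a plain loop, a thread pool, or remote execution without changing the contract.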
> >> >> > >
> >> >> > > On 4 March 2013 17:02, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> >> >> > >>
> >> >> > >>
> >> >> > >> On 4 mars 2013, at 17:39, Sanne Grinovero <sanne at hibernate.org> wrote:
> >> >> > >>
> >> >> > >>> On 4 March 2013 16:20, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> >> >> > >>>> I already gave what I knew on how to load an entity from a tuple (which
> >> >> > >>>> isn't much) but we can try and dig together. Something I thought about
> >> >> > >>>> is that ORM probably has a mechanism to load an entity from a resultset
> >> >> > >>>> via the query parser. And that probably looks also like the second half
> >> >> > >>>> of OgmLoader.load. We could look at this part and see if we can make an
> >> >> > >>>> OGM version of it. We never had the need before as we never had query
> >> >> > >>>> support (the way SQL does it).
> >> >> > >>>
> >> >> > >>> I would also need to study the ORM code, but to add a high level observation,
> >> >> > >>> the methods currently defined by the GridDialect are focusing on
> >> >> > >>> loading from well known key instances,
> >> >> > >>> there is nothing to makes us able to scan/inspect for all values.
> >> >> > >>>
> >> >> > >>> In other words: even if we wanted to load keys first, we don't have definitions
> >> >> > >>> of functions from raw->primary key instances either.
> >> >> > >>
> >> >> > >> I understand that. I'm not denying the need for the method.
> >> >> > >>
> >> >> > >>>
> >> >> > >>>
> >> >> > >>>> On the visitor vs Iterator approach, I still don't see how implementing
> >> >> > >>>> an Iterator on a map / reduce backend would be harder than the visitor
> >> >> > >>>> but maybe I'm missing something.
> >> >> > >>>>
> >> >> > >>>>    class IteratorAsStream {
> >> >> > >>>>        final Query someMapReduceQuery = ...;
> >> >> > >>>>
> >> >> > >>>>        public Object next() {
> >> >> > >>>>            if (!someMapReduceQuery.started()) {
> >> >> > >>>>                // execute and collect results in parallel
> >> >> > >>>>                someMapReduceQuery.execute();
> >> >> > >>>>            }
> >> >> > >>>>            Object result = someMapReduceQuery.getNextOrBlock();
> >> >> > >>>>            return result;
> >> >> > >>>>        }
> >> >> > >>>>    }
> >> >> > >>>
> >> >> > >>> That could work to *load* all entities in parallel, but I'd like to
> >> >> > >>> process the entities in parallel as well.
> >> >> > >>> And I'd rather not force the GridDialect implementors to write some
> >> >> > >>> Hibernate Search specific code,
> >> >> > >>> so to break out we need some form of "Execute X on each": a closure or a lambda.
> >> >> > >>>
> >> >> > >>
> >> >> > >> I can't see how the visitor model helps in your processing of entities in parallel. To me both approaches are strictly equivalent. Care to show some pseudo-code?
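
One way to read the "Execute X on each" argument, sketched under stated assumptions: with an iterator, every result is pulled back to the single caller before it can be processed, whereas a callback can be run by whichever worker owns the data. PerPartitionRunner is a hypothetical name, each inner list stands in for the data held by one node, and a local thread pool stands in for remote execution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class PerPartitionRunner {

    // Runs the consumer on every element, one worker per partition.
    // The consumer is called from several threads, so it must be thread-safe.
    static <T> void forEachInParallel(List<List<T>> partitions, Consumer<T> consumer) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> partition : partitions) {
                // Processing happens next to the data; nothing is funneled
                // through a queue owned by the caller.
                Runnable work = () -> partition.forEach(consumer);
                pending.add(pool.submit(work));
            }
            for (Future<?> future : pending) {
                future.get(); // wait, and surface any failure from the worker
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

An iterator-based design can load in parallel too, but the loop that calls next() is itself the single point through which every element has to pass.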
> >> >> _______________________________________________
> >> >> hibernate-dev mailing list
> >> >> hibernate-dev at lists.jboss.org
> >> >> https://lists.jboss.org/mailman/listinfo/hibernate-dev