Re: [hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Wednesday, 6 March 2013

I've successfully implemented OGM-151 for EntityKey, which is the one we
need to move OGM-273 forward for now.
I am trying to implement it for AssociationKey, but caching here is
significantly harder as data is cross-referenced across associations.

Sanne, when you worked on the profiling of OGM, do you remember
AssociationKey putting pressure on build time or memory? Because
caching them per persister means some rather complex race conditions and
more memory used permanently (as opposed to on demand).

So I'm wondering if that's worth it. As an intermediary step, I could
introduce AssociationKeyMetadata but build it on-demand - that one is
easier to achieve.
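
A rough sketch of what that intermediary step could look like, for illustration only. The class shapes below (`AssociationKeyMetadata` holding a table name and column names, `AssociationKey` holding the per-row values, and the `AssociationKeys.onDemand` helper) are assumptions made up for this example and are not taken from the OGM code base. The point is only that the metadata is rebuilt whenever a key is needed, so nothing is cached per persister and no shared state has to be synchronized.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical: the part of a key that is identical for every row of an association.
final class AssociationKeyMetadata {
    final String table;
    final String[] columnNames;

    AssociationKeyMetadata(String table, String[] columnNames) {
        this.table = Objects.requireNonNull(table);
        this.columnNames = columnNames.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AssociationKeyMetadata)) return false;
        AssociationKeyMetadata other = (AssociationKeyMetadata) o;
        return table.equals(other.table) && Arrays.equals(columnNames, other.columnNames);
    }

    @Override
    public int hashCode() {
        return 31 * table.hashCode() + Arrays.hashCode(columnNames);
    }
}

// Hypothetical: metadata plus the column values that identify one association.
final class AssociationKey {
    final AssociationKeyMetadata metadata;
    final Object[] columnValues;

    AssociationKey(AssociationKeyMetadata metadata, Object[] columnValues) {
        if (metadata.columnNames.length != columnValues.length) {
            throw new IllegalArgumentException("one value per column expected");
        }
        this.metadata = metadata;
        this.columnValues = columnValues.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AssociationKey)) return false;
        AssociationKey other = (AssociationKey) o;
        return metadata.equals(other.metadata) && Arrays.equals(columnValues, other.columnValues);
    }

    @Override
    public int hashCode() {
        return 31 * metadata.hashCode() + Arrays.hashCode(columnValues);
    }
}

final class AssociationKeys {
    private AssociationKeys() {}

    // On demand: fresh metadata for each key, nothing retained between calls.
    static AssociationKey onDemand(String table, String[] columnNames, Object[] columnValues) {
        return new AssociationKey(new AssociationKeyMetadata(table, columnNames), columnValues);
    }
}
```

The trade-off in this sketch is extra allocation per key in exchange for no permanent memory use and no race conditions; caching the metadata per persister could be layered on later without changing the key classes.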

Emmanuel

On Wed 2013-03-06 15:32, Davide D'Alto wrote:
...
 it's ok for me

 Davide

 On Wed, Mar 6, 2013 at 3:28 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
 > I'm planning on working on OGM-151. Fine with everyone?
 > That will likely be my last before I move back to BVAL and close the
 > final issues there.
 >
 > Emmanuel
 >
 > On Tue 2013-03-05 19:04, Sanne Grinovero wrote:
 >> Nice!
 >> n+1 is something Hibernate Search has to deal with too, that's why I
 >> was interested in the fetch profiles and graph loading in JPA 2.1
 >>
 >> On 5 March 2013 17:44, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
 >> > I have implemented a solution that gives an entity based on a tuple.
 >> > https://hibernate.onjira.com/browse/OGM-273#comment-50082
 >> >
 >> > Note that it does not currently work for MongoDB, but that's waiting
 >> > for the dedicated GridDialect method as well as OGM-151.
 >> > Also note that I have no idea how that will work for associations. I
 >> > suspect some nasty n+1 is happening at best. Worst case, an exception :)
 >> >
 >> > Emmanuel
 >> >
 >> > On Tue 2013-03-05 10:30, Emmanuel Bernard wrote:
 >> >> We might hope for a stable enough contract on Hibernate Search and
 >> >> hope that we won't break serializability between micro or minor
 >> >> versions. That will need to be taken into account in the test suite and
 >> >> design.
 >> >> On the OGM side though, we are not at that level of maturity and we will
 >> >> force a homogeneous Hibernate OGM version across the whole cluster. The grid
 >> >> will have to go down for upgrades or enforce that no map reduce job
 >> >> using OGM is used while the version rollout is in progress.
 >> >>
 >> >> Emmanuel
 >> >>
 >> >> On Mon 2013-03-04 18:30, Sanne Grinovero wrote:
 >> >> > Found an example, this is all the code it needs to have a MassIndexer working
 >> >> > on top of Infinispan's Map/Reduce:
 >> >> >
 >> >> > https://github.com/infinispan/infinispan/blob/master/query/src/main/java/...
 >> >> >
 >> >> > Note its initialize method, which injects the needed components; the
 >> >> > implementation is serialized across nodes.
 >> >> >
 >> >> > Sanne
 >> >> >
 >> >> > On 4 March 2013 18:26, Sanne Grinovero <sanne(a)hibernate.org> wrote:
 >> >> > > We finished this discussion on IRC, in case someone else was interested:
 >> >> > >
 >> >> > > <sanne> hum I forgot the first step.. transformation from entry into entity
 >> >> > > <sanne> updated
 >> >> > > <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
 >> >> > > head against, but let's assume he finds a workaround and we focus on
 >> >> > > the pattern as first step?
 >> >> > > <emmanuel> https://gist.github.com/emmanuelbernard/5084039
 >> >> > > <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
 >> >> > > <emmanuel> assuming pushToExecutor pushes to whatever concurrent work
 >> >> > > mechanism you planned to use on consumes
 >> >> > > <emmanuel> Plus I am not following exactly how you plan consumes(Entry)
 >> >> > > to be executed concurrently
 >> >> > > <emmanuel> is that the GridDialect responsibility?
 >> >> > > <emmanuel> That looks like a lot of work on the dialect's side
 >> >> > > <sanne> emmanuel, imagine the backend is Infinispan and has some large
 >> >> > > amount of data per node, plus that each node has its own backend
 >> >> > > IndexManager (like an ideal sharding)
 >> >> > > <emmanuel> ie pool mgt and cap + queuing
 >> >> > > <sanne> then with your approach the iterator needs to fetch data from
 >> >> > > all remote nodes, and then enqueue in a local blocking queue which is
 >> >> > > returning the data to the original owners
 >> >> > > <sanne> but if you skip that step, you can just forward the stateless
 >> >> > > consumer to each node and have it run on data locality
 >> >> > > <emmanuel> I was thinking that if you had the Lucene index locally on
 >> >> > > each node you would have a different impl of the MassIndexer anyways
 >> >> > > <emmanuel> that would simply send a command to each local node
 >> >> > > <sanne> To answer your question: that would be an optional GridDialect
 >> >> > > responsibility. I would endorse a trivial first draft doing a
 >> >> > > single-threaded loop.
 >> >> > > <emmanuel> and have GridDialect.getDataFor() return local data
 >> >> > > <sanne> The "consumes" implementation can be implemented with a
 >> >> > > simple iterator - as in your design - so I don't think it pushes much
 >> >> > > complexity to the GridDialect implementor?
 >> >> > > <sanne> The benefit of the consumer is that *optionally* it can be
 >> >> > > mapped on the Map phase, and that's trivial if your backend supports
 >> >> > > Map/Reduce
 >> >> > > <emmanuel> sanne: I don't follow that sorry
 >> >> > > <emmanuel> how does that make it mappable to the Map phase?
 >> >> > > <sanne> "public void consume(Entry e) " is a degenerate (simplified)
 >> >> > > form of map.
 >> >> > > <sanne> mm infinispan IDE crashes at the right moment.
 >> >> > > <emmanuel> I thought Map was about *filtering*
 >> >> > > <emmanuel> not processing
 >> >> > > <sanne> you can decide to accept 100% of values (without filtering),
 >> >> > > but actually you might want to filter on the specified tables only.
 >> >> > > <sanne> also, the return type doesn't have to match the input type:
 >> >> > > hence you define a transformation function, which is inherently
 >> >> > > applied in parallel on all matching entries.
 >> >> > > <emmanuel> sanne: but then you require the OGM code to be everywhere
 >> >> > > (ie on each node of the target NoSQL)
 >> >> > > <emmanuel> to be able to do tuple -> entity
 >> >> > > <emmanuel> that's not realistic
 >> >> > > <emmanuel> assuming your transform phase is about tuple -> entity and
 >> >> > > some HSearch ops
 >> >> > > <sanne> yes right
 >> >> > > <sanne> but isn't it worth it? it's optional and much more efficient,
 >> >> > > as you avoid transferring any data.
 >> >> > > <sanne> btw we often assume all nodes in the grid are equally
 >> >> > > configured, so having same apps & libraries deployed.
 >> >> > > <emmanuel> sanne: let me try and summarize what I understand
 >> >> > > <emmanuel> it's more efficient if you store the Lucene index locally
 >> >> > > with the data, and if the grid is written in Java or at least can run
 >> >> > > code in Java including libraries and if you distribute the OGM
 >> >> > > configuration across the whole grid
 >> >> > > <emmanuel> Otherwise, it does not make any difference
 >> >> > > <emmanuel> Also the GridDialect implementation needs to know if you are
 >> >> > > doing this trick to only return local data
 >> >> > > <sanne> no there are other drawbacks which get defeated, but minor so
 >> >> > > I didn't mention them
 >> >> > > <emmanuel> am I right?
 >> >> > > <sanne> mainly, you skip the need for the contention point as there
 >> >> > > is no push to a shared blocking queue
 >> >> > > <sanne> no the GridDialect doesn't need to know.
 >> >> > > <emmanuel> sanne: sure if you can process the code on each node you
 >> >> > > avoid the shared blocking queue, at least until you reach the
 >> >> > > IndexManager
 >> >> > > <sanne> you'll just forward a simple (standard) M/R task, and it will
 >> >> > > need to execute it as always.
 >> >> > > <sanne> the IndexManager is parallel ;)
 >> >> > > <emmanuel> sanne: parallel on a single node
 >> >> > > <sanne> yes, but no contention points other than the internal
 >> >> > > structure of the IW
 >> >> > > <emmanuel> I mean updating the index for a given table is better done
 >> >> > > on a single node
 >> >> > > <sanne> IndexWriter
 >> >> > > <emmanuel> sorry I meant IndexWriter
 >> >> > > <emmanuel> ah but you mention perfect sharding
 >> >> > > <emmanuel> you need cosmological alignment for this shit to happen
 >> >> > > <sanne> not if we plan for it :)
 >> >> > > <sanne> you might remember the changes to Segments in the ISPN code,
 >> >> > > to accommodate index storage consistent with the data locality
 >> >> > > <sanne> that's expected in 6.0
 >> >> > > <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
 >> >> > > <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
 >> >> > > <emmanuel> as you need to send the Consumer impl
 >> >> > > <emmanuel> not simply use it
 >> >> > > <sanne> hu, it needs a reference to the current SearchFactory at very least
 >> >> > > <emmanuel> sanne: but you're telling me you send the M/R task
 >> >> > > <emmanuel> so you need to send the M/R code as well
 >> >> > > <sanne> yes but here we enter Infinispan specific implementation
 >> >> > > <sanne> I would register the needed components in Infinispan and use
 >> >> > > the ServiceRegistry to look them up remotely
 >> >> > > <sanne> not to mention Infinispan could accommodate a custom command for it
 >> >> > > <emmanuel> What I am saying is that you don't pass the Consumer
 >> >> > > *instance* to the grid dialect but rather the impl, no?
 >> >> > > <sanne> the impl class definition?
 >> >> > > <emmanuel> sanne: you tell me. How do I send M/R code today?
 >> >> > > <emmanuel> certainly not an impl instance
 >> >> > > <sanne> yes you do
 >> >> > > <sanne> JBMar will take care of it, including state.
 >> >> > > <sanne> but in this case that would be wrong of course as I don't want
 >> >> > > to serialize the whole SearchFactory so I'd use injection and lookup,
 >> >> > > but that's a detail of Infinispan.
 >> >> > > <sanne> But this shouldn't be MassIndexer specific right? it's good to
 >> >> > > expose a general "execute on all" method, and I think accepting
 >> >> > > instances would make life easier for most - even though we might need
 >> >> > > to document some limitations.
 >> >> > > <emmanuel> alright, I guess I'll have to live with a visitor pattern
 >> >> > > for a feature that has 5% chance of happening :)
 >> >> > > <sanne> I'm going to punch Davide
 >> >> > > <sanne> as he's yelling "it's not a visitor" but doesn't have the guts
 >> >> > > to write it down :)
 >> >> > > <emmanuel> sanne: DavideD's would have nothing to do with it, that
 >> >> > > requires a lot of config and Infinispan machinery I'm not sure is here
 >> >> > > today
 >> >> > > <DavideD> :)
 >> >> > > <emmanuel> ah
 >> >> > > <emmanuel> I don't care how it's called, it's one of those patterns
 >> >> > > that make the code harder to follow
 >> >> > > <DavideD> I was actually trying to remember the name of the pattern
 >> >> > > <sanne> ok now we agree :)
 >> >> > > <emmanuel> Obfuscator pattern family
 >> >> > > <sanne> very popular among consultants, I don't understand why you complain :P
 >> >> > > <sanne> Anyway, let's wrap up and broaden the horizon:
 >> >> > > <emmanuel> ok so we are left with finding how to load an entity from a tuple
 >> >> > > <sanne> you don't think it's useful as a general purpose method?
 >> >> > > <emmanuel> sanne: will be for queries
 >> >> > > <emmanuel> It's just that it's non obvious
 >> >> > > <sanne> Exactly. Also I think lambda methods are getting widely better known.
 >> >> > > <emmanuel> syntactically yes
 >> >> > > <emmanuel> VM wise, perf improvements will come later
 >> >> > > <sanne> what I mean is that by defining the SPI this way, I don't
 >> >> > > expect it to be more complex for the GridDialect implementors, while
 >> >> > > we can reuse it for a wider scope of needs.
 >> >> > >
 >> >> > >  --Sanne
 >> >> > >
 >> >> > > On 4 March 2013 17:02, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
 >> >> > >>
 >> >> > >>
 >> >> > >> On 4 March 2013, at 17:39, Sanne Grinovero <sanne(a)hibernate.org> wrote:
 >> >> > >>
 >> >> > >>> On 4 March 2013 16:20, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
 >> >> > >>>> I already gave what I knew on how to load an entity from a tuple (which
 >> >> > >>>> isn't much) but we can try and dig together. Something I thought about
 >> >> > >>>> is that ORM probably has a mechanism to load an entity from a resultset
 >> >> > >>>> via the query parser. And that probably looks also like the second half
 >> >> > >>>> of OgmLoader.load. We could look at this part and see if we can make an
 >> >> > >>>> OGM version of it. We never had the need before as we never had query
 >> >> > >>>> support (the way SQL does it).
 >> >> > >>>
 >> >> > >>> I would also need to study the ORM code, but to add a high level observation,
 >> >> > >>> the methods currently defined by the GridDialect are focusing on
 >> >> > >>> loading from well known key instances,
 >> >> > >>> there is nothing that makes us able to scan/inspect for all values.
 >> >> > >>>
 >> >> > >>> In other words: even if we wanted to load keys first, we don't have definitions
 >> >> > >>> of functions from raw->primary key instances either.
 >> >> > >>
 >> >> > >> I understand that. I'm not denying the need for the method.
 >> >> > >>
 >> >> > >>>
 >> >> > >>>
 >> >> > >>>> On the visitor vs Iterator approach, I still don't see how implementing
 >> >> > >>>> an Iterator on a map / reduce backend would be harder than the visitor
 >> >> > >>>> but maybe I'm missing something.
 >> >> > >>>>
 >> >> > >>>>    class IteratorAsStream {
 >> >> > >>>>        final Query someMapReduceQuery = ...;
 >> >> > >>>>
 >> >> > >>>>        public Object next() {
 >> >> > >>>>            if (!someMapReduceQuery.started()) {
 >> >> > >>>>                // execute and collect results in parallel
 >> >> > >>>>                someMapReduceQuery.execute();
 >> >> > >>>>            }
 >> >> > >>>>            Object result = someMapReduceQuery.getNextOrBlock();
 >> >> > >>>>            return result;
 >> >> > >>>>        }
 >> >> > >>>>    }
 >> >> > >>>
 >> >> > >>> That could work to *load* all entities in parallel, but I'd like to
 >> >> > >>> process the entities in parallel as well.
 >> >> > >>> And I'd rather not force the GridDialect implementors to write some
 >> >> > >>> Hibernate Search specific code,
 >> >> > >>> so to break out we need some form of "Execute X on each": a closure or a lambda.
 >> >> > >>>
 >> >> > >>
 >> >> > >> I can't see how the visitor model helps in your processing of entities in
 >> >> > >> parallel. To me both approaches are strictly equivalent. Care to show some
 >> >> > >> pseudo-code?
 >> >> _______________________________________________
 >> >> hibernate-dev mailing list
 >> >> hibernate-dev(a)lists.jboss.org
 >> >> https://lists.jboss.org/mailman/listinfo/hibernate-dev
 > _______________________________________________
 > hibernate-dev mailing list
 > hibernate-dev(a)lists.jboss.org
 > https://lists.jboss.org/mailman/listinfo/hibernate-dev 
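
For readers following the iterator-versus-consumer debate quoted above, the "execute X on each" shape can be sketched as below. Every name here (`TupleConsumer`, `TupleScanner`, `InMemoryScanner`, `forEachTuple`) is hypothetical and chosen for this example; none of it is the actual GridDialect contract. The in-memory implementation is the "trivial first draft doing a single-threaded loop" mentioned in the IRC log, which shows that a plain iteration is enough to satisfy a consumer-style SPI.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback: the "X" in "execute X on each" stored tuple.
interface TupleConsumer {
    void consume(String table, Map<String, Object> tuple);
}

// Hypothetical slice of a dialect SPI; not the real GridDialect interface.
interface TupleScanner {
    void forEachTuple(TupleConsumer consumer, String... tables);
}

// Trivial first draft: a single-threaded loop over an in-memory store.
final class InMemoryScanner implements TupleScanner {
    private final Map<String, List<Map<String, Object>>> store = new LinkedHashMap<>();

    void put(String table, Map<String, Object> tuple) {
        store.computeIfAbsent(table, t -> new ArrayList<>()).add(tuple);
    }

    @Override
    public void forEachTuple(TupleConsumer consumer, String... tables) {
        // Only the requested tables are visited, in the order given.
        for (String table : tables) {
            List<Map<String, Object>> rows =
                    store.getOrDefault(table, Collections.<Map<String, Object>>emptyList());
            for (Map<String, Object> tuple : rows) {
                consumer.consume(table, tuple);
            }
        }
    }
}
```

A backend with Map/Reduce support could instead ship the consumer to each node and run it next to the data, which is the optional optimization argued for in the log; that variant depends on backend-specific machinery and is not shown here.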
