Re: [hibernate-dev] [OGM] Ogm mass indexer, how to convert Tuple/EntityKey to Entity/Id?

Tuesday, 19 March 2013

On Friday we pair-programmed and likely finished the implementation:
it looks good, but we couldn't run the test.
The blocker is that Map/Reduce on Infinispan only works in DIST mode, and since
we can't iterate over entries we need M/R, so we might need to
reconfigure Infinispan in our tests.

That's annoying, as DIST will make our test suite significantly slower;
an alternative is to have Infinispan fix this limitation first.

Sanne

On 13 March 2013 11:14, Davide D'Alto <daltodavide(a)gmail.com> wrote:
...
 No problem.

 The association is lazy, but I will look into Hibernate.initialize.

 On Tue, Mar 12, 2013 at 8:01 PM, Emmanuel Bernard
 <emmanuel(a)hibernate.org> wrote:
> I have not forgotten, I'm just in the middle of a Bean Validation crisis
> that delayed my look into this issue.
> Could it be BTW that the mass indexer does not ask for these objects to
> be loaded using Hibernate.initialize? It could also be a bug in OGM but
> not necessarily. In particular, is the association lazy or eager?
>
> Emmanuel
>
> On Mon 2013-03-11 11:00, Davide D'Alto wrote:
>> I have created a branch for OGM-228 (OGM MassIndexer) that includes
>> OGM-151 (Metamodel) and OGM-273 (load entities from tuple):
>> https://github.com/DavideD/hibernate-ogm/tree/OGM-228
>>
>> A test I've added fails though (AssociationMassIndexerTest):
>>
>> https://github.com/DavideD/hibernate-ogm/blob/74549a4d264af30fa88960c30e2...
>>
>> The test uses two entities, IndexedNews and IndexedLabel, with a
>> one-to-many relationship from news to label.
>> The mass indexing works fine but when I retrieve the list of indexed
>> labels with the query "FROM IndexedLabel", the result contains a list
>> of proxies and the equals fails because the class of the objects in the
>> list is not IndexedLabel.
>>
>> If I first get the list of news and then for each of them I call the
>> method news.getLabels(), everything works fine.
>>
>> Any thoughts?
>>
>> Thanks
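
The failure described above (equals failing because the objects in the list are proxies rather than IndexedLabel) can be reproduced without Hibernate. The following is a minimal sketch under assumptions: `Label` and `LabelProxy` are hypothetical stand-ins for the entity and for a runtime-generated proxy subclass, not the classes from the test, and the two equals variants are named separately only to compare them side by side.

```java
import java.util.Objects;

// Hypothetical stand-in for an entity such as IndexedLabel.
class Label {
    private final String name;

    Label(String name) { this.name = name; }

    String getName() { return name; }

    // Strict variant: rejects any subclass, including a runtime-generated proxy.
    boolean strictEquals(Object other) {
        if (other == null || getClass() != other.getClass()) return false;
        return Objects.equals(getName(), ((Label) other).getName());
    }

    // Tolerant variant: accepts subclasses and reads state through the getter.
    boolean tolerantEquals(Object other) {
        if (!(other instanceof Label)) return false;
        return Objects.equals(getName(), ((Label) other).getName());
    }
}

// Hypothetical stand-in for a lazy proxy: a subclass delegating to the real instance.
class LabelProxy extends Label {
    private final Label target;

    LabelProxy(Label target) {
        super(null); // the proxy holds no state of its own
        this.target = target;
    }

    @Override
    String getName() { return target.getName(); }
}
```

With real Hibernate proxies, Hibernate.initialize(...) forces loading and Hibernate.getClass(...) returns the underlying entity class; whether either resolves the test failure in this thread is not established here.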
>>
>> On Thu, Mar 7, 2013 at 10:15 AM, Emmanuel Bernard
>> <emmanuel(a)hibernate.org> wrote:
>> > I have no more coin for this one so I have dumped what I have so far
>> > https://github.com/hibernate/hibernate-ogm/pull/175
>> >
>> > Emmanuel
>> >
>> > On Wed 2013-03-06 19:18, Emmanuel Bernard wrote:
>> >> I've successfully implemented OGM-151 for EntityKey, which is the one we
>> >> need to move OGM-273 forward for now.
>> >> I am trying to implement it for AssociationKey but caching here is
>> >> significantly harder as data is cross-referenced across associations.
>> >>
>> >> Sanne, when you worked on the profiling of OGM, do you remember
>> >> AssociationKey putting pressure on build time or memory? Because
>> >> caching them per persister means some rather complex race conditions and
>> >> more memory used permanently (as opposed to on demand).
>> >>
>> >> So I'm wondering if that's worth it. As an intermediary step, I could
>> >> introduce AssociationKeyMetadata but build it on-demand - that one is
>> >> easier to achieve.
>> >>
>> >> Emmanuel
>> >>
>> >> On Wed 2013-03-06 15:32, Davide D'Alto wrote:
>> >> > it's ok for me
>> >> >
>> >> > Davide
>> >> >
>> >> > On Wed, Mar 6, 2013 at 3:28 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>> >> > > I'm planning on working on OGM-151. Fine with everyone?
>> >> > > That will likely be my last before I move back to BVAL and close the
>> >> > > final issues there.
>> >> > >
>> >> > > Emmanuel
>> >> > >
>> >> > > On Tue 2013-03-05 19:04, Sanne Grinovero wrote:
>> >> > >> Nice!
>> >> > >> n+1 is something Hibernate Search has to deal with too, that's why I
>> >> > >> was interested in the fetch profiles and graph loading in JPA 2.1
>> >> > >>
>> >> > >> On 5 March 2013 17:44, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>> >> > >> > I have implemented a solution that gives an entity based on a tuple.
>> >> > >> > https://hibernate.onjira.com/browse/OGM-273#comment-50082
>> >> > >> >
>> >> > >> > Note that it does not currently work for MongoDB, but that's waiting
>> >> > >> > for the dedicated GridDialect method as well as OGM-151.
>> >> > >> > Also note that I have no idea how that will work for associations. I
>> >> > >> > suspect some nasty n+1 is happening at best. Worst case, an exception :)
>> >> > >> >
>> >> > >> > Emmanuel
>> >> > >> >
>> >> > >> > On Tue 2013-03-05 10:30, Emmanuel Bernard wrote:
>> >> > >> >> We might hope for a stable enough contract on Hibernate Search and
>> >> > >> >> hope that we won't break serializability between micro or minor
>> >> > >> >> versions. That will need to be taken into account in the test suite and
>> >> > >> >> design.
>> >> > >> >> On the OGM side though, we are not at that level of maturity and we will
>> >> > >> >> force a homogeneous Hibernate OGM version across the whole cluster. The grid
>> >> > >> >> will have to go down for upgrades or enforce that no map reduce job
>> >> > >> >> using OGM is run while the version rollout is in progress.
>> >> > >> >>
>> >> > >> >> Emmanuel
>> >> > >> >>
>> >> > >> >> On Mon 2013-03-04 18:30, Sanne Grinovero wrote:
>> >> > >> >> > Found an example, this is all the code it needs to have a MassIndexer working
>> >> > >> >> > on top of Infinispan's Map/Reduce:
>> >> > >> >> >
>> >> > >> >> > https://github.com/infinispan/infinispan/blob/master/query/src/main/java/...
>> >> > >> >> >
>> >> > >> >> > Note its initialize method, which injects the needed components; the
>> >> > >> >> > implementation is serialized across nodes.
>> >> > >> >> >
>> >> > >> >> > Sanne
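
The pattern mentioned above (a task serialized across nodes, with an initialize method injecting the components it needs on arrival) can be sketched in plain Java. This is an illustration under assumptions, not the Infinispan code behind the truncated link: `LocalIndexer`, `IndexingTask` and `roundTrip` are hypothetical names, and standard Java serialization stands in for whatever marshalling the grid really uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical node-local component; deliberately not serializable.
class LocalIndexer {
    final List<String> indexed = new ArrayList<>();

    void index(String value) { indexed.add(value); }
}

// Hypothetical task shipped to each node: only its small state travels.
class IndexingTask implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String table;              // serialized with the task
    private transient LocalIndexer indexer;  // injected on the receiving node

    IndexingTask(String table) { this.table = table; }

    // Called on the receiving node after deserialization.
    void initialize(LocalIndexer nodeLocalIndexer) { this.indexer = nodeLocalIndexer; }

    void process(String key) {
        if (indexer == null) {
            throw new IllegalStateException("initialize() was not called");
        }
        indexer.index(table + ":" + key);
    }

    // Simulates sending the task over the wire and receiving it elsewhere.
    static IndexingTask roundTrip(IndexingTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (IndexingTask) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The transient field keeps heavyweight components (a SearchFactory, for instance) out of the serialized form; the receiving side is responsible for calling initialize before the task runs.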
>> >> > >> >> >
>> >> > >> >> > On 4 March 2013 18:26, Sanne Grinovero <sanne(a)hibernate.org> wrote:
>> >> > >> >> > > We finished this discussion on IRC, in case someone else was interested:
>> >> > >> >> > >
>> >> > >> >> > > <sanne> hum I forgot the first step.. transformation from entry into entity
>> >> > >> >> > > <sanne> updated
>> >> > >> >> > > <sanne> emmanuel, the "hydrate" step is what DavideD is bashing his
>> >> > >> >> > > head against, but let's assume he finds a workaround and we focus on
>> >> > >> >> > > the pattern as first step?
>> >> > >> >> > > <emmanuel> https://gist.github.com/emmanuelbernard/5084039
>> >> > >> >> > > <emmanuel> sanne: ^ that's how I would do it if I had an Iterator from the tuple
>> >> > >> >> > > <emmanuel> assuming pushToExecutor pushes to whatever concurrent work
>> >> > >> >> > > mechanism you planned to use on consumes
>> >> > >> >> > > <emmanuel> Plus I am not following exactly how you plan consumes(Entry)
>> >> > >> >> > > to be executed concurrently
>> >> > >> >> > > <emmanuel> is that the GridDialect responsibility?
>> >> > >> >> > > <emmanuel> That looks like a lot of work on the dialect's side
>> >> > >> >> > > <sanne> emmanuel, imagine the backend is Infinispan and has some large
>> >> > >> >> > > amount of data per node, plus that each node has its own backend
>> >> > >> >> > > IndexManager (like an ideal sharding)
>> >> > >> >> > > <emmanuel> ie pool mgt and cap + queuing
>> >> > >> >> > > <sanne> then with your approach the iterator needs to fetch data from
>> >> > >> >> > > all remote nodes, and then enqueue in a local blocking queue which is
>> >> > >> >> > > returning the data to the original owners
>> >> > >> >> > > <sanne> but if you skip that step, you can just forward the stateless
>> >> > >> >> > > consumer to each node and have it run on data locality
>> >> > >> >> > > <emmanuel> I was thinking that if you had the Lucene index locally on
>> >> > >> >> > > each node you would have a different impl of the MassIndexer anyways
>> >> > >> >> > > <emmanuel> that would simply send a command to each local node
>> >> > >> >> > > <sanne> To answer your question: that would be an optional GridDialect
>> >> > >> >> > > responsibility. I would endorse a trivial first draft doing a
>> >> > >> >> > > single-threaded loop.
>> >> > >> >> > > <emmanuel> and have GridDialect.getDataFor() return local data
>> >> > >> >> > > <sanne> The "consumes" implementation can be implemented with a
>> >> > >> >> > > simple iterator - as in your design - so I don't think it pushes much
>> >> > >> >> > > complexity to the GridDialect implementor?
>> >> > >> >> > > <sanne> The benefit of the consumer is that *optionally* it can be
>> >> > >> >> > > mapped on the Map phase, and that's trivial if your backend supports
>> >> > >> >> > > Map/Reduce
>> >> > >> >> > > <emmanuel> sanne: I don't follow that sorry
>> >> > >> >> > > <emmanuel> how does that make it mappable to the Map phase?
>> >> > >> >> > > <sanne> "public void consume(Entry e) " is a degenerate (simplified)
>> >> > >> >> > > form of map.
>> >> > >> >> > > <sanne> mm infinispan IDE crashes at the right moment.
>> >> > >> >> > > <emmanuel> I thought Map was about *filtering*
>> >> > >> >> > > <emmanuel> not processing
>> >> > >> >> > > <sanne> you can decide to accept 100% of values (without filtering),
>> >> > >> >> > > but actually you might want to filter on the specified tables only.
>> >> > >> >> > > <sanne> also, the return type doesn't have to match the input type:
>> >> > >> >> > > hence you define a transformation function, which is inherently
>> >> > >> >> > > applied in parallel on all matching entries.
>> >> > >> >> > > <emmanuel> sanne: but then you require the OGM code to be everywhere
>> >> > >> >> > > (ie on each node of the target NoSQL)
>> >> > >> >> > > <emmanuel> to be able to do tuple -> entity
>> >> > >> >> > > <emmanuel> that's not realistic
>> >> > >> >> > > <emmanuel> assuming your transform phase is about tuple -> entity and
>> >> > >> >> > > some HSearch ops
>> >> > >> >> > > <sanne> yes right
>> >> > >> >> > > <sanne> but isn't it worth it? it's optional and much more efficient,
>> >> > >> >> > > as you avoid transferring any data.
>> >> > >> >> > > <sanne> btw we often assume all nodes in the grid are equally
>> >> > >> >> > > configured, so having same apps & libraries deployed.
>> >> > >> >> > > <emmanuel> sanne: let me try and summarize what I understand
>> >> > >> >> > > <emmanuel> it's more efficient if you store the Lucene index locally
>> >> > >> >> > > with the data, and if the grid is written in Java or at least can run
>> >> > >> >> > > code in Java including libraries and if you distribute the OGM
>> >> > >> >> > > configuration across the whole grid
>> >> > >> >> > > <emmanuel> Otherwise, it does not make any difference
>> >> > >> >> > > <emmanuel> Also the GridDialect implementation needs to know if you are
>> >> > >> >> > > doing this trick to only return local data
>> >> > >> >> > > <sanne> no there are other drawbacks which get defeated, but minor so
>> >> > >> >> > > I didn't mention them
>> >> > >> >> > > <emmanuel> am I right?
>> >> > >> >> > > <sanne> mainly, you skip the need for the contention point as there
>> >> > >> >> > > is no push to a shared blocking queue
>> >> > >> >> > > <sanne> no the GridDialect doesn't need to know.
>> >> > >> >> > > <emmanuel> sanne: sure if you can process the code on each node you
>> >> > >> >> > > avoid the shared blocking queue, at least until you reach the
>> >> > >> >> > > IndexManager
>> >> > >> >> > > <sanne> you'll just forward a simple (standard) M/R task, and it will
>> >> > >> >> > > need to execute it as always.
>> >> > >> >> > > <sanne> the IndexManager is parallel ;)
>> >> > >> >> > > <emmanuel> sanne: parallel on a single node
>> >> > >> >> > > <sanne> yes, but no contention points other than the internal
>> >> > >> >> > > structure of the IW
>> >> > >> >> > > <emmanuel> I mean updating the index for a given table is better done
>> >> > >> >> > > on a single node
>> >> > >> >> > > <sanne> IndexWriter
>> >> > >> >> > > <emmanuel> sorry I meant IndexWriter
>> >> > >> >> > > <emmanuel> ah but you mention perfect sharding
>> >> > >> >> > > <emmanuel> you need cosmological alignment for this shit to happen
>> >> > >> >> > > <sanne> not if we plan for it :)
>> >> > >> >> > > <sanne> you might remember the changes to Segments in the ISPN code,
>> >> > >> >> > > to accommodate index storage consistent with the data locality
>> >> > >> >> > > <sanne> that's expected in 6.0
>> >> > >> >> > > <emmanuel> So gridDialect.getData(Consumer consumer, String... tables) is wrong
>> >> > >> >> > > <emmanuel> it's more gridDialect.getData(ConsumerImpl.class, String... tables)
>> >> > >> >> > > <emmanuel> as you need to send the Consumer impl
>> >> > >> >> > > <emmanuel> not simply use it
>> >> > >> >> > > <sanne> hu, it needs a reference to the current SearchFactory at very least
>> >> > >> >> > > <emmanuel> sanne: but you're telling me you send the M/R task
>> >> > >> >> > > <emmanuel> so you need to send the M/R code as well
>> >> > >> >> > > <sanne> yes but here we enter Infinispan specific implementation
>> >> > >> >> > > <sanne> I would register the needed components in Infinispan and use
>> >> > >> >> > > the ServiceRegistry to look them up remotely
>> >> > >> >> > > <sanne> not to mention Infinispan could accommodate a custom command for it
>> >> > >> >> > > <emmanuel> What I am saying is that you don't pass the Consumer
>> >> > >> >> > > *instance* to the grid dialect but rather the impl, no?
>> >> > >> >> > > <sanne> the impl class definition?
>> >> > >> >> > > <emmanuel> sanne: you tell me. How do I send M/R code today?
>> >> > >> >> > > <emmanuel> certainly not an impl instance
>> >> > >> >> > > <sanne> yes you do
>> >> > >> >> > > <sanne> JBMar will take care of it, including state.
>> >> > >> >> > > <sanne> but in this case that would be wrong of course as I don't want
>> >> > >> >> > > to serialize the whole SearchFactory so I'd use injection and lookup,
>> >> > >> >> > > but that's a detail of Infinispan.
>> >> > >> >> > > <sanne> But this shouldn't be MassIndexer specific right? it's good to
>> >> > >> >> > > expose a general "execute on all" method, and I think accepting
>> >> > >> >> > > instances would make life easier for most - even though we might need
>> >> > >> >> > > to document some limitations.
>> >> > >> >> > > <emmanuel> alright, I guess I'll have to live with a visitor pattern
>> >> > >> >> > > for a feature that has 5% chance of happening :)
>> >> > >> >> > > <sanne> I'm going to punch Davide
>> >> > >> >> > > <sanne> as he's yelling "it's not a visitor" but doesn't have the guts
>> >> > >> >> > > to write it down :)
>> >> > >> >> > > <emmanuel> sanne: DavideD 's would have nothing to do about it, that
>> >> > >> >> > > requires a lot of config and Infinispan machinery I'm not sure is here
>> >> > >> >> > > today
>> >> > >> >> > > <DavideD> :)
>> >> > >> >> > > <emmanuel> ah
>> >> > >> >> > > <emmanuel> I don't care how it's called, it's one of those patterns
>> >> > >> >> > > that make the code harder to follow
>> >> > >> >> > > <DavideD> I was actually trying to remember the name of the pattern
>> >> > >> >> > > <sanne> ok now we agree :)
>> >> > >> >> > > <emmanuel> Obfuscator pattern family
>> >> > >> >> > > <sanne> very popular among consultants, I don't understand why you complain :P
>> >> > >> >> > > <sanne> Anyway, let's wrap up and broaden the horizon:
>> >> > >> >> > > <emmanuel> ok so we are left with finding how to load an entity from a tuple
>> >> > >> >> > > <sanne> you don't think it's useful as a general purpose method?
>> >> > >> >> > > <emmanuel> sanne: will be for queries
>> >> > >> >> > > <emmanuel> It's just that it's non obvious
>> >> > >> >> > > <sanne> Exactly. Also I think lambda methods are getting widely better known.
>> >> > >> >> > > <emmanuel> syntactically yes
>> >> > >> >> > > <emmanuel> VM wise, perf improvements will come later
>> >> > >> >> > > <sanne> what I mean is that by defining the SPI this way, I don't
>> >> > >> >> > > expect it to be more complex for the GridDialect implementors, while
>> >> > >> >> > > we can reuse it for a wider scope of needs.
>> >> > >> >> > >
>> >> > >> >> > >  --Sanne
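
The consumer-style SPI discussed in the IRC log can be sketched roughly as follows. Everything here is an assumption made for illustration: `TupleEntry`, `ScanningDialect`, `forEachTuple` and `InMemoryDialect` are hypothetical names, not part of Hibernate OGM's GridDialect, and java.util.function.Consumer (Java 8) stands in for the Consumer type mentioned in the chat. The implementation is the "trivial first draft doing a single-threaded loop", filtering on the requested tables only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a stored row: its table plus its raw values.
class TupleEntry {
    final String table;
    final Map<String, Object> values;

    TupleEntry(String table, Map<String, Object> values) {
        this.table = table;
        this.values = values;
    }
}

// Hypothetical SPI: the dialect decides how entries reach the consumer.
interface ScanningDialect {
    void forEachTuple(Consumer<TupleEntry> consumer, String... tables);
}

// Trivial first draft: a single-threaded loop over an in-memory store.
class InMemoryDialect implements ScanningDialect {
    private final List<TupleEntry> store = new ArrayList<>();

    void put(String table, String column, Object value) {
        store.add(new TupleEntry(table, Collections.singletonMap(column, value)));
    }

    @Override
    public void forEachTuple(Consumer<TupleEntry> consumer, String... tables) {
        Set<String> wanted = new HashSet<>(Arrays.asList(tables));
        for (TupleEntry entry : store) {
            // Only entries of the requested tables are handed to the consumer.
            if (wanted.contains(entry.table)) {
                consumer.accept(entry);
            }
        }
    }
}
```

A backend with Map/Reduce support could implement the same method by shipping the consumer to each node instead of looping locally; this sketch does not attempt that.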
>> >> > >> >> > >
>> >> > >> >> > > On 4 March 2013 17:02, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>> >> > >> >> > >>
>> >> > >> >> > >>
>> >> > >> >> > >> On 4 March 2013, at 17:39, Sanne Grinovero <sanne(a)hibernate.org> wrote:
>> >> > >> >> > >>
>> >> > >> >> > >>> On 4 March 2013 16:20, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>> >> > >> >> > >>>> I already gave what I knew on how to load an entity from a tuple (which
>> >> > >> >> > >>>> isn't much) but we can try and dig together. Something I thought about
>> >> > >> >> > >>>> is that ORM probably has a mechanism to load an entity from a resultset
>> >> > >> >> > >>>> via the query parser. And that probably also looks like the second half
>> >> > >> >> > >>>> of OgmLoader.load. We could look at this part and see if we can make an
>> >> > >> >> > >>>> OGM version of it. We never had the need before as we never had query
>> >> > >> >> > >>>> support (the way SQL does it).
>> >> > >> >> > >>>
>> >> > >> >> > >>> I would also need to study the ORM code, but to add a high level observation,
>> >> > >> >> > >>> the methods currently defined by the GridDialect are focusing on
>> >> > >> >> > >>> loading from well known key instances,
>> >> > >> >> > >>> there is nothing that lets us scan/inspect all values.
>> >> > >> >> > >>>
>> >> > >> >> > >>> In other words: even if we wanted to load keys first, we don't have definitions
>> >> > >> >> > >>> of functions from raw->primary key instances either.
>> >> > >> >> > >>
>> >> > >> >> > >> I understand that. I'm not denying the need for the method.
>> >> > >> >> > >>
>> >> > >> >> > >>>
>> >> > >> >> > >>>
>> >> > >> >> > >>>> On the visitor vs Iterator approach, I still don't see how implementing
>> >> > >> >> > >>>> an Iterator on a map / reduce backend would be harder than the visitor
>> >> > >> >> > >>>> but maybe I'm missing something.
>> >> > >> >> > >>>>
>> >> > >> >> > >>>>    class IteratorAsStream {
>> >> > >> >> > >>>>        final Query someMapReduceQuery = ...;
>> >> > >> >> > >>>>
>> >> > >> >> > >>>>        public Object next() {
>> >> > >> >> > >>>>            if (!someMapReduceQuery.started()) {
>> >> > >> >> > >>>>                // execute and collect results in parallel
>> >> > >> >> > >>>>                someMapReduceQuery.execute();
>> >> > >> >> > >>>>            }
>> >> > >> >> > >>>>            // blocks until the next result is available
>> >> > >> >> > >>>>            Object result = someMapReduceQuery.getNextOrBlock();
>> >> > >> >> > >>>>            return result;
>> >> > >> >> > >>>>        }
>> >> > >> >> > >>>>    }
>> >> > >> >> > >>>
>> >> > >> >> > >>> That could work to *load* all entities in parallel, but I'd like to
>> >> > >> >> > >>> process the entities in parallel as well.
>> >> > >> >> > >>> And I'd rather not force the GridDialect implementors to write some
>> >> > >> >> > >>> Hibernate Search specific code,
>> >> > >> >> > >>> so to break out we need some form of "Execute X on each": a closure or a lambda.
>> >> > >> >> > >>>
>> >> > >> >> > >>
>> >> > >> >> > >> I can't see how the visitor model helps in your processing of entities in parallel.
>> >> > >> >> > >> To me both approaches are strictly equivalent. Care to show some pseudo-code?
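
One way to picture the "Execute X on each" idea from this exchange is a data source that runs a callback on its own workers, so that processing (not just loading) happens in parallel and nothing funnels through a single shared queue. This is a sketch under assumptions: `PartitionedSource` and `forEachInParallel` are hypothetical names, the partitions stand in for data owned by different nodes, and a local thread pool stands in for remote execution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical data source split into partitions, one per simulated node.
class PartitionedSource {
    private final List<List<String>> partitions;

    PartitionedSource(List<List<String>> partitions) {
        this.partitions = partitions;
    }

    // "Execute X on each": one worker per partition runs the callback next to
    // its own data; the callback must therefore be thread-safe.
    void forEachInParallel(Consumer<String> callback) {
        ExecutorService workers =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> partition : partitions) {
                pending.add(workers.submit(() -> partition.forEach(callback)));
            }
            for (Future<?> future : pending) {
                future.get(); // wait for every partition and surface failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            workers.shutdown();
        }
    }
}
```

An Iterator-based caller would instead pull every element through one thread, and would need its own executor (and queue) to process elements concurrently.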
>> >> > >> >> _______________________________________________
>> >> > >> >> hibernate-dev mailing list
>> >> > >> >> hibernate-dev(a)lists.jboss.org
>> >> > >> >> https://lists.jboss.org/mailman/listinfo/hibernate-dev
