[hibernate-dev] Coordinates storage in Lucene index for spatial functionality
Nicolas Helleringer
nicolas.helleringer at gmail.com
Tue May 15 09:41:18 EDT 2012
>
> On a side note, what do you need System.exit(0); for ? You should
> close the SessionFactory.
>
Because I'm better with geo/data than with code =)
Thanks for pointing me in the right direction.
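
For the record, here is a minimal sketch of how I understand the suggestion, in plain Java. `BenchResource` is a hypothetical stand-in for the SessionFactory (the real bench would close the Hibernate one instead); the only point is that the resource is released in a `finally` block, so the JVM can stop on its own without System.exit(0).

```java
final class CloseInsteadOfExit {
    /** Hypothetical stand-in for the SessionFactory used by the bench. */
    static final class BenchResource implements AutoCloseable {
        private boolean open = true;

        boolean isOpen() {
            return open;
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** Runs the work and always releases the resource, even if the work fails. */
    static BenchResource runBench(Runnable work) {
        BenchResource factory = new BenchResource();
        try {
            work.run();
        } finally {
            factory.close();
        }
        return factory;
    }

    public static void main(String[] args) {
        BenchResource factory = runBench(() -> System.out.println("bench done"));
        // No System.exit(0) here: main simply returns once the resource is closed.
        System.out.println("still open? " + factory.isOpen());
    }
}
```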
The last series of numbers is from a 50k-call run in radian mode that lasted
45 minutes.
For each center the bench runs the 4 request modes, which adds up to ~45 ms
per loop iteration.
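
In case someone wants to rerun it elsewhere, here is roughly the shape of such a loop, as a sketch only. `SpatialQuery`, the seed value and the bounding box are assumptions made for illustration; this is not the actual bench code from the branch.

```java
import java.util.Random;

final class SeededBenchSketch {
    /** Hypothetical stand-in for one request mode (Grid, DoubleRange, ...). */
    interface SpatialQuery {
        int fetch(double latitude, double longitude);
    }

    /** Same seed, same sequence of centers, so two runs issue the same requests. */
    static double[][] randomCenters(long seed, int calls) {
        Random random = new Random(seed);
        double[][] centers = new double[calls][];
        for (int i = 0; i < calls; i++) {
            // Rough bounding box around metropolitan France (assumption).
            double latitude = 41.0 + random.nextDouble() * 10.0;
            double longitude = -5.0 + random.nextDouble() * 15.0;
            centers[i] = new double[] {latitude, longitude};
        }
        return centers;
    }

    /** Mean wall-clock time per call, in milliseconds, for one request mode. */
    static double meanMillis(SpatialQuery query, double[][] centers) {
        long totalNanos = 0L;
        for (double[] center : centers) {
            long start = System.nanoTime();
            query.fetch(center[0], center[1]);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / centers.length;
    }

    public static void main(String[] args) {
        double[][] centers = randomCenters(42L, 2000);
        SpatialQuery dummy = (lat, lon) -> 0; // placeholder for a real request mode
        System.out.println("Mean time : " + meanMillis(dummy, centers) + " ms");
    }
}
```

Running each of the four modes over the same `centers` array keeps the comparison fair between modes and between the degree and radian branches.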
Niko
> Cheers,
> Sanne
>
> On 15 May 2012 14:04, Nicolas Helleringer <nicolas.helleringer at gmail.com>
> wrote:
> > I set a seed on the random generator.
> >
> > Here are some results:
> >
> > Degrees 2K calls
> > Mean time with Grid : 4.769457488717949 ms. Average number of docs fetched : 2524.982564102564
> > Mean time with Grid + Distance filter : 6.501712946153845 ms. Average number of docs fetched : 426.1876923076923
> > Mean time with DoubleRange : 14.336663392307692 ms. Average number of docs fetched : 543.6035897435897
> > Mean time with DoubleRange + Distance filter : 19.7123163574359 ms. Average number of docs fetched : 426.1876923076923
> >
> > Radians 2K calls
> > Mean time with Grid : 4.430686068205128 ms. Average number of docs fetched : 2524.982564102564
> > Mean time with Grid + Distance filter : 6.717519717948718 ms. Average number of docs fetched : 426.1876923076923
> > Mean time with DoubleRange : 14.35186034 ms. Average number of docs fetched : 543.6035897435897
> > Mean time with DoubleRange + Distance filter : 20.073972284102563 ms. Average number of docs fetched : 426.1876923076923
> >
> > Radians 50k calls
> > Mean time with Grid : 4.440979528643216 ms. Average number of docs fetched : 2459.169386934673
> > Mean time with Grid + Distance filter : 6.722681398331658 ms. Average number of docs fetched : 416.2335879396985
> > Mean time with DoubleRange : 14.532376860201005 ms. Average number of docs fetched : 530.2923618090452
> > Mean time with DoubleRange + Distance filter : 20.21980649284422 ms. Average number of docs fetched : 416.2335879396985
> >
> > On the random part, you can see from the average number of docs on the
> > 2k-call runs that the seed did its work: the requests are the same.
> >
> > As you can see, there is not much of a difference between the 2k-call and
> > 50k-call runs.
> >
> > What I have also investigated is the overhead of the distance filter over
> > the double range approach. I fear that wrapping the lat,long range query
> > in a QueryWrapperFilter is costly, but I cannot prove it yet.
> >
> > Back to the main question : does radian storage give better performance ?
> > I cannot say with my test env. It seems pretty close to me.
> > Maybe someone can manage to launch the bench on a different environment.
> >
> > Niko
> >
> > PS : both branches are up to date in my github :
> > https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923 &
> > https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
> >
> > 2012/5/14 Nicolas Helleringer <nicolas.helleringer at gmail.com>
> >>>
> >>> maybe even simpler set a constant as the seed of your random
> >>> generator: should provide a reproducible sequence of values.
> >>
> >> /facepalm
> >> I should have guessed that :s
> >>
> >> Niko
> >>
> >>>
> >>> >>
> >>> >> On 11 May 2012 08:40, Nicolas Helleringer
> >>> >> <nicolas.helleringer at gmail.com>
> >>> >> wrote:
> >>> >> > There, back and again ...
> >>> >> >
> >>> >> > After fixing a bug in grid search, here are some updated results on 2k calls
> >>> >> >
> >>> >> > Degrees :
> >>> >> > Mean time with Grid : 4.4897266425641025 ms. Average number of docs fetched : 2506.96
> >>> >> > Mean time with Grid + Distance filter : 6.4930799487179485 ms. Average number of docs fetched : 425.33435897435896
> >>> >> > Mean time with DoubleRange : 14.430638703076923 ms. Average number of docs fetched : 542.0410256410256
> >>> >> > Mean time with DoubleRange + Distance filter : 20.483300545128206 ms. Average number of docs fetched : 425.33435897435896
> >>> >> >
> >>> >> > Radians :
> >>> >> > Mean time with Grid : 5.650845744102564 ms. Average number of docs fetched : 5074.830769230769
> >>> >> > Mean time with Grid + Distance filter : 8.627138825128204 ms. Average number of docs fetched : 426.7902564102564
> >>> >> > Mean time with DoubleRange : 15.337755502564102 ms. Average number of docs fetched : 1087.705641025641
> >>> >> > Mean time with DoubleRange + Distance filter : 20.82852138769231 ms. Average number of docs fetched : 426.7902564102564
> >>> >> >
> >>> >> > The next thing I cannot explain yet is the distance filter overhead mismatch :
> >>> >> > it is lower on grid search, which has more docs to test, than on DoubleRange.
> >>> >> >
> >>> >> > Niko
> >>> >> >
> >>> >> >
> >>> >> > 2012/5/7 Nicolas Helleringer <nicolas.helleringer at gmail.com>
> >>> >> >>
> >>> >> >> Here are some results :
> >>> >> >>
> >>> >> >> Mean time with Grid : 4.9297471630769225 ms. Average number of docs fetched : 2416.373846153846
> >>> >> >> Mean time with Grid + Distance filter : 6.48634534 ms. Average number of docs fetched : 425.84
> >>> >> >> Mean time with DoubleRange : 15.39593650051282 ms. Average number of docs fetched : 542.72
> >>> >> >> Mean time with DoubleRange + Distance filter : 21.158394677435897 ms. Average number of docs fetched : 425.8779487179487
> >>> >> >>
> >>> >> >> It seems weird that the two results with the distance filter are not the same.
> >>> >> >> I shall investigate that.
> >>> >> >>
> >>> >> >> Niko
> >>> >> >>
> >>> >> >> 2012/5/7 Emmanuel Bernard <emmanuel at hibernate.org>
> >>> >> >>>
> >>> >> >>> Do you know the average number of POIs that were filtered in memory by
> >>> >> >>> the DistanceFilter during these runs?
> >>> >> >>>
> >>> >> >>> Emmanuel
> >>> >> >>>
> >>> >> >>> On 7 mai 2012, at 10:31, Nicolas Helleringer wrote:
> >>> >> >>>
> >>> >> >>> Hi all,
> >>> >> >>>
> >>> >> >>> I have done a radian patch/branch and some benchmarks on the French
> >>> >> >>> geonames database.
> >>> >> >>>
> >>> >> >>> Benchs are on 2k calls each run.
> >>> >> >>>
> >>> >> >>> Radians:
> >>> >> >>> run 1
> >>> >> >>> Mean time with Grid : 4.808043092820513 ms
> >>> >> >>> Mean time with Grid + Distance filter : 6.571108878461538 ms
> >>> >> >>> Mean time with DoubleRange : 14.62661525128205 ms
> >>> >> >>> Mean time with DoubleRange + Distance filter : 20.143597923076925 ms
> >>> >> >>>
> >>> >> >>> run 2
> >>> >> >>> Mean time with Grid : 5.290368523076923 ms
> >>> >> >>> Mean time with Grid + Distance filter : 6.706567517435897 ms
> >>> >> >>> Mean time with DoubleRange : 14.878960702564102 ms
> >>> >> >>> Mean time with DoubleRange + Distance filter : 20.75806591948718 ms
> >>> >> >>>
> >>> >> >>> Degrees:
> >>> >> >>> run 1
> >>> >> >>> Mean time with Grid : 5.101956610769231 ms
> >>> >> >>> Mean time with Grid + Distance filter : 6.548685109230769 ms
> >>> >> >>> Mean time with DoubleRange : 14.767478146153845 ms
> >>> >> >>> Mean time with DoubleRange + Distance filter : 20.668063972820512 ms
> >>> >> >>>
> >>> >> >>> run 2
> >>> >> >>> Mean time with Grid : 4.683360031282051 ms
> >>> >> >>> Mean time with Grid + Distance filter : 6.7065247435897435 ms
> >>> >> >>> Mean time with DoubleRange : 14.617140157948716 ms
> >>> >> >>> Mean time with DoubleRange + Distance filter : 20.074868595897435 ms
> >>> >> >>>
> >>> >> >>> The radian branch is here for review :
> >>> >> >>> https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
> >>> >> >>>
> >>> >> >>> While moving from degrees to radians I have seen that the DSL still
> >>> >> >>> needs some work.
> >>> >> >>> I shall focus on that now.
> >>> >> >>>
> >>> >> >>> Niko
> >>> >> >>>
> >>> >> >>> 2012/5/3 Sanne Grinovero <sanne at hibernate.org>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> On May 3, 2012 10:10 AM, "Emmanuel Bernard"
> >>> >> >>>> <emmanuel at hibernate.org>
> >>> >> >>>> wrote:
> >>> >> >>>> >
> >>> >> >>>> > How come the DistanceFilter has to compute the distance for the
> >>> >> >>>> > whole corpus?
> >>> >> >>>>
> >>> >> >>>> You're right that it's not always the case, but it's possible. If there
> >>> >> >>>> are more filters enabled and they are executed first, our filter will
> >>> >> >>>> need to do the math only on the documents matched by the previous
> >>> >> >>>> filters, but if there are no other constraints or filters our
> >>> >> >>>> DistanceFilter might need to process all documents in all segments.
> >>> >> >>>> This also happens when a limit is enabled on the collector - although
> >>> >> >>>> limited to the current index segment - when the filter needs to be
> >>> >> >>>> cached, as it needs to evaluate each document in the segment.
> >>> >> >>>>
> >>> >> >>>> In our case this DistanceFilter is only applied after RangeQuery was
> >>> >> >>>> applied on both longitude and latitude, so I'm not sure if this is a
> >>> >> >>>> big problem; personally I was just wondering but I'd be fine with
> >>> >> >>>> keeping this as a possible future improvement - but if we go for a
> >>> >> >>>> separate issue, let's keep in mind that the index format would not be
> >>> >> >>>> backwards compatible.
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> > By the way the actual storage (say via Hibernate ORM, or
> >>> >> >>>> > Infinispan)
> >>> >> >>>> > does not need to store in radian, so we don't need to do a
> >>> >> >>>> > conversion when
> >>> >> >>>> > reading an entity.
> >>> >> >>>>
> >>> >> >>>> Right, another reason to index only in whatever format makes
> >>> >> >>>> querying
> >>> >> >>>> more efficient.
> >>> >> >>>>
> >>> >> >>>> -- Sanne
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> >
> >>> >> >>>> > On 3 mai 2012, at 10:45, Sanne Grinovero wrote:
> >>> >> >>>> >
> >>> >> >>>> > > The reason for my comment is that the code is doing a conversion to
> >>> >> >>>> > > radians in the DistanceFilter, which needs to be extremely efficient
> >>> >> >>>> > > as it's not only applied on the resultset but potentially on the whole
> >>> >> >>>> > > corpus of all Documents in the index.
> >>> >> >>>> > > So even if it's true that conversion would be needed on the final
> >>> >> >>>> > > results, we always expect people to retrieve only a limited amount of
> >>> >> >>>> > > entities (like with pagination), while the index might need to perform
> >>> >> >>>> > > this computation millions of times per query.
> >>> >> >>>> > >
> >>> >> >>>> > > If I look at the complexity of Point.getDistanceTo(double, double), I
> >>> >> >>>> > > get a feeling that that method will hardly provide speedy queries
> >>> >> >>>> > > because of the complex computations in it - this is just speculation
> >>> >> >>>> > > at this point of course, to be sure we'd need to compare them with a
> >>> >> >>>> > > large enough dataset, but it seems quite obvious that storing
> >>> >> >>>> > > normalized radians should be more efficient as it would avoid a good
> >>> >> >>>> > > deal of math to be executed on each Document in the index.
> >>> >> >>>> > >
> >>> >> >>>> > > Also if we assume people might want to use radians in their user data
> >>> >> >>>> > > (I know some who definitely would never touch decimals for such a use
> >>> >> >>>> > > case), there would be no need at all to convert the end result.
> >>> >> >>>> > >
> >>> >> >>>> > > Some more thoughts inline:
> >>> >> >>>> > >
> >>> >> >>>> > > On 3 May 2012 09:12, Nicolas Helleringer
> >>> >> >>>> > > <nicolas.helleringer at gmail.com> wrote:
> >>> >> >>>> > >> Hi all,
> >>> >> >>>> > >>
> >>> >> >>>> > >> Sanne and I have been wondering about the way the spatial
> >>> >> >>>> > >> branch/module/functionality for Hibernate Search shall store its
> >>> >> >>>> > >> coordinates in the Lucene index.
> >>> >> >>>> > >>
> >>> >> >>>> > >> Today it is implemented with decimal degrees for :
> >>> >> >>>> > >> - easy debugging/readability
> >>> >> >>>> > >> - ease of conversion on storage, as we want to accept mainly decimal
> >>> >> >>>> > >> degrees from users' data
> >>> >> >>>> > >
> >>> >> >>>> > > Valid points, but consider that "storage" is going to be way slower
> >>> >> >>>> > > anyway, and typically you'll process a Document to evaluate it for a
> >>> >> >>>> > > hit many many orders of magnitude more frequently than the times you
> >>> >> >>>> > > store it.
> >>> >> >>>> > >
> >>> >> >>>> > >>
> >>> >> >>>> > >> Sanne pointed out that when the search is done there are quite a few
> >>> >> >>>> > >> conversions to radians for distance calculation, and suggested that we
> >>> >> >>>> > >> may directly store coordinates in their radian form.
> >>> >> >>>> > >>
> >>> >> >>>> > >> I have tried a patch to implement this, and as I was coding it I felt
> >>> >> >>>> > >> that the code was less readable, mainly in the coordinates
> >>> >> >>>> > >> normalisation, and that there were as many conversions as before.
> >>> >> >>>> > >> Conversions had moved from search to import / export of coordinates in
> >>> >> >>>> > >> and out of the spatial module scope to user scope.
> >>> >> >>>> > >
> >>> >> >>>> > > I'm sure the amount of points in the code in which they are converted
> >>> >> >>>> > > won't change. I'm concerned about the cardinality of the collections
> >>> >> >>>> > > on which it's applied ;)
> >>> >> >>>> > > "Less readable" isn't nice, but we can work on that I guess?
> >>> >> >>>> > >
> >>> >> >>>> > >>
> >>> >> >>>> > >> What the docs do not say (yet) is that we expect WGS 84 (this is a
> >>> >> >>>> > >> coordinate system) decimal degree coordinates as input, as these are
> >>> >> >>>> > >> quite a de facto standard (GPS outputs them this way).
> >>> >> >>>> > >
> >>> >> >>>> > > How does it affect this?
> >>> >> >>>> > >
> >>> >> >>>> > >>
> >>> >> >>>> > >> Today it is not the purpose of the Hibernate Search spatial initiative
> >>> >> >>>> > >> to handle projections. There are open source libs that handle that
> >>> >> >>>> > >> very well on the user side (Proj4j).
> >>> >> >>>> > >>
> >>> >> >>>> > >> So. The question is : shall we store as radians or decimal
> >>> >> >>>> > >> degree ?
> >>> >> >>>> > >>
> >>> >> >>>> > >> Niko
> >>> >> >>>> > >>
> >>> >> >>>> > >> P.S : Hope it is clear. If not ask for more.
> >>> >> >>>> > >
> >>> >> >>>> > > Thanks!
> >>> >> >>>> > > Sanne
> >>> >> >>>> > > _______________________________________________
> >>> >> >>>> > > hibernate-dev mailing list
> >>> >> >>>> > > hibernate-dev at lists.jboss.org
> >>> >> >>>> > > https://lists.jboss.org/mailman/listinfo/hibernate-dev
> >>> >> >>>> >
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >>
> >>> >> >
> >>> >
> >>> >
> >>
> >>
> >
>
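
To make the per-document conversion cost discussed in the thread concrete, here is a small sketch. It uses the haversine formula and a mean Earth radius of 6371 km, both of which are assumptions for illustration; it is not the code of Point.getDistanceTo, and the class and method names are made up.

```java
final class RadianStorageSketch {
    static final double EARTH_MEAN_RADIUS_KM = 6371.0;

    /** Haversine distance in km when both points are already stored in radians. */
    static double distanceRadians(double lat1, double lon1, double lat2, double lon2) {
        double sinHalfLat = Math.sin((lat2 - lat1) / 2.0);
        double sinHalfLon = Math.sin((lon2 - lon1) / 2.0);
        double a = sinHalfLat * sinHalfLat
                + Math.cos(lat1) * Math.cos(lat2) * sinHalfLon * sinHalfLon;
        // Clamp to 1.0 to guard against rounding errors before asin.
        return 2.0 * EARTH_MEAN_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Same distance from decimal degrees: converts all four values on every call. */
    static double distanceDegrees(double lat1, double lon1, double lat2, double lon2) {
        return distanceRadians(
                Math.toRadians(lat1), Math.toRadians(lon1),
                Math.toRadians(lat2), Math.toRadians(lon2));
    }

    public static void main(String[] args) {
        // Paris to Lyon, in WGS 84 decimal degrees.
        double km = distanceDegrees(48.8566, 2.3522, 45.7640, 4.8357);
        System.out.println("Paris - Lyon : " + km + " km");
    }
}
```

With radian storage only the search center needs converting once per query, whereas with degree storage the conversions are repeated for every candidate document. Whether that shows up in practice is what the benchmarks above try to measure.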