[hibernate-dev] Coordinates storage in Lucene index for spatial functionality

Nicolas Helleringer nicolas.helleringer at gmail.com
Mon May 14 16:36:49 EDT 2012


Hi Sanne,

# Functionality
> Why is the number of documents fetched  via Degrees approximately half
> of what you're fetching via Radians?
>
The test is randomized at this time. I do not have a fixed test set :s sorry
In one run the same points are use for the 4 type of request.
But switching radians<=>degrees branch and then it is a whole new test.
I ll work on that to have a fixed center points set.


> This seems to be true for both "Grid" and "DoubleRange", while the
> numbers (approximately) match when you apply the Distance Filter as
> well.
>
The dynamic is the same on the two approach :
1°) First having a super set of candidates by querying the Lucene Index
(avoiding distance calculation again the whole dataset)
2°) Refining candidates by true distance calculation


> It was my understanding that using either Degrees or Radians in the
> index should just be a performance difference, with no difference in
> functionality (like result numbers) ?
>
Yes as soon as the dataset will be non random each run


> Applying the Distance Filter seems to fix the result, but this filter
> is optional right?
>
For exact answer to the question 'What are the document at x km max of
point (lat,long), no it is not optional as the
first request to the Lucene index returns false candidates.


# Performance
> The overhead mismatch you point out sheds some suspicion on the
> overall test validity as I see no reason either.. how long are you
> running these tests? I've had very bad experiences with short
> performance tests; I've since learnt do use -XX:+PrintCompilation to
> know how long (at the bare minimum) I have to run any test before JIT
> has finished compiling all the relevant code.
>
Right.
I will run much more longer tests.

Thanks Sanne


> On 11 May 2012 08:40, Nicolas Helleringer <nicolas.helleringer at gmail.com>
> wrote:
> > There, back and again ...
> >
> > After fixing a bug in grid search here are some updated results on 2k
> calls
> >
> > Degrees :
> > Mean time with Grid : 4.4897266425641025 ms. Average number of docs
>  fetched
> > : 2506.96
> > Mean time with Grid + Distance filter : 6.4930799487179485 ms. Average
> > number of docs  fetched : 425.33435897435896
> > Mean time with DoubleRange : 14.430638703076923 ms. Average number of
> docs
> >  fetched : 542.0410256410256
> > Mean time with DoubleRange + Distance filter : 20.483300545128206 ms.
> > Average number of docs  fetched : 425.33435897435896
> >
> > Radians :
> > Mean time with Grid : 5.650845744102564 ms. Average number of docs
>  fetched
> > : 5074.830769230769
> > Mean time with Grid + Distance filter : 8.627138825128204 ms. Average
> number
> > of docs  fetched : 426.7902564102564
> > Mean time with DoubleRange : 15.337755502564102 ms. Average number of
> docs
> >  fetched : 1087.705641025641
> > Mean time with DoubleRange + Distance filter : 20.82852138769231 ms.
> Average
> > number of docs  fetched : 426.7902564102564
> >
> > Next thing I do not explain yet is the distance filter overhead mismatch
> :
> > It is less on grid search with more docs to test than on DoubleRange.
> >
> > Niko
> >
> >
> > 2012/5/7 Nicolas Helleringer <nicolas.helleringer at gmail.com>
> >>
> >> Here are some results :
> >>
> >> Mean time with Grid : 4.9297471630769225 ms. Average number of docs
> >>  fetched : 2416.373846153846
> >> Mean time with Grid + Distance filter : 6.48634534 ms. Average number of
> >> docs  fetched : 425.84
> >> Mean time with DoubleRange : 15.39593650051282 ms. Average number of
> docs
> >>  fetched : 542.72
> >> Mean time with DoubleRange + Distance filter : 21.158394677435897 ms.
> >> Average number of docs  fetched : 425.8779487179487
> >>
> >> Sounds weird that with distance filter the two results are note the
> same.
> >> I shall investigate that.
> >>
> >> Niko
> >>
> >> 2012/5/7 Emmanuel Bernard <emmanuel at hibernate.org>
> >>>
> >>> Do you know the average amount of POI that were filtered in memory but
> >>> the DistanceFilter during these runs?
> >>>
> >>> Emmanuel
> >>>
> >>> On 7 mai 2012, at 10:31, Nicolas Helleringer wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I have done a radian patch/branch and some benchmarks on geonames
> french
> >>> database.
> >>>
> >>> Benchs are on 2k calls each run.
> >>>
> >>> Radians:
> >>> run 1
> >>> Mean time with Grid : 4.808043092820513 ms
> >>> Mean time with Grid + Distance filter : 6.571108878461538 ms
> >>> Mean time with DoubleRange : 14.62661525128205 ms
> >>> Mean time with DoubleRange + Distance filter : 20.143597923076925 ms
> >>>
> >>> run 2
> >>> Mean time with Grid : 5.290368523076923 ms
> >>> Mean time with Grid + Distance filter : 6.706567517435897 ms
> >>> Mean time with DoubleRange : 14.878960702564102 ms
> >>> Mean time with DoubleRange + Distance filter : 20.75806591948718 ms
> >>>
> >>> Degrees:
> >>> run 1
> >>> Mean time with Grid : 5.101956610769231 ms
> >>> Mean time with Grid + Distance filter : 6.548685109230769 ms
> >>> Mean time with DoubleRange : 14.767478146153845 ms
> >>> Mean time with DoubleRange + Distance filter : 20.668063972820512 ms
> >>>
> >>> run 2
> >>> Mean time with Grid : 4.683360031282051 ms
> >>> Mean time with Grid + Distance filter : 6.7065247435897435 ms
> >>> Mean time with DoubleRange : 14.617140157948716 ms
> >>> Mean time with DoubleRange + Distance filter : 20.074868595897435 ms
> >>>
> >>> The radian branch is here for review
> >>> :
> https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
> >>>
> >>> While moving from degrees to radians I have seen that DSL has still
> some
> >>> work to do.
> >>> I shall focus on that now.
> >>>
> >>> Niko
> >>>
> >>> 2012/5/3 Sanne Grinovero <sanne at hibernate.org>
> >>>>
> >>>>
> >>>> On May 3, 2012 10:10 AM, "Emmanuel Bernard" <emmanuel at hibernate.org>
> >>>> wrote:
> >>>> >
> >>>> > How comes the DistanceFilter has to compute the distance for the
> whole
> >>>> > corpus?
> >>>>
> >>>> You're right in that's not always the case, but it's possible. If
> there
> >>>> are more filters enabled and they are executed first, our filter will
> need
> >>>> to do the math only on the matched documents by the previous filters,
> but if
> >>>> there are no other constraints or filters our DistanceFilter might
> need to
> >>>> process all documents in all segments. This happens also when a limit
> is
> >>>> enabled on the collector - although limited to the current index
> segment -
> >>>> when the filter needs to be cached as it needs to evaluate each
> document in
> >>>> the segment.
> >>>>
> >>>> In our case this DistanceFilter is only applied after RangeQuery was
> >>>> applied on both longitude and latitude, so I'm not sure if this is a
> big
> >>>> problem; personally I was just wondering but I'd be fine in keeping
> this as
> >>>> a possible future improvement - but if we go for a separate issue,
> let's
> >>>> keep in mind that that the index format would not be backwards
> compatible.
> >>>>
> >>>>
> >>>>
> >>>> > By the way the actual storage (say via Hibernate ORM, or Infinispan)
> >>>> > does not need to store in radian, so we don't need to do a
> conversion when
> >>>> > reading an entity.
> >>>>
> >>>> Right, another reason to index only in whatever format makes querying
> >>>> more efficient.
> >>>>
> >>>> -- Sanne
> >>>>
> >>>>
> >>>> >
> >>>> > On 3 mai 2012, at 10:45, Sanne Grinovero wrote:
> >>>> >
> >>>> > > The reason for my comment is that the code is doing a conversion
> to
> >>>> > > radians in the DistanceFilter, which needs to be extremely
> efficient
> >>>> > > as it's not only applied on the resultset but potentially on the
> >>>> > > whole
> >>>> > > corpus of all Documents in the index.
> >>>> > > So even if it's true that conversion would be needed on the final
> >>>> > > results, we always expect people to retrieve only a limited amount
> >>>> > > of
> >>>> > > entities (like with pagination), while the index might need to
> >>>> > > perform
> >>>> > > this computation millions of times per query.
> >>>> > >
> >>>> > > If I look at the complexity of Point.getDistanceTo(double,
> double),
> >>>> > > I
> >>>> > > get a feeling that that method will hardly provide speedy queries
> >>>> > > because of the complex computations in it - this is just
> speculation
> >>>> > > at this point of course, to be sure we'd need to compare them
> with a
> >>>> > > large enough dataset, but it seems quite obvious that storing
> >>>> > > normalized radians should be more efficient as it would avoid a
> good
> >>>> > > deal of math to be executed on each Document in the index.
> >>>> > >
> >>>> > > Also if we assume people might want to use radians in their user
> >>>> > > data
> >>>> > > (I know some who definitely would never touch decimals for such a
> >>>> > > use
> >>>> > > case), there would be no need at all to convert the end result.
> >>>> > >
> >>>> > > Some more thoughts inline:
> >>>> > >
> >>>> > > On 3 May 2012 09:12, Nicolas Helleringer
> >>>> > > <nicolas.helleringer at gmail.com> wrote:
> >>>> > >> Hi all,
> >>>> > >>
> >>>> > >> Sanne and I have been wondering about the way the spatial
> >>>> > >> branch/module/functionality for Hibernate Search shall store its
> >>>> > >> coordinates in the Lucene index.
> >>>> > >>
> >>>> > >> Today it is implemented with decimal degree for :
> >>>> > >> - easy debugging/readability
> >>>> > >> - ease of conversion on storage as we want to accept mainly
> decimal
> >>>> > >> degree
> >>>> > >> from users data
> >>>> > >
> >>>> > > Valid points, but consider that "storage" is going to be way
> slower
> >>>> > > anyway, and typically you'll process a Document to evaluate it
> for a
> >>>> > > hit many many orders of magnitude more frequently than the times
> you
> >>>> > > store it.
> >>>> > >
> >>>> > >>
> >>>> > >> Sanne pointed out that when the search is done there is quite a
> few
> >>>> > >> conversion to radians for distance calculation and suggested that
> >>>> > >> we may
> >>>> > >> store directly coordinates under their radians form.
> >>>> > >>
> >>>> > >> I have tried a patch to implement this and as I was coding it I
> >>>> > >> feel that
> >>>> > >> the code was less readable, in the coordinates normalisation
> mainly
> >>>> > >> and
> >>>> > >> that there was as many conversion as before.
> >>>> > >> Conversions had moved from search to import / export of
> coordinates
> >>>> > >> in and
> >>>> > >> out the spatial module scope to user scope.
> >>>> > >
> >>>> > > I'm sure the amount of points in the code in which they are
> >>>> > > converted
> >>>> > > won't change. I'm concerned about the cardinality of the
> collections
> >>>> > > on which it's applied ;)
> >>>> > > "Less readable" isn't nice, but we can work on that I guess?
> >>>> > >
> >>>> > >>
> >>>> > >> What the docs does not tell (yet), is that we are waiting for WGS
> >>>> > >> 84 (this
> >>>> > >> is a coordinate system) decimal degree coordinates input, as
> these
> >>>> > >> are
> >>>> > >> quite a de facto standard (GPS output this way).
> >>>> > >
> >>>> > > How does it affect this?
> >>>> > >
> >>>> > >>
> >>>> > >> Today this is not the purpose of Hibernate Search spatial
> >>>> > >> initiative to
> >>>> > >> handle projections. There are opensource libs to handle that on
> >>>> > >> user side
> >>>> > >> very well (Proj4j)
> >>>> > >>
> >>>> > >> So. The question is : shall we store as radians or decimal
> degree ?
> >>>> > >>
> >>>> > >> Niko
> >>>> > >>
> >>>> > >> P.S : Hope it is clear. If not ask for more.
> >>>> > >
> >>>> > > Thanks!
> >>>> > > Sanne
> >>>> > > _______________________________________________
> >>>> > > hibernate-dev mailing list
> >>>> > > hibernate-dev at lists.jboss.org
> >>>> > > https://lists.jboss.org/mailman/listinfo/hibernate-dev
> >>>> >
> >>>
> >>>
> >>>
> >>
> >
>


More information about the hibernate-dev mailing list