[hibernate-dev] Coordinates storage in Lucene index for spatial functionality

Nicolas Helleringer nicolas.helleringer at gmail.com
Fri May 11 03:40:11 EDT 2012


There, back and again ...

After fixing a bug in grid search here are some updated results on 2k calls

Degrees :
Mean time with Grid : 4.4897266425641025 ms. Average number of docs
 fetched : 2506.96
Mean time with Grid + Distance filter : 6.4930799487179485 ms. Average
number of docs  fetched : 425.33435897435896
Mean time with DoubleRange : 14.430638703076923 ms. Average number of docs
 fetched : 542.0410256410256
Mean time with DoubleRange + Distance filter : 20.483300545128206 ms.
Average number of docs  fetched : 425.33435897435896

Radians :
Mean time with Grid : 5.650845744102564 ms. Average number of docs  fetched
: 5074.830769230769
Mean time with Grid + Distance filter : 8.627138825128204 ms. Average
number of docs  fetched : 426.7902564102564
Mean time with DoubleRange : 15.337755502564102 ms. Average number of docs
 fetched : 1087.705641025641
Mean time with DoubleRange + Distance filter : 20.82852138769231 ms.
Average number of docs  fetched : 426.7902564102564

Next thing I do not explain yet is the distance filter overhead mismatch :
It is less on grid search with more docs to test than on DoubleRange.

Niko


2012/5/7 Nicolas Helleringer <nicolas.helleringer at gmail.com>

> Here are some results :
>
> Mean time with Grid : 4.9297471630769225 ms. Average number of docs
>  fetched : 2416.373846153846
> Mean time with Grid + Distance filter : 6.48634534 ms. Average number of
> docs  fetched : 425.84
> Mean time with DoubleRange : 15.39593650051282 ms. Average number of docs
>  fetched : 542.72
> Mean time with DoubleRange + Distance filter : 21.158394677435897 ms.
> Average number of docs  fetched : 425.8779487179487
>
> Sounds weird that with distance filter the two results are note the same.
> I shall investigate that.
>
> Niko
>
> 2012/5/7 Emmanuel Bernard <emmanuel at hibernate.org>
>
>> Do you know the average amount of POI that were filtered in memory but
>> the DistanceFilter during these runs?
>>
>> Emmanuel
>>
>> On 7 mai 2012, at 10:31, Nicolas Helleringer wrote:
>>
>> Hi all,
>>
>> I have done a radian patch/branch and some benchmarks on geonames french
>> database.
>>
>> Benchs are on 2k calls each run.
>>
>> Radians:
>> run 1
>> Mean time with Grid : 4.808043092820513 ms
>> Mean time with Grid + Distance filter : 6.571108878461538 ms
>> Mean time with DoubleRange : 14.62661525128205 ms
>> Mean time with DoubleRange + Distance filter : 20.143597923076925 ms
>>
>> run 2
>> Mean time with Grid : 5.290368523076923 ms
>> Mean time with Grid + Distance filter : 6.706567517435897 ms
>> Mean time with DoubleRange : 14.878960702564102 ms
>> Mean time with DoubleRange + Distance filter : 20.75806591948718 ms
>>
>> Degrees:
>> run 1
>> Mean time with Grid : 5.101956610769231 ms
>> Mean time with Grid + Distance filter : 6.548685109230769 ms
>> Mean time with DoubleRange : 14.767478146153845 ms
>> Mean time with DoubleRange + Distance filter : 20.668063972820512 ms
>>
>> run 2
>> Mean time with Grid : 4.683360031282051 ms
>> Mean time with Grid + Distance filter : 6.7065247435897435 ms
>> Mean time with DoubleRange : 14.617140157948716 ms
>> Mean time with DoubleRange + Distance filter : 20.074868595897435 ms
>>
>> The radian branch is here for review :
>> https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
>>
>> While moving from degrees to radians I have seen that DSL has still some
>> work to do.
>> I shall focus on that now.
>>
>> Niko
>>
>> 2012/5/3 Sanne Grinovero <sanne at hibernate.org>
>>
>>>
>>> On May 3, 2012 10:10 AM, "Emmanuel Bernard" <emmanuel at hibernate.org>
>>> wrote:
>>> >
>>> > How comes the DistanceFilter has to compute the distance for the whole
>>> corpus?
>>>
>>> You're right in that's not always the case, but it's possible. If there
>>> are more filters enabled and they are executed first, our filter will need
>>> to do the math only on the matched documents by the previous filters, but
>>> if there are no other constraints or filters our DistanceFilter might need
>>> to process all documents in all segments. This happens also when a limit is
>>> enabled on the collector - although limited to the current index segment -
>>> when the filter needs to be cached as it needs to evaluate each document in
>>> the segment.
>>>
>>> In our case this DistanceFilter is only applied after RangeQuery was
>>> applied on both longitude and latitude, so I'm not sure if this is a big
>>> problem; personally I was just wondering but I'd be fine in keeping this as
>>> a possible future improvement - but if we go for a separate issue, let's
>>> keep in mind that that the index format would not be backwards compatible.
>>>
>>>
>>> > By the way the actual storage (say via Hibernate ORM, or Infinispan)
>>> does not need to store in radian, so we don't need to do a conversion when
>>> reading an entity.
>>>
>>> Right, another reason to index only in whatever format makes querying
>>> more efficient.
>>>
>>> -- Sanne
>>>
>>>
>>> >
>>> > On 3 mai 2012, at 10:45, Sanne Grinovero wrote:
>>> >
>>> > > The reason for my comment is that the code is doing a conversion to
>>> > > radians in the DistanceFilter, which needs to be extremely efficient
>>> > > as it's not only applied on the resultset but potentially on the
>>> whole
>>> > > corpus of all Documents in the index.
>>> > > So even if it's true that conversion would be needed on the final
>>> > > results, we always expect people to retrieve only a limited amount of
>>> > > entities (like with pagination), while the index might need to
>>> perform
>>> > > this computation millions of times per query.
>>> > >
>>> > > If I look at the complexity of Point.getDistanceTo(double, double), I
>>> > > get a feeling that that method will hardly provide speedy queries
>>> > > because of the complex computations in it - this is just speculation
>>> > > at this point of course, to be sure we'd need to compare them with a
>>> > > large enough dataset, but it seems quite obvious that storing
>>> > > normalized radians should be more efficient as it would avoid a good
>>> > > deal of math to be executed on each Document in the index.
>>> > >
>>> > > Also if we assume people might want to use radians in their user data
>>> > > (I know some who definitely would never touch decimals for such a use
>>> > > case), there would be no need at all to convert the end result.
>>> > >
>>> > > Some more thoughts inline:
>>> > >
>>> > > On 3 May 2012 09:12, Nicolas Helleringer <
>>> nicolas.helleringer at gmail.com> wrote:
>>> > >> Hi all,
>>> > >>
>>> > >> Sanne and I have been wondering about the way the spatial
>>> > >> branch/module/functionality for Hibernate Search shall store its
>>> > >> coordinates in the Lucene index.
>>> > >>
>>> > >> Today it is implemented with decimal degree for :
>>> > >> - easy debugging/readability
>>> > >> - ease of conversion on storage as we want to accept mainly decimal
>>> degree
>>> > >> from users data
>>> > >
>>> > > Valid points, but consider that "storage" is going to be way slower
>>> > > anyway, and typically you'll process a Document to evaluate it for a
>>> > > hit many many orders of magnitude more frequently than the times you
>>> > > store it.
>>> > >
>>> > >>
>>> > >> Sanne pointed out that when the search is done there is quite a few
>>> > >> conversion to radians for distance calculation and suggested that
>>> we may
>>> > >> store directly coordinates under their radians form.
>>> > >>
>>> > >> I have tried a patch to implement this and as I was coding it I
>>> feel that
>>> > >> the code was less readable, in the coordinates normalisation mainly
>>> and
>>> > >> that there was as many conversion as before.
>>> > >> Conversions had moved from search to import / export of coordinates
>>> in and
>>> > >> out the spatial module scope to user scope.
>>> > >
>>> > > I'm sure the amount of points in the code in which they are converted
>>> > > won't change. I'm concerned about the cardinality of the collections
>>> > > on which it's applied ;)
>>> > > "Less readable" isn't nice, but we can work on that I guess?
>>> > >
>>> > >>
>>> > >> What the docs does not tell (yet), is that we are waiting for WGS
>>> 84 (this
>>> > >> is a coordinate system) decimal degree coordinates input, as these
>>> are
>>> > >> quite a de facto standard (GPS output this way).
>>> > >
>>> > > How does it affect this?
>>> > >
>>> > >>
>>> > >> Today this is not the purpose of Hibernate Search spatial
>>> initiative to
>>> > >> handle projections. There are opensource libs to handle that on
>>> user side
>>> > >> very well (Proj4j)
>>> > >>
>>> > >> So. The question is : shall we store as radians or decimal degree ?
>>> > >>
>>> > >> Niko
>>> > >>
>>> > >> P.S : Hope it is clear. If not ask for more.
>>> > >
>>> > > Thanks!
>>> > > Sanne
>>> > > _______________________________________________
>>> > > hibernate-dev mailing list
>>> > > hibernate-dev at lists.jboss.org
>>> > > https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>> >
>>>
>>
>>
>>
>


More information about the hibernate-dev mailing list