On 14 May 2012 21:36, Nicolas Helleringer <nicolas.helleringer(a)gmail.com> wrote:
Hi Sanne,
> # Functionality
> Why is the number of documents fetched via Degrees approximately half
> of what you're fetching via Radians?
The test is randomized at this time. I do not have a fixed test set :s sorry
In one run the same points are use for the 4 type of request.
But switching radians<=>degrees branch and then it is a whole new test.
I ll work on that to have a fixed center points set.
Ha that helps to understand these figures, thanks!
Could you make sure they are run in the same conditions and input
data? You could either dump the testdata as serialized into a file, or
maybe even simpler set a constant as the seed of your random
generator: should provide a reproducible sequence of values.
It's obviously very important to use the same data set to be able to
compare them in some way.
>
> This seems to be true for both "Grid" and "DoubleRange", while
the
> numbers (approximately) match when you apply the Distance Filter as
> well.
The dynamic is the same on the two approach :
1°) First having a super set of candidates by querying the Lucene Index
(avoiding distance calculation again the whole dataset)
2°) Refining candidates by true distance calculation
>
> It was my understanding that using either Degrees or Radians in the
> index should just be a performance difference, with no difference in
> functionality (like result numbers) ?
Yes as soon as the dataset will be non random each run
>
> Applying the Distance Filter seems to fix the result, but this filter
> is optional right?
For exact answer to the question 'What are the document at x km max of point
(lat,long), no it is not optional as the
first request to the Lucene index returns false candidates.
Sure, I should have said "optional for better precision". It just
looked like one was misbehaving, but since they are random..
> # Performance
> The overhead mismatch you point out sheds some suspicion on the
> overall test validity as I see no reason either.. how long are you
> running these tests? I've had very bad experiences with short
> performance tests; I've since learnt do use -XX:+PrintCompilation to
> know how long (at the bare minimum) I have to run any test before JIT
> has finished compiling all the relevant code.
Right.
I will run much more longer tests.
Thanks Sanne
Thanks! looking forward to more numbers to think about :)
Cheers,
Sanne
>
> On 11 May 2012 08:40, Nicolas Helleringer <nicolas.helleringer(a)gmail.com>
> wrote:
> > There, back and again ...
> >
> > After fixing a bug in grid search here are some updated results on 2k
> > calls
> >
> > Degrees :
> > Mean time with Grid : 4.4897266425641025 ms. Average number of docs
> > fetched
> > : 2506.96
> > Mean time with Grid + Distance filter : 6.4930799487179485 ms. Average
> > number of docs fetched : 425.33435897435896
> > Mean time with DoubleRange : 14.430638703076923 ms. Average number of
> > docs
> > fetched : 542.0410256410256
> > Mean time with DoubleRange + Distance filter : 20.483300545128206 ms.
> > Average number of docs fetched : 425.33435897435896
> >
> > Radians :
> > Mean time with Grid : 5.650845744102564 ms. Average number of docs
> > fetched
> > : 5074.830769230769
> > Mean time with Grid + Distance filter : 8.627138825128204 ms. Average
> > number
> > of docs fetched : 426.7902564102564
> > Mean time with DoubleRange : 15.337755502564102 ms. Average number of
> > docs
> > fetched : 1087.705641025641
> > Mean time with DoubleRange + Distance filter : 20.82852138769231 ms.
> > Average
> > number of docs fetched : 426.7902564102564
> >
> > Next thing I do not explain yet is the distance filter overhead mismatch
> > :
> > It is less on grid search with more docs to test than on DoubleRange.
> >
> > Niko
> >
> >
> > 2012/5/7 Nicolas Helleringer <nicolas.helleringer(a)gmail.com>
> >>
> >> Here are some results :
> >>
> >> Mean time with Grid : 4.9297471630769225 ms. Average number of docs
> >> fetched : 2416.373846153846
> >> Mean time with Grid + Distance filter : 6.48634534 ms. Average number
> >> of
> >> docs fetched : 425.84
> >> Mean time with DoubleRange : 15.39593650051282 ms. Average number of
> >> docs
> >> fetched : 542.72
> >> Mean time with DoubleRange + Distance filter : 21.158394677435897 ms.
> >> Average number of docs fetched : 425.8779487179487
> >>
> >> Sounds weird that with distance filter the two results are note the
> >> same.
> >> I shall investigate that.
> >>
> >> Niko
> >>
> >> 2012/5/7 Emmanuel Bernard <emmanuel(a)hibernate.org>
> >>>
> >>> Do you know the average amount of POI that were filtered in memory but
> >>> the DistanceFilter during these runs?
> >>>
> >>> Emmanuel
> >>>
> >>> On 7 mai 2012, at 10:31, Nicolas Helleringer wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I have done a radian patch/branch and some benchmarks on geonames
> >>> french
> >>> database.
> >>>
> >>> Benchs are on 2k calls each run.
> >>>
> >>> Radians:
> >>> run 1
> >>> Mean time with Grid : 4.808043092820513 ms
> >>> Mean time with Grid + Distance filter : 6.571108878461538 ms
> >>> Mean time with DoubleRange : 14.62661525128205 ms
> >>> Mean time with DoubleRange + Distance filter : 20.143597923076925 ms
> >>>
> >>> run 2
> >>> Mean time with Grid : 5.290368523076923 ms
> >>> Mean time with Grid + Distance filter : 6.706567517435897 ms
> >>> Mean time with DoubleRange : 14.878960702564102 ms
> >>> Mean time with DoubleRange + Distance filter : 20.75806591948718 ms
> >>>
> >>> Degrees:
> >>> run 1
> >>> Mean time with Grid : 5.101956610769231 ms
> >>> Mean time with Grid + Distance filter : 6.548685109230769 ms
> >>> Mean time with DoubleRange : 14.767478146153845 ms
> >>> Mean time with DoubleRange + Distance filter : 20.668063972820512 ms
> >>>
> >>> run 2
> >>> Mean time with Grid : 4.683360031282051 ms
> >>> Mean time with Grid + Distance filter : 6.7065247435897435 ms
> >>> Mean time with DoubleRange : 14.617140157948716 ms
> >>> Mean time with DoubleRange + Distance filter : 20.074868595897435 ms
> >>>
> >>> The radian branch is here for review
> >>>
> >>>
: https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
> >>>
> >>> While moving from degrees to radians I have seen that DSL has still
> >>> some
> >>> work to do.
> >>> I shall focus on that now.
> >>>
> >>> Niko
> >>>
> >>> 2012/5/3 Sanne Grinovero <sanne(a)hibernate.org>
> >>>>
> >>>>
> >>>> On May 3, 2012 10:10 AM, "Emmanuel Bernard"
<emmanuel(a)hibernate.org>
> >>>> wrote:
> >>>> >
> >>>> > How comes the DistanceFilter has to compute the distance for
the
> >>>> > whole
> >>>> > corpus?
> >>>>
> >>>> You're right in that's not always the case, but it's
possible. If
> >>>> there
> >>>> are more filters enabled and they are executed first, our filter
will
> >>>> need
> >>>> to do the math only on the matched documents by the previous
filters,
> >>>> but if
> >>>> there are no other constraints or filters our DistanceFilter might
> >>>> need to
> >>>> process all documents in all segments. This happens also when a
limit
> >>>> is
> >>>> enabled on the collector - although limited to the current index
> >>>> segment -
> >>>> when the filter needs to be cached as it needs to evaluate each
> >>>> document in
> >>>> the segment.
> >>>>
> >>>> In our case this DistanceFilter is only applied after RangeQuery
was
> >>>> applied on both longitude and latitude, so I'm not sure if this
is a
> >>>> big
> >>>> problem; personally I was just wondering but I'd be fine in
keeping
> >>>> this as
> >>>> a possible future improvement - but if we go for a separate issue,
> >>>> let's
> >>>> keep in mind that that the index format would not be backwards
> >>>> compatible.
> >>>>
> >>>>
> >>>>
> >>>> > By the way the actual storage (say via Hibernate ORM, or
> >>>> > Infinispan)
> >>>> > does not need to store in radian, so we don't need to do a
> >>>> > conversion when
> >>>> > reading an entity.
> >>>>
> >>>> Right, another reason to index only in whatever format makes
querying
> >>>> more efficient.
> >>>>
> >>>> -- Sanne
> >>>>
> >>>>
> >>>> >
> >>>> > On 3 mai 2012, at 10:45, Sanne Grinovero wrote:
> >>>> >
> >>>> > > The reason for my comment is that the code is doing a
conversion
> >>>> > > to
> >>>> > > radians in the DistanceFilter, which needs to be
extremely
> >>>> > > efficient
> >>>> > > as it's not only applied on the resultset but
potentially on the
> >>>> > > whole
> >>>> > > corpus of all Documents in the index.
> >>>> > > So even if it's true that conversion would be needed
on the final
> >>>> > > results, we always expect people to retrieve only a
limited
> >>>> > > amount
> >>>> > > of
> >>>> > > entities (like with pagination), while the index might
need to
> >>>> > > perform
> >>>> > > this computation millions of times per query.
> >>>> > >
> >>>> > > If I look at the complexity of
Point.getDistanceTo(double,
> >>>> > > double),
> >>>> > > I
> >>>> > > get a feeling that that method will hardly provide speedy
queries
> >>>> > > because of the complex computations in it - this is just
> >>>> > > speculation
> >>>> > > at this point of course, to be sure we'd need to
compare them
> >>>> > > with a
> >>>> > > large enough dataset, but it seems quite obvious that
storing
> >>>> > > normalized radians should be more efficient as it would
avoid a
> >>>> > > good
> >>>> > > deal of math to be executed on each Document in the
index.
> >>>> > >
> >>>> > > Also if we assume people might want to use radians in
their user
> >>>> > > data
> >>>> > > (I know some who definitely would never touch decimals for
such a
> >>>> > > use
> >>>> > > case), there would be no need at all to convert the end
result.
> >>>> > >
> >>>> > > Some more thoughts inline:
> >>>> > >
> >>>> > > On 3 May 2012 09:12, Nicolas Helleringer
> >>>> > > <nicolas.helleringer(a)gmail.com> wrote:
> >>>> > >> Hi all,
> >>>> > >>
> >>>> > >> Sanne and I have been wondering about the way the
spatial
> >>>> > >> branch/module/functionality for Hibernate Search shall
store its
> >>>> > >> coordinates in the Lucene index.
> >>>> > >>
> >>>> > >> Today it is implemented with decimal degree for :
> >>>> > >> - easy debugging/readability
> >>>> > >> - ease of conversion on storage as we want to accept
mainly
> >>>> > >> decimal
> >>>> > >> degree
> >>>> > >> from users data
> >>>> > >
> >>>> > > Valid points, but consider that "storage" is
going to be way
> >>>> > > slower
> >>>> > > anyway, and typically you'll process a Document to
evaluate it
> >>>> > > for a
> >>>> > > hit many many orders of magnitude more frequently than the
times
> >>>> > > you
> >>>> > > store it.
> >>>> > >
> >>>> > >>
> >>>> > >> Sanne pointed out that when the search is done there
is quite a
> >>>> > >> few
> >>>> > >> conversion to radians for distance calculation and
suggested
> >>>> > >> that
> >>>> > >> we may
> >>>> > >> store directly coordinates under their radians form.
> >>>> > >>
> >>>> > >> I have tried a patch to implement this and as I was
coding it I
> >>>> > >> feel that
> >>>> > >> the code was less readable, in the coordinates
normalisation
> >>>> > >> mainly
> >>>> > >> and
> >>>> > >> that there was as many conversion as before.
> >>>> > >> Conversions had moved from search to import / export
of
> >>>> > >> coordinates
> >>>> > >> in and
> >>>> > >> out the spatial module scope to user scope.
> >>>> > >
> >>>> > > I'm sure the amount of points in the code in which
they are
> >>>> > > converted
> >>>> > > won't change. I'm concerned about the cardinality
of the
> >>>> > > collections
> >>>> > > on which it's applied ;)
> >>>> > > "Less readable" isn't nice, but we can work
on that I guess?
> >>>> > >
> >>>> > >>
> >>>> > >> What the docs does not tell (yet), is that we are
waiting for
> >>>> > >> WGS
> >>>> > >> 84 (this
> >>>> > >> is a coordinate system) decimal degree coordinates
input, as
> >>>> > >> these
> >>>> > >> are
> >>>> > >> quite a de facto standard (GPS output this way).
> >>>> > >
> >>>> > > How does it affect this?
> >>>> > >
> >>>> > >>
> >>>> > >> Today this is not the purpose of Hibernate Search
spatial
> >>>> > >> initiative to
> >>>> > >> handle projections. There are opensource libs to
handle that on
> >>>> > >> user side
> >>>> > >> very well (Proj4j)
> >>>> > >>
> >>>> > >> So. The question is : shall we store as radians or
decimal
> >>>> > >> degree ?
> >>>> > >>
> >>>> > >> Niko
> >>>> > >>
> >>>> > >> P.S : Hope it is clear. If not ask for more.
> >>>> > >
> >>>> > > Thanks!
> >>>> > > Sanne
> >>>> > > _______________________________________________
> >>>> > > hibernate-dev mailing list
> >>>> > > hibernate-dev(a)lists.jboss.org
> >>>> > >
https://lists.jboss.org/mailman/listinfo/hibernate-dev
> >>>> >
> >>>
> >>>
> >>>
> >>
> >