Hi Nicolas, thanks, looking better.
Could you now change it for longer runs? If I think of 2K invocations,
each taking 5ms, that's not more than 10 seconds. It took me several
minutes to load all the data needed for the test, then just a few
seconds to run the tests: not worth the load time ;)
To account for aspects such as performance loss due to garbage generation
you'd need a steady run of at least 45 minutes, and then look at the
average over the second half of the test run (and not call System.out
every 4 milliseconds).
On a side note, what do you need System.exit(0); for? You should
close the SessionFactory.
Cheers,
Sanne
On 15 May 2012 14:04, Nicolas Helleringer <nicolas.helleringer(a)gmail.com> wrote:
I have seeded the random generator.
Here are some results:
Degrees 2K calls
Mean time with Grid : 4.769457488717949 ms. Average number of docs fetched : 2524.982564102564
Mean time with Grid + Distance filter : 6.501712946153845 ms. Average number of docs fetched : 426.1876923076923
Mean time with DoubleRange : 14.336663392307692 ms. Average number of docs fetched : 543.6035897435897
Mean time with DoubleRange + Distance filter : 19.7123163574359 ms. Average number of docs fetched : 426.1876923076923
Radians 2K calls
Mean time with Grid : 4.430686068205128 ms. Average number of docs fetched : 2524.982564102564
Mean time with Grid + Distance filter : 6.717519717948718 ms. Average number of docs fetched : 426.1876923076923
Mean time with DoubleRange : 14.35186034 ms. Average number of docs fetched : 543.6035897435897
Mean time with DoubleRange + Distance filter : 20.073972284102563 ms. Average number of docs fetched : 426.1876923076923
Radians 50k calls
Mean time with Grid : 4.440979528643216 ms. Average number of docs fetched : 2459.169386934673
Mean time with Grid + Distance filter : 6.722681398331658 ms. Average number of docs fetched : 416.2335879396985
Mean time with DoubleRange : 14.532376860201005 ms. Average number of docs fetched : 530.2923618090452
Mean time with DoubleRange + Distance filter : 20.21980649284422 ms. Average number of docs fetched : 416.2335879396985
On the random part, you can see from the average number of docs on
the 2k calls that the seed did its job: the requests are the same.
As you can see there is not much difference between the 2k and 50k call runs.
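For illustration, a fixed seed makes java.util.Random produce the same sequence on every run, so both branches can be queried at identical points. The sketch below is not the benchmark's code; the class name SeededQueries, the method randomPoints and the whole-globe coordinate ranges are assumptions.

```java
import java.util.Random;

final class SeededQueries {

    /** Generates n (latitude, longitude) pairs in decimal degrees from a fixed seed. */
    static double[][] randomPoints(long seed, int n) {
        Random random = new Random(seed); // same seed, same sequence of values
        double[][] points = new double[n][2];
        for (int i = 0; i < n; i++) {
            points[i][0] = -90.0 + 180.0 * random.nextDouble();  // latitude in [-90, 90)
            points[i][1] = -180.0 + 360.0 * random.nextDouble(); // longitude in [-180, 180)
        }
        return points;
    }
}
```

Calling randomPoints twice with the same seed yields identical arrays, which is consistent with the identical "docs fetched" averages across the Degrees and Radians 2k runs.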
What I have also investigated is the overhead of the distance filter on the
double range approach. I fear that wrapping the lat,long range query
in a QueryWrapperFilter is costly, but I cannot prove it yet.
Back to the main question : does radian storage give better performance ? I
cannot say with my test env. It seems pretty close to me.
Maybe someone could run the bench in a different environment.
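To make the degrees-versus-radians question concrete, here is a sketch of a haversine great-circle distance in Java. It is an illustration only, not the code of Point.getDistanceTo (which is not shown in this thread); the class name GreatCircle, the method names and the 6371 km mean Earth radius are assumptions. With degree inputs the only extra work per evaluation is the four Math.toRadians calls; the trigonometric functions are needed either way.

```java
final class GreatCircle {

    static final double EARTH_MEAN_RADIUS_KM = 6371.0; // assumed mean radius

    /** Haversine distance in km for coordinates already expressed in radians. */
    static double distanceKmRadians(double lat1, double lon1, double lat2, double lon2) {
        double sinHalfDLat = Math.sin((lat2 - lat1) / 2);
        double sinHalfDLon = Math.sin((lon2 - lon1) / 2);
        double a = sinHalfDLat * sinHalfDLat
                + Math.cos(lat1) * Math.cos(lat2) * sinHalfDLon * sinHalfDLon;
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_MEAN_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Same distance for decimal-degree inputs: four conversions, then the radian path. */
    static double distanceKmDegrees(double lat1, double lon1, double lat2, double lon2) {
        return distanceKmRadians(
                Math.toRadians(lat1), Math.toRadians(lon1),
                Math.toRadians(lat2), Math.toRadians(lon2));
    }
}
```

If the real distance code has a similar shape, the conversions would be a small part of each evaluation, which might be one reason the two branches measure so close; the numbers above cannot confirm that on their own.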
Niko
PS : both branches are up to date on my GitHub:
https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923 and
https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
2012/5/14 Nicolas Helleringer <nicolas.helleringer(a)gmail.com>
>>
>> maybe even simpler: set a constant as the seed of your random
>> generator; it should provide a reproducible sequence of values.
>
> /facepalm
> I should have guessed that :s
>
> Niko
>
>>
>> >>
>> >> On 11 May 2012 08:40, Nicolas Helleringer <nicolas.helleringer(a)gmail.com> wrote:
>> >> > There and back again ...
>> >> >
>> >> > After fixing a bug in grid search here are some updated results on 2k calls
>> >> >
>> >> > Degrees :
>> >> > Mean time with Grid : 4.4897266425641025 ms. Average number of docs fetched : 2506.96
>> >> > Mean time with Grid + Distance filter : 6.4930799487179485 ms. Average number of docs fetched : 425.33435897435896
>> >> > Mean time with DoubleRange : 14.430638703076923 ms. Average number of docs fetched : 542.0410256410256
>> >> > Mean time with DoubleRange + Distance filter : 20.483300545128206 ms. Average number of docs fetched : 425.33435897435896
>> >> >
>> >> > Radians :
>> >> > Mean time with Grid : 5.650845744102564 ms. Average number of docs fetched : 5074.830769230769
>> >> > Mean time with Grid + Distance filter : 8.627138825128204 ms. Average number of docs fetched : 426.7902564102564
>> >> > Mean time with DoubleRange : 15.337755502564102 ms. Average number of docs fetched : 1087.705641025641
>> >> > Mean time with DoubleRange + Distance filter : 20.82852138769231 ms. Average number of docs fetched : 426.7902564102564
>> >> >
>> >> > The next thing I cannot explain yet is the distance filter overhead
>> >> > mismatch: it is smaller on grid search, which has more docs to test,
>> >> > than on DoubleRange.
>> >> >
>> >> > Niko
>> >> >
>> >> >
>> >> > 2012/5/7 Nicolas Helleringer <nicolas.helleringer(a)gmail.com>
>> >> >>
>> >> >> Here are some results :
>> >> >>
>> >> >> Mean time with Grid : 4.9297471630769225 ms. Average number of docs fetched : 2416.373846153846
>> >> >> Mean time with Grid + Distance filter : 6.48634534 ms. Average number of docs fetched : 425.84
>> >> >> Mean time with DoubleRange : 15.39593650051282 ms. Average number of docs fetched : 542.72
>> >> >> Mean time with DoubleRange + Distance filter : 21.158394677435897 ms. Average number of docs fetched : 425.8779487179487
>> >> >>
>> >> >> It seems weird that with the distance filter the two results are not
>> >> >> the same.
>> >> >> I shall investigate that.
>> >> >>
>> >> >> Niko
>> >> >>
>> >> >> 2012/5/7 Emmanuel Bernard <emmanuel(a)hibernate.org>
>> >> >>>
>> >> >>> Do you know the average number of POIs that were filtered in memory
>> >> >>> by the DistanceFilter during these runs?
>> >> >>>
>> >> >>> Emmanuel
>> >> >>>
>> >> >>> On 7 May 2012, at 10:31, Nicolas Helleringer wrote:
>> >> >>>
>> >> >>> Hi all,
>> >> >>>
>> >> >>> I have done a radian patch/branch and some benchmarks on the French
>> >> >>> GeoNames database.
>> >> >>>
>> >> >>> Benchmarks are on 2k calls each run.
>> >> >>>
>> >> >>> Radians:
>> >> >>> run 1
>> >> >>> Mean time with Grid : 4.808043092820513 ms
>> >> >>> Mean time with Grid + Distance filter : 6.571108878461538 ms
>> >> >>> Mean time with DoubleRange : 14.62661525128205 ms
>> >> >>> Mean time with DoubleRange + Distance filter : 20.143597923076925 ms
>> >> >>>
>> >> >>> run 2
>> >> >>> Mean time with Grid : 5.290368523076923 ms
>> >> >>> Mean time with Grid + Distance filter : 6.706567517435897 ms
>> >> >>> Mean time with DoubleRange : 14.878960702564102 ms
>> >> >>> Mean time with DoubleRange + Distance filter : 20.75806591948718 ms
>> >> >>>
>> >> >>> Degrees:
>> >> >>> run 1
>> >> >>> Mean time with Grid : 5.101956610769231 ms
>> >> >>> Mean time with Grid + Distance filter : 6.548685109230769 ms
>> >> >>> Mean time with DoubleRange : 14.767478146153845 ms
>> >> >>> Mean time with DoubleRange + Distance filter : 20.668063972820512 ms
>> >> >>>
>> >> >>> run 2
>> >> >>> Mean time with Grid : 4.683360031282051 ms
>> >> >>> Mean time with Grid + Distance filter : 6.7065247435897435 ms
>> >> >>> Mean time with DoubleRange : 14.617140157948716 ms
>> >> >>> Mean time with DoubleRange + Distance filter : 20.074868595897435 ms
>> >> >>>
>> >> >>> The radian branch is here for review:
>> >> >>> https://github.com/nicolashelleringer/hibernate-search/tree/HSEARCH-923-RADIANS
>> >> >>>
>> >> >>> While moving from degrees to radians I have seen that the DSL still
>> >> >>> needs some work.
>> >> >>> I shall focus on that now.
>> >> >>>
>> >> >>> Niko
>> >> >>>
>> >> >>> 2012/5/3 Sanne Grinovero <sanne(a)hibernate.org>
>> >> >>>>
>> >> >>>>
>> >> >>>> On May 3, 2012 10:10 AM, "Emmanuel Bernard" <emmanuel(a)hibernate.org> wrote:
>> >> >>>> >
>> >> >>>> > How come the DistanceFilter has to compute the distance for the
>> >> >>>> > whole corpus?
>> >> >>>>
>> >> >>>> You're right that it's not always the case, but it's possible. If
>> >> >>>> there are more filters enabled and they are executed first, our
>> >> >>>> filter will need to do the math only on the documents matched by
>> >> >>>> the previous filters, but if there are no other constraints or
>> >> >>>> filters our DistanceFilter might need to process all documents in
>> >> >>>> all segments. This also happens when a limit is enabled on the
>> >> >>>> collector - although limited to the current index segment - when
>> >> >>>> the filter needs to be cached, as it needs to evaluate each
>> >> >>>> document in the segment.
>> >> >>>>
>> >> >>>> In our case this DistanceFilter is only applied after RangeQuery
>> >> >>>> was applied on both longitude and latitude, so I'm not sure if
>> >> >>>> this is a big problem; personally I was just wondering but I'd be
>> >> >>>> fine with keeping this as a possible future improvement - but if
>> >> >>>> we go for a separate issue, let's keep in mind that the index
>> >> >>>> format would not be backwards compatible.
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> > By the way the actual storage (say via Hibernate ORM, or
>> >> >>>> > Infinispan) does not need to store in radian, so we don't need
>> >> >>>> > to do a conversion when reading an entity.
>> >> >>>>
>> >> >>>> Right, another reason to index only in whatever format makes
>> >> >>>> querying more efficient.
>> >> >>>>
>> >> >>>> -- Sanne
>> >> >>>>
>> >> >>>>
>> >> >>>> >
>> >> >>>> > On 3 May 2012, at 10:45, Sanne Grinovero wrote:
>> >> >>>> >
>> >> >>>> > > The reason for my comment is that the code is doing a
>> >> >>>> > > conversion to radians in the DistanceFilter, which needs to be
>> >> >>>> > > extremely efficient as it's not only applied on the resultset
>> >> >>>> > > but potentially on the whole corpus of all Documents in the
>> >> >>>> > > index.
>> >> >>>> > > So even if it's true that conversion would be needed on the
>> >> >>>> > > final results, we always expect people to retrieve only a
>> >> >>>> > > limited amount of entities (like with pagination), while the
>> >> >>>> > > index might need to perform this computation millions of times
>> >> >>>> > > per query.
>> >> >>>> > >
>> >> >>>> > > If I look at the complexity of Point.getDistanceTo(double,
>> >> >>>> > > double), I get a feeling that the method will hardly provide
>> >> >>>> > > speedy queries because of the complex computations in it. This
>> >> >>>> > > is just speculation at this point of course; to be sure we'd
>> >> >>>> > > need to compare them with a large enough dataset, but it seems
>> >> >>>> > > quite obvious that storing normalized radians should be more
>> >> >>>> > > efficient as it would avoid a good deal of math to be executed
>> >> >>>> > > on each Document in the index.
>> >> >>>> > >
>> >> >>>> > > Also if we assume people might want to use radians in their
>> >> >>>> > > user data (I know some who definitely would never touch
>> >> >>>> > > decimals for such a use case), there would be no need at all to
>> >> >>>> > > convert the end result.
>> >> >>>> > >
>> >> >>>> > > Some more thoughts inline:
>> >> >>>> > >
>> >> >>>> > > On 3 May 2012 09:12, Nicolas Helleringer <nicolas.helleringer(a)gmail.com> wrote:
>> >> >>>> > >> Hi all,
>> >> >>>> > >>
>> >> >>>> > >> Sanne and I have been wondering about the way the spatial
>> >> >>>> > >> branch/module/functionality for Hibernate Search shall store
>> >> >>>> > >> its coordinates in the Lucene index.
>> >> >>>> > >>
>> >> >>>> > >> Today it is implemented with decimal degrees for :
>> >> >>>> > >> - easy debugging/readability
>> >> >>>> > >> - ease of conversion on storage, as we want to accept mainly
>> >> >>>> > >> decimal degrees from users' data
>> >> >>>> > >
>> >> >>>> > > Valid points, but consider that "storage" is going to be way
>> >> >>>> > > slower anyway, and typically you'll process a Document to
>> >> >>>> > > evaluate it for a hit many many orders of magnitude more
>> >> >>>> > > frequently than the times you store it.
>> >> >>>> > >
>> >> >>>> > >>
>> >> >>>> > >> Sanne pointed out that when the search is done there are
>> >> >>>> > >> quite a few conversions to radians for distance calculation
>> >> >>>> > >> and suggested that we might store the coordinates directly in
>> >> >>>> > >> radians.
>> >> >>>> > >>
>> >> >>>> > >> I have tried a patch to implement this, and as I was coding
>> >> >>>> > >> it I felt that the code was less readable, mainly in the
>> >> >>>> > >> coordinates normalisation, and that there were as many
>> >> >>>> > >> conversions as before.
>> >> >>>> > >> Conversions had moved from search to the import / export of
>> >> >>>> > >> coordinates between the spatial module scope and user scope.
>> >> >>>> > >
>> >> >>>> > > I'm sure the number of points in the code in which they are
>> >> >>>> > > converted won't change. I'm concerned about the cardinality of
>> >> >>>> > > the collections on which it's applied ;)
>> >> >>>> > > "Less readable" isn't nice, but we can work on that I guess?
>> >> >>>> > >
>> >> >>>> > >>
>> >> >>>> > >> What the docs do not say (yet) is that we expect WGS 84
>> >> >>>> > >> (this is a coordinate system) decimal degree coordinates as
>> >> >>>> > >> input, as these are quite a de facto standard (GPS outputs
>> >> >>>> > >> them this way).
>> >> >>>> > >
>> >> >>>> > > How does it affect this?
>> >> >>>> > >
>> >> >>>> > >>
>> >> >>>> > >> Today it is not the purpose of the Hibernate Search spatial
>> >> >>>> > >> initiative to handle projections. There are open source libs
>> >> >>>> > >> that handle that very well on the user side (Proj4j).
>> >> >>>> > >>
>> >> >>>> > >> So. The question is : shall we store as radians or decimal
>> >> >>>> > >> degrees ?
>> >> >>>> > >>
>> >> >>>> > >> Niko
>> >> >>>> > >>
>> >> >>>> > >> P.S : Hope it is clear. If not, ask for more.
>> >> >>>> > >
>> >> >>>> > > Thanks!
>> >> >>>> > > Sanne
>> >> >>>> > >
>> >> >>>> > > _______________________________________________
>> >> >>>> > > hibernate-dev mailing list
>> >> >>>> > > hibernate-dev(a)lists.jboss.org
>> >> >>>> > > https://lists.jboss.org/mailman/listinfo/hibernate-dev
>> >> >>>> >