[HSearch] Faceting feedback


Hardy Ferentschik

Monday, 28 February 2011

9:13 a.m.

Hi,

I thought it would be great to get some feedback on my faceting work. You can see the latest on my Search fork - https://github.com/hferentschik/hibernate-search/commits/HSEARCH-667

Technically I decided to use a simple custom Collector. I abandoned the idea of using bobo browse, since it did not really seem to fit our architecture and I am not sure how well maintained the code is. Within the custom Collector I am using Lucene's FieldCache to cache and collect the count values during faceting (as a reminder, faceting means, for example, that I am searching for all cars of a certain make, but then also want to group the matching cars by their cubic capacity. See also - http://en.wikipedia.org/wiki/Faceted_search). Using the FieldCache is quite memory costly, but there are other ways to implement the faceting itself.

At the moment I am mostly interested in feedback on the public API. The public classes can be found in the package org.hibernate.search.query.facet - https://github.com/hferentschik/hibernate-search/tree/3a9877e2bbc47a8bd6e...

The idea is to write a fulltext query as usual and then add/enable a facet:

    FacetRequest request = new SimpleFacetRequest( indexFieldName, FacetSortOrder.COUNT_DESC, false );
    TermQuery term = new TermQuery( new Term( "make", "honda" ) );
    FullTextQuery query = fullTextSession.createFullTextQuery( term, Car.class );
    query.enableQueryFacet( "foo", request );

Then you run the query. This enables the facet collector, and after the query has executed you get access to a map from facet name to FacetResult. Each FacetResult contains a list of Facets which contain the actual field values and counts:

    Map<String, FacetResult> results = query.getFacetResults();
    FacetResult facetResult = results.get( "foo" );
    List<Facet> facetList = facetResult.getFacets();
    assertEquals( "Wrong facet count for facet ", 100, facetList.get( 0 ).getCount() );

More actual tests can be found here - https://github.com/hferentschik/hibernate-search/tree/3a9877e2bbc47a8bd6e...

At the moment you are able to facet on simple (string) based values or on number ranges (e.g. price ranges 0 - 100, ...). For that I have created subclasses of FacetRequest - SimpleFacetRequest and RangeFacetRequest (a DateRangeFacetRequest might be interesting as well).

Some concrete questions:

* Atm, I am only exposing a programmatic API for creating FacetRequests. I guess we want to have annotations for this as well, right? Would we keep the programmatic configuration as a public API?
* I made the FacetRequest classes immutable atm, but this way I have a multitude of constructors catering for a whole range of parameters (sort order, include zero counts, ...). Any opinions on immutable objects vs objects with setters for configuring options after creation?

Of course I am interested in any other feedback as well.

--Hardy
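For readers who want to see what a discrete (value) facet computes, here is a minimal sketch that does not use Lucene or Hibernate Search at all. It assumes the field values of the matching documents are already available as a plain list; the class name `DiscreteFacetSketch` and the method `countDesc` are hypothetical and only mirror the idea behind `FacetSortOrder.COUNT_DESC`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the patch discussed in this thread.
final class DiscreteFacetSketch {

    // Counts each distinct field value among the query hits and returns the
    // entries ordered by descending count, ties broken by value.
    static List<Map.Entry<String, Integer>> countDesc(List<String> fieldValuesOfHits) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : fieldValuesOfHits) {
            counts.merge(value, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> facets = new ArrayList<>(counts.entrySet());
        facets.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return facets;
    }

    public static void main(String[] args) {
        // Cubic capacities of the cars matched by a query such as make:honda.
        List<String> hits = List.of("2407", "3398", "2407", "2407", "3398", "2997");
        for (Map.Entry<String, Integer> facet : countDesc(hits)) {
            System.out.println(facet.getKey() + " -> " + facet.getValue());
        }
    }
}
```

The real implementation has to do this counting inside a Lucene Collector while the query runs, which is where the FieldCache discussion in the replies comes in.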


Sanne Grinovero

Monday, 28 February

12:56 p.m.

Hi Hardy,

great job! I've been reading this focusing on the tests and the Collectors implementation. If you want me to have a look at all the patches, could you please squash some commits :)

Comments:

1) Why is it named SimpleFacetRequest? Are you planning a more advanced one?

2) I couldn't find a test doing faceting on more than one field, yet it looks like the Collector and FieldCacheContainer are prepared to deal with that. You know the collector is a very hot area for the performance of a Query, especially "void collect(int doc)". While working on the FieldCache support I also started with a similar Map<Field, containers>, but then I realized that the number of fields I'd work on is definitely limited, so I removed all Map lookups (and all puts and foreach's) to support only a single field per collector instance, so you can specialize the single instance and eventually chain them up when more than one field is being requested. A side effect is that the code turns out a bit simpler, and you can save some memory on "per field" containers as they can sometimes be stored in simpler types; I had some nested maps converted into arrays. Maybe that's not your case but it might be worth trying.

In case you support faceting on multiple facets/fields, how are you going to specify the sort order? I'm not suggesting you should, I'd rather remove support for it.

Some more comments inline:

2011/2/28 Hardy Ferentschik <hibernate(a)ferentschik.de>:

...

Seems reasonable, same thing I was thinking when you talked about it, but I trust you had more insight. Is there any non-trivial faceting that we would miss this way? The examples I know of are quite simple to implement; generally speaking it seems to be more an API and integration issue so we likely have our own itches to scratch anyway.

...

Within the custom Collector I am using Lucene's FieldCache to cache and collect the count values during faceting (as a reminder, faceting means, for example, that I am searching for all cars of a certain make, but then also want to group the matching cars by their cubic capacity. See also - http://en.wikipedia.org/wiki/Faceted_search). Using the FieldCache is quite memory costly, but there are other ways to implement the faceting itself.

Are you relying on fieldCache only, or do you have an alternative way too? I'm just wondering if alternatives are possible, as fieldCaches seem very expensive memory-wise.

...

At the moment I am mostly interested in the feedback around the public API. The public classes can be found in the package org.hibernate.search.query.facet - https://github.com/hferentschik/hibernate-search/tree/3a9877e2bbc47a8bd6e...

The idea is to write a fulltext query as usual and then add/enable a facet:

    FacetRequest request = new SimpleFacetRequest( indexFieldName, FacetSortOrder.COUNT_DESC, false );
    TermQuery term = new TermQuery( new Term( "make", "honda" ) );
    FullTextQuery query = fullTextSession.createFullTextQuery( term, Car.class );
    query.enableQueryFacet( "foo", request );

Wouldn't it make sense to have a fullTextSession.createFacetingQuery( LuceneQuery, facetName, targetTypes ); ? At least people won't need to cast the return type to your special container.

...

Then you run the query. This enables the facet collector, and after the query has executed you get access to a map from facet name to FacetResult. Each FacetResult contains a list of Facets which contain the actual field values and counts:

    Map<String, FacetResult> results = query.getFacetResults();
    FacetResult facetResult = results.get( "foo" );
    List<Facet> facetList = facetResult.getFacets();
    assertEquals( "Wrong facet count for facet ", 100, facetList.get( 0 ).getCount() );

Looks great. Would it be possible (in future maybe) to return managed entities / a selected projection instead of only their frequency? I don't think we should do that for the next Alpha, but it would be great to have lazily loaded proxies of the contents of each facet. For usability, I'd expect people to "expand" a facet to show the actual results, which are likely going to be the usual entities, or some projection.

...

More actual tests can be found here - https://github.com/hferentschik/hibernate-search/tree/3a9877e2bbc47a8bd6e... At the moment you are able to facet on simple (string) based values or on number ranges (eg price ranges 0 - 100, ...). For that I have created subclasses of FacetRequest - SimpleFacetRequest and RangeFacetRequest (a DateRangeFacetRequest might be interesting as well)

Nice. Do you think we need to bother the user with the field type? Don't you have it on the FieldBridge? I'm wondering, as I'm stuck at this point with my own patches :)

...

Some concrete questions: * Atm, I am only exposing a programmatic API for creating FacetRequests. I guess we want to have annotations for this as well, right?

I wasn't expecting that. Do you have an example of what you were thinking?

...

Would we keep the programmatic configuration as a public API? * I made the FacetRequest classes immutable atm, but this way I have a multitude of constructors catering for a whole range of parameters (sort order, include zero counts, ...). Any opinions around immutable objects vs objects with setters for configuring options after creation.

I usually favour immutable objects only when creating them is (very) expensive and so you'd like to reuse them in other threads having different parameters. Were you thinking of a FacetRequest as something "define once, run multiple times", like a named query? I'd give priority to the performance of the Collector, if you have to choose.

...

Of course I am interested in any other feedback as well. --Hardy

Sanne

Hardy Ferentschik

2:33 p.m.

Thanks for the feedback Sanne!

On Mon, 28 Feb 2011 19:56:59 +0100, Sanne Grinovero <sanne(a)hibernate.org> wrote:

...

I've been reading this focusing on the tests and the Collectors implementation.

That's the most interesting part. There are a few unrelated cleanup and testing related commits which as you say I should probably squash. I probably should go through the commits tomorrow and sort things out a little. Maybe we could even pull in this initial cut to have a common ground to work from again.

...

Comments: 1) why is it named SimpleFacetRequest, are you planning for a more advanced one?

SimpleFacetRequest is probably a bad name. Maybe ValueFacetRequest would be better. It basically just asks for the different field values and their counts for the faceted field. In contrast, RangeFacetRequest groups counts into numeric ranges. As mentioned, another subclass could be DateRangeFacetRequest.
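To make the contrast with value faceting concrete, here is a minimal sketch of range faceting on plain integers. It is an illustration under assumptions: the class `RangeFacetSketch`, the half-open interval convention, and the treatment of the last bound as an open-ended range are choices of this sketch, not taken from the RangeFacetRequest in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the patch discussed in this thread.
final class RangeFacetSketch {

    // Counts values into half-open ranges [bounds[i], bounds[i+1]) plus a final
    // open-ended range starting at the last bound. The bounds must be sorted
    // ascending. Values below bounds[0] are ignored.
    static Map<String, Integer> count(int[] bounds, int[] values) {
        int[] counts = new int[bounds.length];
        for (int value : values) {
            // Walk from the highest bound down; the first bound <= value wins.
            for (int i = bounds.length - 1; i >= 0; i--) {
                if (value >= bounds[i]) {
                    counts[i]++;
                    break;
                }
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < bounds.length; i++) {
            String label = i + 1 < bounds.length
                    ? "[" + bounds[i] + ", " + bounds[i + 1] + ")"
                    : "[" + bounds[i] + ", +inf)";
            result.put(label, counts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] bounds = {0, 1000, 1500};
        int[] prices = {250, 999, 1000, 1200, 1500, 4000};
        System.out.println(count(bounds, prices));
    }
}
```

A date range facet would follow the same pattern with timestamps as the bounds.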

...

2) I couldn't find a test doing a faceting on more than one field, still it looks like the Collector and FieldCacheContainer are prepared to deal with that.

It depends what you mean. I think there should be a test with multiple independent facets, e.g. facet on engine size and color in the car example. There is also the pivoting use case, where you relate multiple fields to each other. For example you facet on engine size and within that you facet further on color. I haven't implemented this use case yet. Not sure if we should add this feature right away.

...

while working on the FieldCache support I also started with similar Map<Field, containers>, but then I realized that the number of fields I'd work on is definitely limited, so I removed all Map lookups (and all puts and foreach's) to support only a single field per collector instance, so you can specialize the single instance and eventually chain them up when more than a field is being requested.

Hmm, that's an interesting idea. Instead of using maps you basically have one collector per facet request. It might even clean up some of the code. Do you have any experience of what this really gains at runtime? I will definitely explore your idea.
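The "one collector per facet request" idea can be sketched as follows. This is not Lucene's Collector API: `DocCollector`, `SingleFieldFacetCollector` and `ChainedCollector` are hypothetical names, and the per-document ordinal array is only a stand-in for what a field cache would provide. The point is that the hot `collect(int doc)` path does one array read and one increment, with no map lookup.

```java
import java.util.List;

// Hypothetical illustration only; the interface and classes below are not Lucene types.
interface DocCollector {
    void collect(int doc);
}

// Counts facet values for exactly one field.
final class SingleFieldFacetCollector implements DocCollector {
    private final String[] terms;      // distinct values of the field
    private final int[] docToOrdinal;  // for each doc id, index into terms
    private final int[] counts;        // one counter per distinct value

    SingleFieldFacetCollector(String[] terms, int[] docToOrdinal) {
        this.terms = terms;
        this.docToOrdinal = docToOrdinal;
        this.counts = new int[terms.length];
    }

    @Override
    public void collect(int doc) {
        counts[docToOrdinal[doc]]++;
    }

    int countFor(String term) {
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equals(term)) {
                return counts[i];
            }
        }
        return 0;
    }
}

// Forwards every hit to each single-field collector in turn.
final class ChainedCollector implements DocCollector {
    private final List<DocCollector> delegates;

    ChainedCollector(List<DocCollector> delegates) {
        this.delegates = delegates;
    }

    @Override
    public void collect(int doc) {
        for (DocCollector delegate : delegates) {
            delegate.collect(doc);
        }
    }
}

final class CollectorDemo {
    public static void main(String[] args) {
        SingleFieldFacetCollector color = new SingleFieldFacetCollector(
                new String[] {"blue", "red"}, new int[] {0, 1, 1, 0, 1});
        SingleFieldFacetCollector engine = new SingleFieldFacetCollector(
                new String[] {"1600", "2000"}, new int[] {0, 0, 1, 1, 1});
        DocCollector chain = new ChainedCollector(List.of(color, engine));
        for (int doc : new int[] {1, 2, 4}) { // doc ids matched by the query
            chain.collect(doc);
        }
        System.out.println("red=" + color.countFor("red") + ", 2000=" + engine.countFor("2000"));
    }
}
```

Whether this is measurably faster than a map of containers would need benchmarking; the sketch only shows the shape of the design.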

...

In case you support faceting on multiple facets/fields, how are you going to specify the sort order? I'm not suggesting you should, I'd rather remove support for it.

Not sure whether you are talking about pivoting here. If you are using multiple independent facet requests, each facet result is ordered by itself. I think having a sort order is important; at least you should be able to sort either by count or by value.

...

Are you relying on fieldCache only, or do you have an alternative way too? I'm just wondering if alternatives are possible, as fieldCaches seem very expensive memory-wise.

Yes, the current implementation is based on field caches. As you say, the main concern is memory in this case. There is an alternative I haven't explored yet, working with TermEnum and doc set intersections.
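A rough sketch of the doc set intersection alternative, under loud assumptions: it uses `java.util.BitSet` and an in-memory map in place of Lucene's term enumeration and posting lists, and `TermIntersectionFacetSketch` is a hypothetical name. For each term of the faceted field, the set of documents containing that term is intersected with the set of query hits, and the size of the intersection is the facet count.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; an in-memory map stands in for the index.
final class TermIntersectionFacetSketch {

    // Intersects each term's document set with the query hits and records
    // the cardinality of the intersection as the facet count.
    static Map<String, Integer> count(Map<String, BitSet> docsByTerm, BitSet queryHits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : docsByTerm.entrySet()) {
            BitSet intersection = (BitSet) entry.getValue().clone(); // keep the index data untouched
            intersection.and(queryHits);
            counts.put(entry.getKey(), intersection.cardinality());
        }
        return counts;
    }

    // Convenience helper to build a BitSet from doc ids.
    static BitSet bits(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    public static void main(String[] args) {
        Map<String, BitSet> docsByColor = new LinkedHashMap<>();
        docsByColor.put("red", bits(0, 2, 3));
        docsByColor.put("blue", bits(1, 4));
        BitSet hits = bits(1, 2, 3); // documents matched by the main query
        System.out.println(count(docsByColor, hits));
    }
}
```

This trades the per-document memory of a field cache for work proportional to the number of distinct terms, so it tends to suit fields with few distinct values.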

...

Wouldn't it make sense to have a fullTextSession.createFacetingQuery( LuceneQuery, facetName, targetTypes ); ? At least people won't need to cast the return type to your special container.

Not sure if I follow here. Faceting is for me more like filtering. You enable and disable facets before running the query. You still need the main result list, but also a way to access the facet results.

...

> Map<String, FacetResult> results = query.getFacetResults();
> FacetResult facetResult = results.get( "foo" );
> List<Facet> facetList = facetResult.getFacets();
> assertEquals( "Wrong facet count for facet ", 100, facetList.get( 0 ).getCount() );

Looks great. Would it be possible (in future maybe) to return managed entities / selected projection instead of their frequency only?

That's an awesome idea. This feature could set our faceting apart from the purely count-based approach.

...

I don't think we should do that for the next Alpha, but it would be great to have lazily loaded proxies of the contents of each facet. For usability, I'd expect people to "expand" a facet to show the actual results, which are likely going to be the usual entities, or some projection.

One thing I want to add is the ability to pass the current query into a Facet and get back a BooleanQuery which combines the two.

...

Nice. Do you think we need to bother the user with the field type? Don't you have it on the FieldBridge? I'm wondering, as I'm stuck at this point with my own patches :)

Right. It's on my todo list. In fact I am missing at the moment the meta data api we have been talking about to "reflect" on the field configuration. I need to check what is possible atm.

...

> Would we keep the programmatic configuration as a public API?
> * I made the FacetRequest classes immutable atm, but this way I have a multitude of constructors catering for a whole range of parameters (sort order, include zero counts, ...). Any opinions around immutable objects vs objects with setters for configuring options after creation.

I usually favour immutable objects only when creating them is (very) expensive and so you'd like to reuse them in other threads having different parameters. Were you thinking of a FacetRequest as something "define once, run multiple times", like a named query?

Right. I think I lean towards mutable request objects in this case. I wanted to see where things are going with this immutable request object, but I think it is not worth it in this case.

...

I'd give priority to the performance of the Collector, if you have to choose.

Sure.

Thanks again,
--Hardy

Emmanuel Bernard

Tuesday, 1 March

7:39 a.m.

Hi Hardy,

Nice job. Here are a few comments in random order:

To avoid the problem of constructor multiplication and still use immutable objects, use a builder to collect the information and create the object out of it. The constructor can even be package private.

My initial reaction was that facets would be an awesome declarative feature, like analyzers, at least for simple use cases (annotations + programmatic mapping API). So +1 for the feature. It seems though that some people would want a very dynamic way to configure their facets, so keeping a programmatic API seems to make sense as well. I am not an expert in faceting so feel free to correct me. If we are not sure, let's start with the pure declarative approach and expand to the programmatic API later.

In tests, we should try and use the Query DSL instead of raw Lucene queries: we are trying to convert people to this nicer approach and it shows us how good or bad the rest of the API is when integrated.

To be consistent, SimpleFacetRequest should be named DiscreteFacetRequest to be symmetric with RangeFacetRequest.

Sanne was mentioning the idea of lazy results. Be careful if this laziness means keeping some resources open; we usually don't do that (except when using scroll).

An integration with the QueryDSL would be awesome and a much nicer programmatic API. It's a shame we can't add payload to Lucene queries to integrate that further.

    FacetRequest request =
        carBuilder
            .facet()
            .named("prices")
            .range() //or should it be range(Integer.class)
            .onField("price")
            .from(0).to(1000).excludeLimit()
            .from(1000).to(1500).excludeLimit()
            .above(1500)
            .orderedBy(FIELD_VALUE);

    query.enableFacet(request);

Something like that.

BTW, unrelated to facets: we could change QueryBuilder to QueryBuilder<T> (as in QueryBuilder<Car>). We don't need the info but that might help people avoid mixing queryBuilders and query results by accident.

I'd put the name in the request object and remove 'query' from enableQueryFacet.

I am getting a bit worried about these APIs that are used under certain conditions only, like getFacetResults(). I wonder if we could do something nicer. I have no real solution, maybe via composition or via the returned object of enableFacet?

That's all for now :)

On 28 Feb 2011, at 16:13, Hardy Ferentschik wrote:

...

Hardy Ferentschik

8:08 a.m.

Thanks for the feedback. Answers inline.

On Tue, 01 Mar 2011 14:39:58 +0100, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

To avoid the problem of Constructor multiplications and still use immutable objects use a builder to collect the information and create the object out of it. The constructor can even be package private

I guess that's the third alternative. If I remember right, that is also a pattern described in "Effective Java". I am mainly wondering what it really gives me to make these objects immutable.
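A minimal sketch of the builder alternative, for illustration. All names here (`DiscreteFacetRequestSketch`, `SortOrder`, the builder methods, the default values) are assumptions of this sketch and not the API in the patch. The request object stays immutable, the constructor is hidden, and optional parameters get defaults instead of extra constructor overloads.

```java
// Hypothetical illustration only; names and defaults are assumptions of this sketch.
final class DiscreteFacetRequestSketch {
    enum SortOrder { COUNT_DESC, COUNT_ASC, FIELD_VALUE }

    private final String name;
    private final String fieldName;
    private final SortOrder sortOrder;
    private final boolean includeZeroCounts;

    // Private: instances can only be created through the builder.
    private DiscreteFacetRequestSketch(Builder builder) {
        this.name = builder.name;
        this.fieldName = builder.fieldName;
        this.sortOrder = builder.sortOrder;
        this.includeZeroCounts = builder.includeZeroCounts;
    }

    static Builder onField(String fieldName) {
        return new Builder(fieldName);
    }

    String name() { return name; }
    String fieldName() { return fieldName; }
    SortOrder sortOrder() { return sortOrder; }
    boolean includeZeroCounts() { return includeZeroCounts; }

    static final class Builder {
        private final String fieldName;
        private String name;
        private SortOrder sortOrder = SortOrder.COUNT_DESC;
        private boolean includeZeroCounts = false;

        private Builder(String fieldName) {
            this.fieldName = fieldName;
            this.name = fieldName; // default facet name: an assumption of this sketch
        }

        Builder named(String name) { this.name = name; return this; }
        Builder orderedBy(SortOrder sortOrder) { this.sortOrder = sortOrder; return this; }
        Builder includeZeroCounts(boolean include) { this.includeZeroCounts = include; return this; }

        DiscreteFacetRequestSketch build() {
            return new DiscreteFacetRequestSketch(this);
        }
    }

    public static void main(String[] args) {
        DiscreteFacetRequestSketch request = DiscreteFacetRequestSketch.onField("cubicCapacity")
                .named("foo")
                .orderedBy(SortOrder.FIELD_VALUE)
                .build();
        System.out.println(request.name() + " on " + request.fieldName()
                + " sorted by " + request.sortOrder());
    }
}
```

What immutability buys here is mainly that a request can be shared or reused without a caller changing it between the time it is enabled and the time the query runs.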

...

My initial reaction was that facets would be an awesome declarative feature, like analyzers, at least for simple use cases (annotations + programmatic mapping API). So +1 for the feature. It seems though that some people would want a very dynamic way to configure their facets, so keeping a programmatic API seems to make sense as well. I am not an expert in faceting so feel free to correct me. If we are not sure, let's start with the pure declarative approach and expand to the programmatic API later.

I think we will need both.

...

In tests, we should try and use the Query DSL instead of raw Lucene queries: we are trying to convert people to this nicer approach and it shows us how good or bad the rest of the API is when integrated.

Just haven't gotten into the habit yet. Personally I am still thinking in Lucene queries.

...

To be consistent, SimpleFacetRequest should be named DiscreteFacetRequest to be symmetric with RangeFacetRequest.

DiscreteFacetRequest is a much better name. thanks

...

Sanne was mentioning the idea of lazy results. Be careful if this laziness means keeping some resources open, we usually don't do that (except when using scroll).

Right. This whole idea needs to be thought through, but I think it is worth pursuing.

...

It's a shame we can't add payload to Lucene queries to integrate that further.

It will be interesting to see how Lucene will add the basic faceting code to its API. At least they started talking about it.

...

    FacetRequest request =
        carBuilder
            .facet()
            .named("prices")
            .range() //or should it be range(Integer.class)
            .onField("price")
            .from(0).to(1000).excludeLimit()
            .from(1000).to(1500).excludeLimit()
            .above(1500)
            .orderedBy(FIELD_VALUE);

    query.enableFacet(request);

Got you. This brings up another question. Should we only have this DSL based approach or should we also keep the current programmatic approach? If we keep it all we could end up with annotation, DSL and "standard programmatic" versions of the API.

...

I am getting a bit worried about these APIs that are used under certain conditions only like getFacetResults(). I wonder if we could do something nicer. I have no real solution, maybe via composition or via the returned object of enableFacet?

Same here. This was another question I forgot to add to my original email. It would be nice to find a solution for this. --Hardy

Emmanuel Bernard

10:34 a.m.

On 1 Mar 2011, at 15:08, Hardy Ferentschik wrote:

...

Thanks for the feedback. Answers inline.

On Tue, 01 Mar 2011 14:39:58 +0100, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

> To avoid the problem of Constructor multiplications and still use immutable objects use a builder to collect the information and create the object out of it. The constructor can even be package private

I guess that's the third alternative. If I remember right, that is also a pattern described in "Effective Java". I am mainly wondering what it really gives me to make these objects immutable.

Life fulfillment.

...

> FacetRequest request =
>     carBuilder
>         .facet()
>         .named("prices")
>         .range() //or should it be range(Integer.class)
>         .onField("price")
>         .from(0).to(1000).excludeLimit()
>         .from(1000).to(1500).excludeLimit()
>         .above(1500)
>         .orderedBy(FIELD_VALUE);
>
> query.enableFacet(request);

Got you. This brings up another question. Should we only have this DSL based approach or should we also keep the current programmatic approach? If we keep it all we could end up with annotation, DSL and "standard programmatic" versions of the API.

I see the DSL as a better approach for the programmatic API, not an alternative. We would keep the underlying API private to us basically. Do you think that makes sense?
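To illustrate how such a fluent range DSL could sit on top of a private request model, here is a minimal self-contained sketch. Everything in it is hypothetical: the class `RangeFacetDsl`, its `Range` type, and in particular the assumption that `to(...)` is inclusive unless followed by `excludeLimit()` and that `above(...)` includes its bound. The thread does not settle these semantics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; semantics of the limits are assumptions of this sketch.
final class RangeFacetDsl {

    // One range; max == null means the range is open-ended.
    static final class Range {
        final int min;
        final Integer max;
        final boolean includeMax;

        Range(int min, Integer max, boolean includeMax) {
            this.min = min;
            this.max = max;
            this.includeMax = includeMax;
        }

        boolean contains(int value) {
            if (value < min) {
                return false;
            }
            if (max == null) {
                return true;
            }
            return includeMax ? value <= max : value < max;
        }
    }

    private final String name;
    private String field;
    private final List<Range> ranges = new ArrayList<>();
    private Integer pendingMin; // set by from(), consumed by to()

    private RangeFacetDsl(String name) {
        this.name = name;
    }

    static RangeFacetDsl named(String name) {
        return new RangeFacetDsl(name);
    }

    RangeFacetDsl onField(String field) {
        this.field = field;
        return this;
    }

    RangeFacetDsl from(int min) {
        pendingMin = min;
        return this;
    }

    RangeFacetDsl to(int max) {
        if (pendingMin == null) {
            throw new IllegalStateException("to() must follow from()");
        }
        ranges.add(new Range(pendingMin, max, true));
        pendingMin = null;
        return this;
    }

    // Makes the upper bound of the most recently added range exclusive.
    RangeFacetDsl excludeLimit() {
        if (ranges.isEmpty()) {
            throw new IllegalStateException("excludeLimit() must follow to()");
        }
        Range last = ranges.remove(ranges.size() - 1);
        ranges.add(new Range(last.min, last.max, false));
        return this;
    }

    RangeFacetDsl above(int min) {
        ranges.add(new Range(min, null, false));
        return this;
    }

    String name() { return name; }
    String field() { return field; }
    List<Range> ranges() { return Collections.unmodifiableList(ranges); }

    public static void main(String[] args) {
        RangeFacetDsl prices = RangeFacetDsl.named("prices")
                .onField("price")
                .from(0).to(1000).excludeLimit()
                .from(1000).to(1500).excludeLimit()
                .above(1500);
        for (Range range : prices.ranges()) {
            System.out.println(range.min + ".." + range.max + " contains 1000: " + range.contains(1000));
        }
    }
}
```

A production DSL would more likely use separate step interfaces so that invalid call orders fail at compile time rather than with an IllegalStateException.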

Hardy Ferentschik

10:36 a.m.

On Tue, 01 Mar 2011 17:34:58 +0100, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

>> FacetRequest request =
>>     carBuilder
>>         .facet()
>>         .named("prices")
>>         .range() //or should it be range(Integer.class)
>>         .onField("price")
>>         .from(0).to(1000).excludeLimit()
>>         .from(1000).to(1500).excludeLimit()
>>         .above(1500)
>>         .orderedBy(FIELD_VALUE);
>>
>> query.enableFacet(request);
>
> Got you. This brings up another question. Should we only have this DSL based approach or should we also keep the current programmatic approach? If we keep it all we could end up with annotation, DSL and "standard programmatic" versions of the API.

I see the DSL as a better approach for the programmatic API, not an alternative. We would keep the underlying API private to us basically. Do you think that makes sense?

Sure.

Sanne Grinovero

Thursday, 10 March

11:48 a.m.

2011/3/1 Emmanuel Bernard <emmanuel(a)hibernate.org>:

...

I don't mean to keep the index resources open. You can get all the EntityInfos you need eagerly, build the faceting map, and close the indexSearcher, but then the map implementation you return can load the entities contained in the values on demand instead of loading all of them at once.

Also, about the default return type of a faceting query, I'd see it as very close to the standard query: the list() API on Session returns a List of managed entities, and the faceting API returns a Map of managed entities. Alternatively, as the projection options might be enabled, just as the list() API returns a list of arrays, the faceting returns a Map where each key is one of the requested facets and each value is a list of arrays (same as list()).

So in my opinion the default behaviour of a faceting Query should be to return managed entities as usual, just organized in a Map structure instead of the usual List.

Cheers,
Sanne
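The lazy map described above could look roughly like this. It is a sketch under assumptions: `LazyFacetResults`, `LazyEntityList` and the `loader` function are hypothetical names, entity ids stand in for EntityInfos, and the loader stands in for whatever would fetch managed entities from the Session. The ids per facet are collected eagerly, so no index resource stays open; entities for a facet are loaded only when one of its elements is first accessed.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not thread-safe, and not Hibernate Search API.
final class LazyFacetResults {

    // Read-only list backed by eagerly collected ids; entities load on first get().
    static final class LazyEntityList<I, E> extends AbstractList<E> {
        private final List<I> ids;
        private final Function<List<I>, List<E>> loader;
        private List<E> loaded; // null until the first element access

        LazyEntityList(List<I> ids, Function<List<I>, List<E>> loader) {
            this.ids = ids;
            this.loader = loader;
        }

        @Override
        public E get(int index) {
            if (loaded == null) {
                loaded = loader.apply(ids); // one batch load per facet
            }
            return loaded.get(index);
        }

        @Override
        public int size() {
            return ids.size(); // the facet count is known without loading anything
        }
    }

    static <I, E> Map<String, List<E>> wrap(Map<String, List<I>> idsByFacet,
                                            Function<List<I>, List<E>> loader) {
        Map<String, List<E>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<I>> entry : idsByFacet.entrySet()) {
            result.put(entry.getKey(), new LazyEntityList<>(entry.getValue(), loader));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idsByFacet = new LinkedHashMap<>();
        idsByFacet.put("red", List.of(1, 4));
        idsByFacet.put("blue", List.of(2));

        Map<String, List<String>> facets = wrap(idsByFacet, ids -> {
            System.out.println("loading " + ids);
            List<String> cars = new ArrayList<>();
            for (Integer id : ids) {
                cars.add("Car#" + id);
            }
            return cars;
        });

        System.out.println("red count: " + facets.get("red").size()); // no load yet
        System.out.println("first red: " + facets.get("red").get(0)); // triggers the load
    }
}
```

With this shape, showing only the counts costs no entity loading, and "expanding" a facet in a UI loads just that facet's entities.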



hibernate-dev@lists.jboss.org
