[hibernate-dev] HSEARCH virtual fields (WAS Re: About mapping of @IndexedEmbedded and @DocumentId)

Wed Mar 16 14:49:35 EDT 2011

2011/3/16 Emmanuel Bernard <emmanuel at hibernate.org>:
> It's an interesting idea but I am not sure it works in that many practical cases. It's definitely worth exploring though.
>
> Here are a few comments:
>  - does not work for embedded collections

no, but we don't support them yet ;)
Also embedded collections wouldn't have their own index, so I see no
point in applying such an optimization.

>  - => does not work in a bidirectional @ManyToOne or @*ToMany environment (quite common)
Agreed, but in the Lucene community there is talk of possible new
approaches to mapping parent/child relations between documents, such as
LUCENE-2454,
which would be nice to have but requires some sort of new
"intelligence" during indexing and querying.
I mention it because it looks related to this and would be a nice
complement (even though the idea seems to support only one
parent/child relation per document).

>  - work in a unidirectional env where the associated entity has no back dependency
>  - will need to store _hibernate_class for the associated entities
>  - mental note to stop talking with you, it generates weird ideas
>
> I've noticed that you're quite concerned about index size generally, do you think size is a major drawback?

Not *major* but when the index is too big, performance gets worse and
scaling gets harder.
I've seen some people cut mercilessly on features to get the index
size down to something that can be cached in memory.
Also an interesting talk at Apache Lucene EuroCon 2010 described the
strategy of keeping multiple copies of the index, each supporting a
different range of queries: the smallest in memory, the
medium one on fast SSDs, the biggest on mass storage. By analysing usage
patterns the speaker was able to route most queries to the
smallest and quickest index.
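
To make the routing idea concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: the class name `TieredIndexRouter`, the `Feature` enum and the three tiers are hypothetical and not part of Lucene, Hibernate Search or the talk. It only shows the selection step: pick the smallest copy that can answer the query.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout: this is not a Lucene or Hibernate Search API.
class TieredIndexRouter {

    // Query capabilities an index copy may or may not support (assumed set).
    enum Feature { TERM, PHRASE, SORT, HIGHLIGHT }

    // One copy of the index: where it lives and which queries it can answer.
    static final class Tier {
        final String name;
        final Set<Feature> supported;

        Tier(String name, Set<Feature> supported) {
            this.name = name;
            this.supported = supported;
        }
    }

    // Tiers ordered from smallest/fastest to biggest/slowest.
    private final List<Tier> tiers;

    TieredIndexRouter(List<Tier> tiersSmallestFirst) {
        this.tiers = tiersSmallestFirst;
    }

    // Return the first (smallest) tier that supports every required feature.
    Tier route(Set<Feature> required) {
        for (Tier tier : tiers) {
            if (tier.supported.containsAll(required)) {
                return tier;
            }
        }
        throw new IllegalArgumentException("No index copy supports " + required);
    }

    // Example layout: memory, SSD and mass storage, as in the description above.
    static TieredIndexRouter example() {
        return new TieredIndexRouter(Arrays.asList(
            new Tier("memory", EnumSet.of(Feature.TERM)),
            new Tier("ssd", EnumSet.of(Feature.TERM, Feature.PHRASE, Feature.SORT)),
            new Tier("disk", EnumSet.allOf(Feature.class))));
    }
}
```

With this layout a plain term query is served from the in-memory copy, and only a query needing highlighting falls through to mass storage. The smaller the first tier's index, the more of it fits in memory, which is why index size matters here.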

Infinispan helps a bit as you might be able to get help from other
nodes instead of going to mass storage, but still you want it all in
memory - if Infinispan has to swap out as well it won't look very
good, so you'll end up again figuring out how to slim down the index.
In conclusion, as a user I wouldn't like to see a library forcing
me to store metadata in the index unless there were good reasons.
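
For reference, the "virtual fields" idea quoted further down (mapping the field name the user sees to a different internal name) could be sketched roughly as follows. This is only an illustration under my own assumptions: `VirtualFieldMapper` and its methods are hypothetical names and nothing like this exists in Hibernate Search. It covers only the name-resolution step; linking the matching document back to the containing entity is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Hibernate Search.
class VirtualFieldMapper {

    // Maps the field name a user sees (e.g. "author.name" on Book) to the
    // single internal field that actually holds the analysed text.
    private final Map<String, String> virtualToInternal = new HashMap<>();

    // Register a virtual name as an alias of an internal field name.
    void alias(String virtualName, String internalName) {
        virtualToInternal.put(virtualName, internalName);
    }

    // Unmapped names resolve to themselves.
    String resolve(String fieldName) {
        return virtualToInternal.getOrDefault(fieldName, fieldName);
    }

    // Rewrite a simple "field:value" query so it targets the internal field.
    String rewrite(String query) {
        int colon = query.indexOf(':');
        if (colon < 0) {
            return query; // no field prefix, nothing to re-route
        }
        return resolve(query.substring(0, colon)) + query.substring(colon);
    }
}
```

After `alias("author.name", "Author.name")`, a query on `author.name` is re-routed to the field stored once for the Author entity, so the same text would not need to be analysed and stored twice.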

Cheers,
Sanne

>
> On 15 mars 2011, at 19:00, Sanne Grinovero wrote:
>
>> Well, since this is not very urgent, I'm going to create two JIRAs for it:
>> one for the norms, and one for possibly changing the storage of the
>> optional field.
>> I just wanted to mention it because I found the behaviour surprising while debugging.
>>
>> *But* your last sentence gave me a weird idea.
>>
>> Nowadays if you have two related entities, both indexed, and one is
>> @ContainedIn in the other, we'll create two documents. Actually if you
>> think about it, we create two documents which contain exactly the same
>> values, just the fieldnames are different.
>> So what happens is that we're actually duplicating the size of the
>> index, or even more depending on the depth of the tree of indexed &&
>> related objects. And we also analyse the same text twice, using the
>> same analyser.
>>
>> If we had something like "virtual fields", where we
>> dynamically map the fieldName to a different internal name, we could
>> re-route indexing and also queries built using the DSL in clever ways,
>> having them point to the correct fields.
>>
>> The downside is of course that we take away the option to directly
>> access the Document in FieldBridges and ClassBridges - nasty, but
>> we'll likely need that anyway as nothing is serializable anymore, so
>> we already need to provide some proxy object mimicking the Document,
>> to be used as a value container to be sent to the backend.
>>
>> Other smaller advantages:
>> * We'll be able to intercept what FieldBridges actually write to,
>> don't remember now but in some cases we were stuck on that (like
>> detecting duplicates and conflicting fieldnames)
>> * some better JOIN implementations can use it, as long as the number
>> of elements is limited, by creating fieldname_1, fieldname_2, etc., or
>> just using the same name but controlling the order in which they're
>> listed.
>> * Updating index - a single document can be split in parts, so that
>> when an entity is indexed we don't have to reload/rewrite many related
>> entities. Don't know the details, but remember having read some
>> limited form of this is possible.
>>
>> I'm sure we can come up with more, let's see if it's worth the
>> complexity and other problems. I'd propose it for version 4, a nice
>> breaking change :)
>>
>> Cheers,
>> Sanne
>>
>>
>>
>>
>> 2011/3/15 Emmanuel Bernard <emmanuel at hibernate.org>:
>>>
>>> On 15 mars 2011, at 18:24, Emmanuel Bernard wrote:
>>>
>>>>
>>>> On 15 mars 2011, at 16:43, Sanne Grinovero wrote:
>>>>
>>>>> I guess we should prevent the indexing of ids in secondary elements?
>>>>
>>>> If the associated element is an entity, it's perfectly valid to index its id and query by it
>>>>
>>>> //return all books when the author's id is 2
>>>> "author.id:2"
>>>
>>> So if we find a way to support this use case and yours, let's go.
>>> BTW I agree we should at least not store the norm.
>>>
>>> Maybe something like that:
>>>
>>> //default to today's behavior minus the norm
>>> @DocumentId(indexWhenEmbedded=@Field(...))
>>>
>>>
>
>