Thanks Marc for this feedback.
That's a very long multi-proposal email :) We should probably try to
split it into several threads. Let me reply inline.
On Thu 2012-11-22 20:10, Marc Schipperheyn wrote:
Network
    @OneToMany
    @IndexedEmbedded(includePaths={"id"})
    List<User> users;
When a new User is added to the Network, all the existing Users have to be
read from the database to recreate the Lucene document.
Another headache example is when a stored property that is used for
selection purposes changes
LinkedInGroup
    @Field(index=Index.YES)
    boolean hidden;

    @OneToMany
    @ContainedIn
    List<LinkedInGroupPost> posts;

    @OneToMany
    @IndexedEmbedded(includePaths={"id"})
    Set<Tag> tags;

LinkedInGroupPost
    @ManyToOne
    @IndexedEmbedded(includePaths={"id","hidden"})
    LinkedInGroup group;
Assuming there can be hundreds of thousands of Posts, a change of hidden to
true would trigger a read of all those records.
One approach that might work much better in this case is to use filters
rather than indexing hidden and using it in the query as restriction. I
imagine hidden is not selective enough which does not make for the best
use of an inverted index.
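To make the difference concrete, here is a minimal plain-Java sketch of what a filter does conceptually: visibility is kept as a cached bit set over document ids and intersected with the query hits, rather than being a term stored in every post document. The names (HiddenFilterSketch, applyFilter) and the document ids are made up for illustration; this is not the Hibernate Search filter API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Conceptual model only: a filter is a cached set of allowed document ids
// that is intersected with the hits of the main query.
class HiddenFilterSketch {

    // Hypothetical helper: keep only the hits whose bit is set in the filter.
    static List<Integer> applyFilter(List<Integer> hits, BitSet visibleDocs) {
        List<Integer> result = new ArrayList<>();
        for (int docId : hits) {
            if (visibleDocs.get(docId)) {
                result.add(docId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Documents 0..4 are posts; posts 1 and 3 belong to a hidden group.
        BitSet visibleDocs = new BitSet(5);
        visibleDocs.set(0, 5);   // mark everything visible first
        visibleDocs.clear(1);
        visibleDocs.clear(3);

        List<Integer> hits = Arrays.asList(0, 1, 2, 3);
        System.out.println(applyFilter(hits, visibleDocs));
    }
}
```

In this model, flipping hidden on a group means recomputing one bit set; the post documents themselves are left alone.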
* In the Network example, the includePaths only contains the id. Looking
at my own work, I often find that IndexedEmbedded references just store
the id and I believe we should think about optimizing this use case. In
that case an optimized read from the database could be executed that just
reads that value instead of initializing the entire entity.
I forgot what we said around includePaths and class level bridges but
that looks like a good idea. We might be able to look at the paths and
check if any of them contains an association. If not, we could use a
projection to query the meaningful data. That's not at all how Hibernate
Search works today, so I imagine that could be significant work, but
this does not look impossible.
Can you open a JIRA issue for this?
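As a starting point for that issue, the path check could look roughly like the sketch below. It is deliberately conservative: any dotted path is treated as possibly crossing an association, and a single-segment path is rejected if it names an association of the root entity. ProjectionCheckSketch, canUseProjection and the associationNames parameter are hypothetical; nothing like this exists in Hibernate Search today.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical eligibility check for a "projection read" of embedded values.
class ProjectionCheckSketch {

    // Returns true only when every included path is a plain, non-association
    // property of the root entity. Dotted paths are rejected conservatively.
    static boolean canUseProjection(List<String> includePaths, Set<String> associationNames) {
        for (String path : includePaths) {
            if (path.contains(".") || associationNames.contains(path)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed associations of LinkedInGroup, taken from the example above.
        Set<String> associations = new HashSet<>(Arrays.asList("posts", "tags"));
        System.out.println(canUseProjection(Arrays.asList("id", "hidden"), associations));
        System.out.println(canUseProjection(Arrays.asList("id", "tags.id"), associations));
    }
}
```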
This kind of "projection read" could be an optional setting even when
includePaths contains non identifier values, assuming the developer knows
which limitations this might entail (e.g. no FieldBridges, no Hibernate
Why do you say no FieldBridge?
* Lucene Document update support is at an alpha stage right now
LUCENE-3837. This effort could be supported by the Hibernate team or
implemented at the earliest viable moment.
We are keeping an eye on it. Lucene 4 is a major departure from Lucene 3,
so the conversion won't be easy and, worse, it won't be fully transparent
for Hibernate Search users unfortunately.
* A kind of JoinFilter is conceivable where the join filter would be able
to exclude results based on selection results from another index.
E.g. one queries the LinkedInGroupPost but the JoinFilter reads group.id
references from the Group index (just reading the ones needed and storing
them during the read) and excludes LinkedInGroupPosts based on the value of
"hidden". I wonder if this approach could be patterned or documented.
I am pretty sure filters are what you are looking for
http://docs.jboss.org/hibernate/search/4.2/reference/en-US/html_single/#q...
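The join you describe can be modeled in a few lines of plain Java, which may help when we write it up. In this sketch a map stands in for the Group index, each hidden flag is looked up once per group id and remembered, and posts whose group is hidden are dropped. JoinFilterSketch and its members are hypothetical names, and treating an unknown group id as visible is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual model of a join filter; not an existing Hibernate Search class.
class JoinFilterSketch {

    // Stand-in for the Group index: group id -> hidden flag.
    private final Map<Long, Boolean> groupIndex;
    // Flags already read during this query, so each group is read only once.
    private final Map<Long, Boolean> cache = new HashMap<>();
    // Number of reads against the stand-in Group index.
    int lookups = 0;

    JoinFilterSketch(Map<Long, Boolean> groupIndex) {
        this.groupIndex = groupIndex;
    }

    private boolean isHidden(long groupId) {
        Boolean cached = cache.get(groupId);
        if (cached == null) {
            lookups++;
            Boolean stored = groupIndex.get(groupId);
            cached = stored != null && stored;   // unknown groups count as visible
            cache.put(groupId, cached);
        }
        return cached;
    }

    // postGroupIds[i] is the group id stored on post i (the "group.id" field).
    List<Integer> visiblePosts(long[] postGroupIds) {
        List<Integer> visible = new ArrayList<>();
        for (int post = 0; post < postGroupIds.length; post++) {
            if (!isHidden(postGroupIds[post])) {
                visible.add(post);
            }
        }
        return visible;
    }
}
```

With two groups and five posts, only two reads of the group data are needed, however many posts reference them.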
* The documentation could contain some suggestions for dealing with the
issue of cascading initialization and how to deal with this in a smart way.
Sure, let's identify what we consider smart and update the doc.
Can you create a JIRA issue for that?
* In the tests I have done, saving a LinkedInPostGroup where the
indexedEmbedded attributes (id,hidden) are *not* dirty, all the posts are
reinitialized anyway. The reason for this is that with a Set<Tag> the set
elements are deleted and reinserted on each save even when they haven't
changed.
Hum, I believe it's true for the bag semantic but I'm surprised it's
true for Set. Besides, from what you are saying, you don't add nor
remove elements from the Set, you just change some non id value.
It looks like Hibernate Search is not optimized to deal with this
"semi-dirty" situation (Hibernate ORM treats a field as dirty when it
really isn't). Nothing really changed in the relevant values for the
document but because Hibernate needs to reinsert the set, it thinks so. I
wonder if this use case can or should be optimized. If not, documentation
should warn against using Sets.
Can you create a minimal test case and open a JIRA issue / pull request?
This needs to be investigated.
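For the test case, the check we would ideally want boils down to something like the sketch below: compare the indexed values (here only the ids) of the collection before and after the flush, and skip reindexing when they are equal, even if the collection was physically deleted and reinserted. SketchTag, SetDirtinessSketch and needsReindex are hypothetical names; this is not how Hibernate Search decides dirtiness today.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Minimal stand-in for the Tag entity: only the id is indexed.
class SketchTag {
    final long id;
    final String name;

    SketchTag(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

class SetDirtinessSketch {

    // Extracts the values that actually end up in the Lucene document.
    static Set<Long> indexedIds(Collection<SketchTag> tags) {
        Set<Long> ids = new HashSet<>();
        for (SketchTag tag : tags) {
            ids.add(tag.id);
        }
        return ids;
    }

    // The document needs rebuilding only if the indexed ids differ.
    static boolean needsReindex(Collection<SketchTag> before, Collection<SketchTag> after) {
        return !indexedIds(before).equals(indexedIds(after));
    }

    public static void main(String[] args) {
        Collection<SketchTag> before = Arrays.asList(new SketchTag(1, "java"), new SketchTag(2, "lucene"));
        // Same ids, reinserted in another order with a renamed tag.
        Collection<SketchTag> after = Arrays.asList(new SketchTag(2, "Lucene"), new SketchTag(1, "java"));
        System.out.println(needsReindex(before, after));
    }
}
```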
* When a document is recreated because one attribute is changed, leading to
all sorts of cascading database reads, I often wonder: why? The reason is
that the index segments cannot be recreated for the indexed attributes. So
we need to read them again. But what if those attributes are actually
stored in the original document and not dirty? Why not just read these
values straight from the document with a single read instead of executing
a slew of database reads?
That might be true in some situations but FieldBridges are not
guaranteed to be non destructive in their stored data. So we cannot
generalize that necessarily.
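A small illustration of what I mean by destructive. The sketch below is a plain-Java stand-in for the object-to-string step of a bridge that normalizes its input; it does not implement the real FieldBridge interface, and LossyBridgeSketch is a made-up name. Two different property values end up as the same stored string, so the original value cannot be recovered from the document.

```java
import java.util.Locale;

// Stand-in for a bridge that normalizes values before storing them.
class LossyBridgeSketch {

    // Trims and lower-cases the value, which discards information.
    static String objectToString(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String first = objectToString("Hibernate Search ");
        String second = objectToString("hibernate search");
        // Both inputs collapse to the same stored value.
        System.out.println(first.equals(second));
    }
}
```

Rebuilding a document from such stored values would silently index the normalized form as if it were the original property.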
We could explore this idea in a prototype. Again can you open a JIRA
issue?
Emmanuel