Hi,
I've been having this discussion on the forum with Sanne (
https://forums.hibernate.org/viewtopic.php?f=9&t=1023973) about the
problems I have with cascading database reads. I thought it might be
interesting to continue this conversation here with some additional
thoughts:
There are a number of situations where an update to an entity triggers a
cascade of database reads, which is needed because Lucene doesn't allow
document updates but instead requires a complete recreation of the document
from scratch. These cascading reads can bring a server to its knees or make
an application unresponsive.
I find that the speed of Hibernate Search's reads is often offset by the
cascade of database reads that may occur when an indexed entity is updated.
However, that very read speed is a major reason for using it, so it would
be great if the write speed problems could be alleviated.
E.g. some simplified examples:
@Indexed
public class Network {
    @OneToMany
    @IndexedEmbedded(includePaths={"id"})
    List<User> users;
}
When a new User is added to the Network, all the existing Users have to be
read from the database to recreate the Lucene document.
Another headache is when a property that is indexed for selection purposes
changes:
@Indexed
public class LinkedInGroup {
    @Field(index=Index.YES)
    boolean hidden;

    @OneToMany
    @ContainedIn
    List<LinkedInGroupPost> posts;

    @OneToMany
    @IndexedEmbedded(includePaths={"id"})
    Set<Tag> tags;
}
@Indexed
public class LinkedInGroupPost {
    @ManyToOne
    @IndexedEmbedded(includePaths={"id","hidden"})
    LinkedInGroup group;
}
Assuming there can be hundreds of thousands of Posts, a change of hidden to
true would trigger a read of all those records.
While one could argue that you should pick the architecture that best fits
both the application and the technology, I really think that Hibernate Search
should be able to handle these kinds of use cases more gracefully, without
triggering excessive database reads.
Some directions for thought:
* In the Network example, includePaths only contains the id. Looking at my
own work, I often find that @IndexedEmbedded references just store the id,
and I believe we should think about optimizing this use case. An optimized
read could then be executed against the database that fetches just that value
instead of initializing the entire entity (a rough sketch follows after this
list).
This kind of "projection read" could even be an optional setting when
includePaths contains non-identifier values, assuming the developer knows
which limitations this entails (e.g. no FieldBridges, no Hibernate cache).
It's a kind of "document oriented" MassIndexer approach to Document
"updates".
* Lucene document update support is in an alpha stage right now
(LUCENE-3837). This effort could be supported by the Hibernate team, or
adopted at the earliest viable moment.
* A kind of JoinFilter is conceivable, where the filter would be able to
exclude results based on values selected from another index.
E.g. one queries the LinkedInGroupPost index, but the JoinFilter reads the
group.id references from the LinkedInGroup index (just reading the ones
needed and caching them during the read) and excludes LinkedInGroupPosts
based on the value of "hidden". I wonder if this approach could be patterned
or documented (a rough sketch follows after this list).
* The documentation could contain some suggestions on how to deal with
cascading initialization in a smart way.
* In the tests I have done, saving a LinkedInGroup whose @IndexedEmbedded
attributes (id, hidden) are *not* dirty still reinitializes all the posts.
The reason is that with a Set<Tag>, the set elements are deleted and
reinserted on each save even when they haven't changed. It looks like
Hibernate Search is not optimized for this "semi-dirty" situation (Hibernate
ORM treats the field as dirty when it really isn't). Nothing relevant to the
document actually changed, but because Hibernate needs to reinsert the set,
Hibernate Search thinks it did. I wonder if this use case can or should be
optimized. If not, the documentation should warn against using Sets.
* When a document is recreated because one attribute changed, leading to all
sorts of cascading database reads, I often wonder: why? The reason is that
the index segments cannot be recreated for the indexed attributes, so we need
to read them again. But what if those attributes are actually stored in the
original document and not dirty? Why not read those values straight from the
existing document with a single read instead of executing a slew of database
reads? (A rough sketch follows after this list.)
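To make the "projection read" idea a bit more concrete, here is a minimal
sketch of the kind of query Hibernate Search could run internally for the
Network example, using plain Hibernate; the session and networkId variables
are assumed, and a real implementation would of course go through the mapping
metadata:

// Fetch only the User ids needed for includePaths={"id"}, instead of
// initializing every User entity in the collection.
List<?> userIds = session.createQuery(
        "select u.id from Network n join n.users u where n.id = :networkId")
    .setParameter("networkId", networkId)
    .list();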
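For the JoinFilter idea, a very rough sketch with plain Lucene of what such a
filter could do under the hood (this is not an existing Hibernate Search
feature; IndexSearcher instances for both indexes and the user's actual query
originalPostQuery are assumed, the boolean is assumed to be indexed as the
string "true", and a real implementation would use a Collector and cache the
ids instead of a capped search):

// 1. Read the ids of all hidden groups from the LinkedInGroup index.
TopDocs hiddenGroups = groupSearcher.search(
        new TermQuery(new Term("hidden", "true")), 10000);

// 2. Query the post index, excluding posts that reference a hidden group.
BooleanQuery query = new BooleanQuery();
query.add(originalPostQuery, BooleanClause.Occur.MUST);
for (ScoreDoc hit : hiddenGroups.scoreDocs) {
    String groupId = groupSearcher.doc(hit.doc).get("id");
    query.add(new TermQuery(new Term("group.id", groupId)),
            BooleanClause.Occur.MUST_NOT);
}
TopDocs visiblePosts = postSearcher.search(query, 100);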
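And for the last point, reusing the stored, non-dirty values from the
existing document instead of re-reading them from the database could look
roughly like this in plain Lucene (again just a sketch; it assumes the
relevant fields were stored with Store.YES and that postSearcher and postId
are available):

// Look up the current document for the post by its id field...
TopDocs existing = postSearcher.search(
        new TermQuery(new Term("id", postId)), 1);
if (existing.scoreDocs.length > 0) {
    Document oldDocument = postSearcher.doc(existing.scoreDocs[0].doc);
    // ...and copy the stored values that are not dirty into the replacement
    // document instead of issuing database reads for them.
    String groupId = oldDocument.get("group.id");
}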