[hibernate-dev] Hibernate Search: Adding more "hidden" fields to the index

Yoann Rodiere yoann at hibernate.org
Thu Apr 27 12:58:51 EDT 2017


> I had written the "why" on HSEARCH-2616, but to clarify here: [...]

Thanks. So the problem is that we may not be able to update the batch state
upon failure, in which case we would use the less-safe AddLuceneWork upon
restart.
If we had some way to store the information "this partition has started"
*before* we even write to the index, this wouldn't be a problem, but, as you
might have guessed, JSR-352 doesn't allow that.
So you're right, deleting everything before we even start working is our
best solution. And thus a hidden field will be necessary. I'll continue the
discussion on JIRA.
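
For reference, here is a minimal sketch of the delete-first approach, under
loud assumptions: it uses an in-memory list as a stand-in for the index, not
the Lucene or Hibernate Search API, and the field name `__partition_id`, the
class name and the `failAfter` parameter are all hypothetical. It only
illustrates why purging a partition's documents before appending makes a
restart safe even when the batch state was never updated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for an index; NOT the Lucene or Hibernate Search API. */
class PartitionRestartSketch {

    // Hypothetical name for the hidden field; the thread does not specify one.
    static final String PARTITION_FIELD = "__partition_id";

    final List<Map<String, String>> documents = new ArrayList<>();

    /** Plain append, like an add-only work: no check for an existing copy. */
    void add(String entityId, String partitionId) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", entityId);
        doc.put(PARTITION_FIELD, partitionId);
        documents.add(doc);
    }

    /** Targeted delete: drops every document tagged with the given partition. */
    void deletePartition(String partitionId) {
        documents.removeIf(doc -> partitionId.equals(doc.get(PARTITION_FIELD)));
    }

    /**
     * Runs one partition: purge first, then append.
     * Throws after failAfter documents were written; a negative value never fails.
     */
    void runPartition(String partitionId, List<String> entityIds, int failAfter) {
        deletePartition(partitionId);
        int written = 0;
        for (String entityId : entityIds) {
            if (written == failAfter) {
                throw new IllegalStateException("simulated I/O failure");
            }
            add(entityId, partitionId);
            written++;
        }
    }

    public static void main(String[] args) {
        PartitionRestartSketch index = new PartitionRestartSketch();
        List<String> ids = Arrays.asList("1", "2", "3");
        try {
            index.runPartition("p0", ids, 2); // fails after two documents
        } catch (IllegalStateException e) {
            // No batch state was saved; the job only knows it must rerun "p0".
        }
        index.runPartition("p0", ids, -1); // restart without failure
        System.out.println(index.documents.size()); // prints 3, no duplicates
    }
}
```

Without the initial `deletePartition` call, the restart would leave five
documents in the list, which is the duplication problem the hidden field is
meant to avoid.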

Yoann Rodière
Hibernate NoORM Team
yoann at hibernate.org

On 27 April 2017 at 18:19, Sanne Grinovero <sanne at hibernate.org> wrote:

> On 27 April 2017 at 15:11, Yoann Rodiere <yoann at hibernate.org> wrote:
> > I wonder, what's the benefit for HSEARCH-2616? Do you want to have that
> > field so that we can just use AddLuceneWorks everywhere, and run targeted
> > delete operations when we start a partition? If so, is it as a fallback
> > solution, if what I proposed cannot be implemented, or as a better
> > alternative? Note I don't have strong arguments against that solution, I'm
> > just trying to understand the "why".
>
> I had written the "why" on HSEARCH-2616, but to clarify here:
>
> I liked your idea of trying to figure out if the current block of work
> is being repeated, vs it being a re-try. However while I initially
> thought to add such a field as a fallback solution, I believe it's
> ultimately the more robust solution as otherwise you have to trust
> such state, which could be lost / wrong / corrupted independently for
> a number of reasons.
> Since the problem being solved is about resuming the process after a
> problem happened we can't make many safe assumptions about what kind
> of problem we're dealing with; for example if you run out of disk
> space you'll have a half-written index but no way to store such
> batch-state. Other problems might involve indexes being backed up /
> restored / replicated over other technologies (rsync, Infinispan, ..)
> so a mismatch between the index and other state is yet another problem
> which might need caution, logs and possibly tooling.
> Say an IO operation fails during an index write flush: some admin
> intervenes fixing hardware and then triggers resume of indexing.
> In such conditions I wouldn't trust some additional persistent state
> not even if it were cryptographically signed to be correct: corruption
> or signature mismatches could be detected but in this case there's the
> risk of it being trusted but out of date: with IO unavailable when
> this should have been written you're probably reading the previous
> version which had been written. Having an out of date batch state
> would likely have the opposite effect of what we need.
>
> On the other hand, inspecting what's in the index is coupled with the
> index state, so while indexes could be corrupted, the progress tracking
> state and the index are one thing and you're not easily fooled.
>
> Since I agree that having additional fields is not something everyone
> will like, as I suggested on HSEARCH-2616 we could offer the
> alternatives as fallback.
>
> >
> > On adding a hidden field, I wonder what this will mean for Elasticsearch; if
> > we start doing such things, we should clearly and explicitly state in the
> > documentation that targeting existing ES schemas without adapting them to
> > Hibernate Search is not supported.
> > On top of that, it may hurt users upgrading Hibernate Search: Lucene may
> > simply ignore queries against a field that doesn't exist in the index, but
> > I'm not sure Elasticsearch behaves that way when the field isn't even
> > defined in the mapping. So users may have to upgrade their schema just for
> > that. I know Elasticsearch integration is experimental anyway, but what I
> > mean is if we do that, it must be *before* we drop the
> > "experimental" mention on Elasticsearch integration.
>
> Good point. Such proposals to change some internal field don't happen
> very often though.
>
> We strive to have a stable encoding, but since the index is not the
> database, well-documented changes might be worth it.
> Especially "private internal" fields should not be too hard to manage,
> as we can deal with them explicitly in some lenient way, and if they
> don't contain end user state, like in this case, we don't even have to
> require an index rebuild.
>
> People not wanting this can have a slower mass indexer, or
> not support recovery.
>
> Thanks,
> Sanne
>
>
> >
> >
> > Yoann Rodière
> > Hibernate NoORM Team
> > yoann at hibernate.org
> >
> > On 27 April 2017 at 15:59, Yoann Rodiere <yrodiere at redhat.com> wrote:
> >>
> >> I wonder, what's the benefit for HSEARCH-2616? Do you want to have that
> >> field so that we can just use AddLuceneWorks everywhere, and run targeted
> >> delete operations when we start a partition? If so, is it as a fallback
> >> solution, if what I proposed cannot be implemented, or as a better
> >> alternative? Note I don't have strong arguments against that solution, I'm
> >> just trying to understand the "why".
> >>
> >> On adding a hidden field, I wonder what this will mean for Elasticsearch;
> >> if we start doing such things, we should clearly and explicitly state in the
> >> documentation that targeting existing ES schemas without adapting them to
> >> Hibernate Search is not supported.
> >> On top of that, it may hurt users upgrading Hibernate Search: Lucene may
> >> simply ignore queries against a field that doesn't exist in the index, but
> >> I'm not sure Elasticsearch behaves that way when the field isn't even
> >> defined in the mapping. So users may have to upgrade their schema just for
> >> that. I know Elasticsearch integration is experimental anyway, but what I
> >> mean is if we do that, it must be *before* we drop the
> >> "experimental" mention on Elasticsearch integration.
> >>
> >>
> >> Yoann Rodière
> >> Software Engineer, Hibernate NoORM Team
> >> Red Hat
> >> yrodiere at redhat.com
> >>
> >> On 27 April 2017 at 15:23, Sanne Grinovero <sanne at hibernate.org> wrote:
> >>>
> >>> To better implement recovery operations during MassIndexer
> >>> [HSEARCH-2616] - specifically in the context of the upcoming JBatch
> >>> based implementation - I'm considering the benefits of adding one more
> >>> field to the Lucene index for our internal purposes.
> >>>
> >>> This new field is only useful for Hibernate Search internals so we
> >>> shouldn't allow it to be targeted by queries, etc.
> >>>
> >>> There is a single precedent: we already encode the entity name, so
> >>> "hiding fields" is not a new problem that we have to deal with. It
> >>> might be a reason to polish the existing concept and improve the
> >>> encapsulation.
> >>>
> >>> Would anyone have a strong case against this?
> >>>
> >>> Thanks,
> >>> Sanne

