Thanks. So the problem is that we may not be able to update the batch state
upon failure, in which case we would use the less-safe AddLuceneWork upon
restart.
If we had some way to store the information "this partition has started"
*before* we even write to the index, this wouldn't be a problem, but as you
might have guessed JSR-352 doesn't allow that.
So you're right, deleting everything before we even start working is our
best solution. And thus a hidden field will be necessary. I'll continue the
discussion on JIRA.
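
To make the "delete first, then add" idea concrete, here is a minimal sketch using a toy in-memory index. Everything in it (ToyIndex, Doc, indexPartition, the partitionId marker) is a hypothetical name chosen for illustration; it is not the Hibernate Search or Lucene API, and the blind add only stands in for what AddLuceneWork does in spirit.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory stand-in for an index. Hypothetical names throughout;
// this is NOT the Hibernate Search or Lucene API.
class ToyIndex {

    // One indexed document: the entity id plus the hidden partition marker.
    static final class Doc {
        final String entityId;
        final String partitionId;

        Doc(String entityId, String partitionId) {
            this.entityId = entityId;
            this.partitionId = partitionId;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    // Blind add: never checks whether a copy of the entity already exists.
    void add(String entityId, String partitionId) {
        docs.add(new Doc(entityId, partitionId));
    }

    // Targeted delete on the hidden field.
    void deleteByPartition(String partitionId) {
        docs.removeIf(d -> d.partitionId.equals(partitionId));
    }

    // (Re)start a partition: purge whatever an earlier attempt left behind,
    // then add blindly. Running this twice leaves no duplicates.
    void indexPartition(String partitionId, List<String> entityIds) {
        deleteByPartition(partitionId);
        for (String id : entityIds) {
            add(id, partitionId);
        }
    }

    // Number of documents stored for a given entity id.
    long count(String entityId) {
        return docs.stream().filter(d -> d.entityId.equals(entityId)).count();
    }

    int size() {
        return docs.size();
    }
}
```

Because the purge happens unconditionally at the start of the partition, the sketch never needs to know whether a previous attempt got anywhere, which is the property we are after.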
Yoann Rodière
Hibernate NoORM Team
yoann(a)hibernate.org
On 27 April 2017 at 18:19, Sanne Grinovero <sanne(a)hibernate.org> wrote:
On 27 April 2017 at 15:11, Yoann Rodiere <yoann(a)hibernate.org> wrote:
> I wonder, what's the benefit for HSEARCH-2616? Do you want to have that
> field so that we can just use AddLuceneWorks everywhere, and run targeted
> delete operations when we start a partition? If so, is it as a fallback
> solution, if what I proposed cannot be implemented, or as a better
> alternative? Note I don't have strong arguments against that solution, I'm
> just trying to understand the "why".
I had written the "why" on HSEARCH-2616, but to clarify here:
I liked your idea of trying to figure out if the current block of work
is being repeated, vs it being a re-try. However while I initially
thought to add such a field as a fallback solution, I believe it's
ultimately the more robust solution as otherwise you have to trust
such state, which could be lost / wrong / corrupted independently for
a number of reasons.
Since the problem being solved is about resuming the process after a
problem happened we can't make many safe assumptions about what kind
of problem we're dealing with; for example if you run out of disk
space you'll have a half-written index but no way to store such
batch-state. Other problems might involve indexes being backed up /
restored / replicated over other technologies (rsync, Infinispan, ..)
so a mismatch between the index and other state is yet another problem
which might need caution, logs and possibly tooling.
Say an IO operation fails during an index write flush: some admin
intervenes, fixes the hardware and then triggers a resume of indexing.
In such conditions I wouldn't trust additional persistent state to be
correct, not even if it were cryptographically signed: corruption or
signature mismatches could be detected, but there's still the risk of
the state being valid yet out of date. With IO unavailable when it
should have been written, you're probably reading the previously
written version. An out-of-date batch state would likely have the
opposite effect of what we need.
On the other hand, what you find by inspecting the index is inherently
tied to the index state. Indexes can still be corrupted, but since the
progress tracking state and the index are one and the same thing, you're
not easily fooled.
Since I agree that having additional fields is not something everyone
will like, as I suggested on HSEARCH-2616 we could offer the
alternatives as fallback.
>
> On adding a hidden field, I wonder what this will mean for Elasticsearch; if
> we start doing such things, we should clearly and explicitly state in the
> documentation that targeting existing ES schemas without adapting them to
> Hibernate Search is not supported.
> On top of that, it may hurt users upgrading Hibernate Search: Lucene may
> simply ignore queries against a field that doesn't exist in the index, but
> I'm not sure Elasticsearch behaves that way when the field isn't even
> defined in the mapping. So users may have to upgrade their schema just for
> that. I know Elasticsearch integration is experimental anyway, but what I
> mean is if we do that, it must be *before* we drop the
> "experimental" mention on Elasticsearch integration.
Good point. Such proposals to change some internal field don't happen
very often though.
We strive to have a stable encoding, but since the index is not the
database, well-documented changes might be worth it.
Especially "private internal" fields should not be too hard to manage
as we can deal with them explicitly in some lenient way, and if they
don't contain end user state like in this case we don't even have to
require an index rebuild.
People who don't want this can either have a slower mass indexer or
do without recovery support.
Thanks,
Sanne
>
>
> Yoann Rodière
> Hibernate NoORM Team
> yoann(a)hibernate.org
>
> On 27 April 2017 at 15:59, Yoann Rodiere <yrodiere(a)redhat.com> wrote:
>>
>> I wonder, what's the benefit for HSEARCH-2616? Do you want to have that
>> field so that we can just use AddLuceneWorks everywhere, and run targeted
>> delete operations when we start a partition? If so, is it as a fallback
>> solution, if what I proposed cannot be implemented, or as a better
>> alternative? Note I don't have strong arguments against that solution, I'm
>> just trying to understand the "why".
>>
>> On adding a hidden field, I wonder what this will mean for Elasticsearch;
>> if we start doing such things, we should clearly and explicitly state in the
>> documentation that targeting existing ES schemas without adapting them to
>> Hibernate Search is not supported.
>> On top of that, it may hurt users upgrading Hibernate Search: Lucene may
>> simply ignore queries against a field that doesn't exist in the index, but
>> I'm not sure Elasticsearch behaves that way when the field isn't even
>> defined in the mapping. So users may have to upgrade their schema just for
>> that. I know Elasticsearch integration is experimental anyway, but what I
>> mean is if we do that, it must be *before* we drop the
>> "experimental" mention on Elasticsearch integration.
>>
>>
>> Yoann Rodière
>> Software Engineer, Hibernate NoORM Team
>> Red Hat
>> yrodiere(a)redhat.com
>>
>> On 27 April 2017 at 15:23, Sanne Grinovero <sanne(a)hibernate.org> wrote:
>>>
>>> To better implement recovery operations during MassIndexer
>>> [HSEARCH-2616] - specifically in the context of the upcoming JBatch
>>> based implementation - I'm considering the benefits of adding one more
>>> field to the Lucene index for our internal purposes.
>>>
>>> This new field is only useful for Hibernate Search internals so we
>>> shouldn't allow it to be targeted by queries, etc.
>>>
>>> There is a single precedent: we already encode the entity name, so
>>> "hiding fields" is not a new problem that we have to deal with. It
>>> might be a reason to polish the existing concept and improve the
>>> encapsulation.
>>>
>>> Would anyone have a strong case against this?
>>>
>>> Thanks,
>>> Sanne
>>> _______________________________________________
>>> hibernate-dev mailing list
>>> hibernate-dev(a)lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>
>>
>