Hibernate Search Metadata API: Numeric and other special Field types in Hibernate Search
by Sanne Grinovero
The new FieldSettingsDescriptor [1] has a couple of methods meant for
Numeric fields:
    /**
     * @return the numeric precision step in case this field is indexed as
     * a numeric value. If the field is not numeric {@code null} is returned.
     */
    Integer precisionStep();

    /**
     * @return {@code true} if this field is indexed as numeric field,
     * {@code false} otherwise
     *
     * @see #precisionStep()
     */
    boolean isNumeric();
Today we have specific support for the
org.apache.lucene.document.NumericField type from Lucene, so these
methods are reasonable (and needed to build queries), but this specific
kind is being replaced by a more general purpose encoding, so that you
don't have "just" NumericField but can have a wide range of special fields.
So for simplicity it would make sense today to expose these methods
directly on the FieldSettingsDescriptor, as that's convenient for our
users; but then #isNumeric() is also needed, since not all fields are
numeric: we're adding these extra methods to accommodate the needs of
some special cases.
Considering that we might get more "special cases" with Lucene 4, and
that they will probably have different options, would we be able to
both decouple from these specific options and still expose the needed
precisionStep?
I won't mention my favorite pattern. I've considered adding subtypes,
but I don't like it as their usage would not be clear from the API.
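Despite that concern, here is a minimal sketch of what an explicit narrowing could look like. All names in it (as(), NumericFieldSettingsDescriptor) are illustrative assumptions, not the actual Hibernate Search API: the idea is that the base descriptor stays free of kind-specific options, and callers opt in to the numeric view explicitly.

```java
// Illustrative sketch only: NOT the real Hibernate Search API,
// just one way the decoupling could look.
public class MetadataSketch {

    interface FieldSettingsDescriptor {
        String name();
        // Narrow to a kind-specific view, or fail if the field is not of that kind
        <T extends FieldSettingsDescriptor> T as(Class<T> kind);
    }

    // Numeric-specific options live on the subtype, not on the base interface
    interface NumericFieldSettingsDescriptor extends FieldSettingsDescriptor {
        int precisionStep();
    }

    static class NumericFieldDescriptor implements NumericFieldSettingsDescriptor {
        private final String name;
        private final int precisionStep;

        NumericFieldDescriptor(String name, int precisionStep) {
            this.name = name;
            this.precisionStep = precisionStep;
        }

        public String name() { return name; }
        public int precisionStep() { return precisionStep; }

        public <T extends FieldSettingsDescriptor> T as(Class<T> kind) {
            if ( !kind.isInstance( this ) ) {
                throw new IllegalArgumentException( name + " is not a " + kind.getSimpleName() );
            }
            return kind.cast( this );
        }
    }

    public static void main(String[] args) {
        FieldSettingsDescriptor field = new NumericFieldDescriptor( "price", 4 );
        // The caller states its expectation explicitly:
        int step = field.as( NumericFieldSettingsDescriptor.class ).precisionStep();
        System.out.println( step ); // prints 4
    }
}
```

The narrowing call makes the caller's expectation explicit, so adding new special field kinds later wouldn't widen the base interface.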
Cheers,
Sanne
1 - as merged two minutes ago
11 years, 4 months
Design: HSEARCH-1032 MassIndexer with a live update mechanism
by Sanne Grinovero
Current priorities on Search are:
- Infinispan IndexManager -> me
- Metadata API -> Hardy
- Multitenancy (aka dynamic Sharding) -> me + Emmanuel + Dimitrios
Those are all important as they represent hard requirements for other
projects, but I'd also like to consider at least the basic design for
how the MassIndexer could operate in "update mode": a highly requested
mode in which it re-synchronizes the index with the database, but
without first wiping out the index - the wipe creates a window in time
in which the application's query results are not complete.
# Reminder on current design:
1- deletes the current index
2- scrolls on all entities and uses ADD index operations to add them all again
There are two basic approaches on the table (other ideas welcome):
- #A Use UPDATE index operations instead, skipping the initial delete
- #B Rebuild the index in a secondary directory, then switch
Let's explore them:
#A Use UPDATE index operations instead, skipping the initial delete
## what
Technically an UPDATE operation is - in Lucene terms - an atomic
(delete+add); the benefit is that each query will see either the
previous document or the updated one: there is no possibility that the
document is missed, since the changes cannot be flushed between the
delete and the add operation.
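In plain Lucene API terms, the UPDATE strategy maps to IndexWriter#updateDocument(Term, Document), while the non-atomic alternative would be separate deleteDocuments(Term) and addDocument(Document) calls. A toy model of the visibility difference (pure Java; the map stands in for the index, this is not Lucene code):

```java
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: the map stands in for the index, keyed by entity id.
public class UpdateSemantics {
    static final ConcurrentHashMap<String, String> index = new ConcurrentHashMap<>();

    // delete + add: between the two steps a concurrent reader finds nothing
    static void deleteThenAdd(String id, String doc) {
        index.remove( id );
        // <-- a query served here misses the document entirely
        index.put( id, doc );
    }

    // update: one atomic replacement per key, analogous to Lucene's
    // IndexWriter#updateDocument(Term, Document) - readers see the old
    // or the new document, never a gap
    static void update(String id, String doc) {
        index.put( id, doc );
    }

    public static void main(String[] args) {
        index.put( "42", "old" );
        update( "42", "new" );
        System.out.println( index.get( "42" ) ); // prints new
    }
}
```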
## performance
The reason the current design deletes all elements at the start of the
process is that this is a very efficient operation: it targets a
single term (the class name field) or, in some cases, the whole index,
so it just needs to delete all segment files.
When doing a delete operation per document instead of per class, each
delete very likely targets multiple terms (which is not efficient at
all, as it needs IO to seek across multiple disk positions), and of
course the worst part is that it triggers a delete operation for each
and every entity. To compare: a single ADD doesn't need any disk seek,
as we can pack multiple operations into one - until the buffer is full
- but every single delete requires N disk seeks (N is not directly the
number of fields, but is proportional to it).
Based on this, and on experience benchmarking the #index() method, I'm
expecting the UPDATE strategy to be approximately a thousand times
slower than the current MassIndexer implementation.. considering that
for some users it already takes a couple of hours, going to 2000 hours
is maybe not an option :-) (that's about 3 months)
## left over entries
Another problem is that if we scroll over all entities from the
database, we fail to delete those documents in the index which no
longer have a matching entity.
So we would need a final phase running the inverse iteration: for each
element in the index, verify whether there is a match in the database;
that sounds like an awful lot of queries, even if we batch it into
verification blocks.
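Batched, that verification phase could look roughly like this (a pure-Java sketch; the lookup callback stands in for an IN(...) query against the database, and the batch size is an arbitrary assumption):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class StalePurgeSketch {

    // Returns the ids present in the index but no longer in the database,
    // issuing one batched database lookup per block of ids.
    static List<String> staleIds(List<String> indexedIds,
                                 Function<List<String>, Set<String>> existingInDb,
                                 int batchSize) {
        List<String> stale = new ArrayList<>();
        for ( int i = 0; i < indexedIds.size(); i += batchSize ) {
            List<String> block = indexedIds.subList( i, Math.min( i + batchSize, indexedIds.size() ) );
            Set<String> found = existingInDb.apply( block ); // stands in for one IN(...) query
            for ( String id : block ) {
                if ( !found.contains( id ) ) {
                    stale.add( id ); // no database match: document must be deleted
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Set<String> database = new HashSet<>( Arrays.asList( "1", "2", "4" ) );
        List<String> index = Arrays.asList( "1", "2", "3", "4", "5" );
        List<String> stale = staleIds( index, block -> {
            Set<String> s = new HashSet<>( block );
            s.retainAll( database );
            return s;
        }, 2 );
        System.out.println( stale ); // prints [3, 5]
    }
}
```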
Bottom line: it looks messy.
#B Rebuild the index in a secondary directory, then switch
## performance
No big concerns, but we assume there is enough disk space for at least
four times the size of the index (compacting an index can temporarily
require twice its size, and we have two indexes to manage).
## design
The good part is that we can reuse most of the existing MassIndexer;
but transactional changes (those applied by the application during a
reindexing) need to be redirected to both indexes: applied to the one
in use until the rebuild is complete, so that queries stay consistent,
and also enqueued for the one being built, so that they don't get lost
in case they apply to documents which have already been indexed. The
queue handling is tricky, because in that case further additions
actually need to be updates, unless we can keep them on hold in a
buffer to be applied to the pristine index: that could take quite some
memory, depending on the amount of changes flying in during the mass
indexing. If the queue grows beyond reason we'll need to either apply
backpressure on the transactions, or offload to disk, or switch to an
update strategy for the remaining mass indexing process.. none of
these is desirable, but I guess people could tune to make this
condition unlikely.
## SPI changes
With this design we need to be able to:
- dynamically instantiate a second Directory in a different path
- switch to delegate writes to both directories / one directory
- control from where Readers are opened
- make sure closed Readers are returned to the pool they came from, as
the active reference source could have changed in the meantime
- be able to switch (permanently) to a different active index
- destroy old index
I'm afraid each of these can affect our SPIs; likely at least
IndexManager. I hope we can keep all the logic in "behind the scenes"
code which drives the same SPIs as today, but I'd need a POC to
verify this.
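A toy sketch of the "write to both" phase (the Backend interface and all names here are illustrative, not SPI proposals): every operation is applied to the live index so queries stay consistent, and to the rebuilding index so it isn't lost.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch only: fans every write out to two delegates.
public class FanOutBackend {
    interface Backend {
        void apply(String operation);
    }

    static class RecordingBackend implements Backend {
        final List<String> applied = new ArrayList<>();
        public void apply(String operation) { applied.add( operation ); }
    }

    static class FanOut implements Backend {
        private final Backend live;
        private final Backend rebuilding;
        FanOut(Backend live, Backend rebuilding) {
            this.live = live;
            this.rebuilding = rebuilding;
        }
        public void apply(String operation) {
            live.apply( operation );       // queries on the old index stay consistent
            rebuilding.apply( operation ); // the change is not lost on the new index
        }
    }

    public static void main(String[] args) {
        RecordingBackend live = new RecordingBackend();
        RecordingBackend rebuilding = new RecordingBackend();
        Backend backend = new FanOut( live, rebuilding );
        backend.apply( "UPDATE Person#1" );
        System.out.println( live.applied.equals( rebuilding.applied ) ); // prints true
    }
}
```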
## Directory index path
If we switch from one Directory to another - thinking about the
FSDirectory - we're either violating the path configuration options
from the user, or we need to move the new index into the configured
position when done. Even if the above sounds a bit complex, I'm
actually more concerned about implementing such an atomic move on the
filesystem.
I guess we could agree that if the user configured an index to be in -
say - "/var/lucene/persons", we could store the indexes in
"/var/lucene/persons/index-a" and "/var/lucene/persons/index-b",
alternating in a similar way to the FSMasterDirectoryProvider; but
that takes away some control over the index position and is not
backwards compatible. Would this be acceptable?
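The alternating layout could be as simple as a generation counter (the directory names are the ones proposed above; keeping a generation number, and how it would be persisted, is an assumption):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the alternating directory layout: even generations live in
// index-a, odd ones in index-b, similar in spirit to how
// FSMasterDirectoryProvider alternates between two copies.
public class AlternatingIndexPath {

    static Path activePath(Path configuredBase, long generation) {
        return configuredBase.resolve( generation % 2 == 0 ? "index-a" : "index-b" );
    }

    public static void main(String[] args) {
        Path base = Paths.get( "/var/lucene/persons" );
        System.out.println( activePath( base, 0 ) ); // generation 0 -> .../index-a
        System.out.println( activePath( base, 1 ) ); // generation 1 -> .../index-b
    }
}
```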
# Timeline
This might need to be moved to 5.0 because of the various backwards
compatibility concerns - ideally, if some community user feels like
participating, we could share some early code in experimental branches
and work together.
Comments and better ideas welcome :)
Sanne
[OGM] Embedded MongoDB for tests
by Gunnar Morling
Hi all,
I just came across "EmbedMongo" [1], which provides a way to run
MongoDB embedded within an application. This is convenient e.g. for
tests, as it doesn't require a separately installed MongoDB instance.
I've tried it out with a single test and it worked as expected.
Unfortunately MongoDB (the server) can't be retrieved as a Maven
dependency; EmbedMongo thus retrieves the distribution via HTTP and
stores it in ~/.embedmongo/. This only happens once, during the first
usage.
What do you think, would it be helpful to use this for the OGM MongoDB
tests (it might well be that this or similar options have been
discussed before and I just missed that)?
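For reference, wiring EmbedMongo into a test looks roughly like this. This is adapted from memory of the project README; exact class names vary across EmbedMongo versions, so treat the whole snippet as an approximation rather than the definitive API, and the port number is arbitrary:

```java
import de.flapdoodle.embed.mongo.MongodExecutable;
import de.flapdoodle.embed.mongo.MongodProcess;
import de.flapdoodle.embed.mongo.MongodStarter;
import de.flapdoodle.embed.mongo.config.MongodConfigBuilder;
import de.flapdoodle.embed.mongo.config.Net;
import de.flapdoodle.embed.mongo.distribution.Version;

public class EmbeddedMongoTestSupport {
    public static void main(String[] args) throws Exception {
        // Downloads and caches the distribution in ~/.embedmongo/ on first use
        MongodStarter starter = MongodStarter.getDefaultInstance();
        MongodExecutable executable = starter.prepare( new MongodConfigBuilder()
                .version( Version.Main.PRODUCTION )
                .net( new Net( 27117, false ) ) // arbitrary test port, IPv4
                .build() );
        MongodProcess mongod = executable.start();
        try {
            // ... run the OGM tests against localhost:27117 ...
        }
        finally {
            mongod.stop();
            executable.stop();
        }
    }
}
```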
--Gunnar
[1] https://github.com/flapdoodle-oss/embedmongo.flapdoodle.de
Marking API members as incubating
by Gunnar Morling
Hi,
Hardy and I have been musing about how to mark new API members (methods,
classes etc.) which are still incubating or experimental.
Of course we have Alpha and Beta releases etc., but there can be cases
where it makes sense to ship new functionality with a final release
and still leave the door open for refinements in the next release,
based on user feedback.
So basically we're looking for a way to inform the user and say "it's
ok to use this API, but be prepared for changes in the future". One
way to do this is documentation, i.e. prose or a custom JavaDoc tag
such as @experimental. This has been done in HSEARCH before.
Alternatively, a Java 5 annotation could be used, which I'd personally
find advantageous for the following reasons:
* With an annotation, the generated JavaDoc gives you a list of all
incubating members out of the box; see e.g. the Guava docs for an
example [1].
* For an annotation we can provide proper documentation in the form of
JavaDoc, i.e. the user of the API can inspect the docs of @Incubating
from within the IDE and learn about the rules behind it. For a tag, a
user would only see the specific comment of a given instance.
* An annotation is more tool-friendly: e.g. a user could easily find
all references to @Incubating in her IDE, or even write an annotation
processor or a custom CheckStyle rule issuing a build warning when an
incubating member is used.
Such an annotation would have a retention level of SOURCE, similar to
other documenting annotations such as @Generated.
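A minimal sketch of what such an annotation could look like; the name, targets and comments are suggestions, not a settled design (in the real API it would of course be a public type in its own file):

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Marks an API member as incubating: it's ok to use it, but be
 * prepared for changes in future releases based on user feedback.
 */
@Documented // listed in the generated JavaDoc of annotated members
@Retention(RetentionPolicy.SOURCE) // documentation only, like @Generated
@Target({ ElementType.TYPE, ElementType.METHOD, ElementType.FIELD,
        ElementType.CONSTRUCTOR, ElementType.PACKAGE })
@interface Incubating {
}
```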
Any thoughts?
--Gunnar
[1]
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/...
Sybase BLOB loading errors
by Sanne Grinovero
We have had the below error reported for quite a while in the
Hibernate Search testsuite, when run on Sybase.
I remember that when I initially noticed it, someone told me it was a
known problem in ORM, but I didn't track down the JIRA issue. Does
someone know which one it is, please?
TiA
Sanne
Error Message
The method com.sybase.jdbc4.jdbc.SybCursorResultSet.getBlob(String) is
not supported and should not be called.
Stacktrace
java.lang.UnsupportedOperationException: The method
com.sybase.jdbc4.jdbc.SybCursorResultSet.getBlob(String) is not
supported and should not be called.
at com.sybase.jdbc4.jdbc.ErrorMessage.raiseRuntimeException(Unknown Source)
at com.sybase.jdbc4.utils.Debug.notSupported(Unknown Source)
at com.sybase.jdbc4.jdbc.SybResultSet.getBlob(Unknown Source)
at org.hibernate.type.descriptor.sql.BlobTypeDescriptor$1.doExtract(BlobTypeDescriptor.java:64)
at org.hibernate.type.descriptor.sql.BasicExtractor.extract(BasicExtractor.java:64)
at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:261)
at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:257)
at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:247)
at org.hibernate.type.AbstractStandardBasicType.hydrate(AbstractStandardBasicType.java:332)
at org.hibernate.persister.entity.AbstractEntityPersister.hydrate(AbstractEntityPersister.java:2912)
at org.hibernate.loader.Loader.loadFromResultSet(Loader.java:1673)
at org.hibernate.loader.Loader.instanceNotYetLoaded(Loader.java:1605)
at org.hibernate.loader.Loader.getRow(Loader.java:1505)
at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:713)
at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:683)
at org.hibernate.loader.Loader.loadSingleRow(Loader.java:379)
at org.hibernate.internal.ScrollableResultsImpl.prepareCurrentRow(ScrollableResultsImpl.java:240)
at org.hibernate.internal.ScrollableResultsImpl.next(ScrollableResultsImpl.java:117)
at org.hibernate.search.test.bridge.tika.TikaBridgeBlobSupportTest.indexBook(TikaBridgeBlobSupportTest.java:128)
at org.hibernate.search.test.bridge.tika.TikaBridgeBlobSupportTest.testDefaultTikaBridgeWithBlobData(TikaBridgeBlobSupportTest.java:74)
Re: [hibernate-dev] Deprecating configurability of "hibernate.search.worker.scope" ?
by Sanne Grinovero
On 11 July 2013 15:29, Hardy Ferentschik <hardy(a)hibernate.org> wrote:
> I find them confusing as well and cannot think of an actual use case.
> I assume you are removing the hibernate.search.worker.* settings as well, right?
Partly: we still need the options which apply to our "one and only"
implementation, the TransactionalWorker.
To be clear, those defined in
Table 3.3. Execution configuration
should stay.
Right?
Sanne
>
> if so +1
>
> --Hardy
>
>
> On 11 Jan 2013, at 4:04 PM, Sanne Grinovero <sanne(a)hibernate.org> wrote:
>
>> I'm wondering if this property is really useful. Does someone have a
>> practical example in which they would need it?
>>
>> http://docs.jboss.org/hibernate/search/4.3/reference/en-US/html_single/#t...
>>
>> I'm tempted to deprecate it and remove the description from the
>> documentation, as so far I've only seen people asking for
>> clarifications about it, only to then conclude they don't need it (or,
>> more likely, they didn't understand it and decided to stay away from it).
>>
>> We could technically leave the loading code in place, it's just the
>> documentation which is troubling me.
>>
>> Cheers,
>> Sanne
>> _______________________________________________
>> hibernate-dev mailing list
>> hibernate-dev(a)lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>