[hibernate-dev] Hibernate Search 3.5 or 4

Tue Apr 26 06:45:41 EDT 2011

some late answers:

2011/4/21 Emmanuel Bernard <emmanuel at hibernate.org>:
> OK, if we want to do all of this we will have to start very quickly. In fairness, I'm not sure we can even do all of this, so let's make sure these are prioritized accordingly.
> I could not find the expected deadlines for AS 7 / Core 4 but we are probably talking about June here, i.e. very soon.
>
> Some more comments inline.
>
> On 20 avr. 2011, at 10:21, Sanne Grinovero wrote:
>
>> Hi,
>> About changing contracts, we don't get this chance very often so we
>> should make sure we don't miss any.
>> I have some favourites I'd like to discuss:
>>
>> - work list sent to backend
>> -- As you know Lucene dropped all guarantees about serializability,
>> supporting stuff like JMS requires a format change; especially the
>> NumericField is not working right now as it was never serializable
>> (HSEARCH-681)
>
> +1
>
>> -- Lucene is being more flexible about updates, I don't think we
>> should keep remapping an "update" operation as a delete+add operation,
>> but transmit the "update operation" and let the backend figure out
>> what's best.
>
> I guess we could do that. We need to make sure collection "updates" play well in that mix.

The urgent bit of the proposal is to add "update" as a supported
verb. There's no need to convert the collection updates away from
"delete+add" any time soon; I just want the contract to allow
improving on this later.
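
A rough sketch of what carrying "update" as a first-class verb could
look like. All type names below (Verb, IndexWork, Backend,
DeleteThenAddBackend) are hypothetical and are not existing Hibernate
Search API; the operations are modelled as plain strings to keep the
example free of Lucene dependencies. A backend that cannot update in
place expands the verb itself, while another one would be free to map
it to something like Lucene's IndexWriter.updateDocument(Term, Document).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are existing Hibernate Search API.
class UpdateVerbSketch {
    public static void main(String[] args) {
        List<IndexWork> queue = new ArrayList<>();
        queue.add(new IndexWork(Verb.ADD, "Author#7"));
        queue.add(new IndexWork(Verb.UPDATE, "Book#1"));
        // This backend has no native update, so it expands UPDATE itself.
        for (String op : new DeleteThenAddBackend().apply(queue)) {
            System.out.println(op);
        }
    }
}

// The verbs a work list may carry; UPDATE is the proposed addition.
enum Verb { ADD, DELETE, UPDATE }

// One unit of work sent to the backend: a verb plus a document id.
final class IndexWork {
    final Verb verb;
    final String documentId;

    IndexWork(Verb verb, String documentId) {
        this.verb = verb;
        this.documentId = documentId;
    }
}

interface Backend {
    // Returns the low-level operations performed, in order.
    List<String> apply(List<IndexWork> queue);
}

final class DeleteThenAddBackend implements Backend {
    @Override
    public List<String> apply(List<IndexWork> queue) {
        List<String> ops = new ArrayList<>();
        for (IndexWork work : queue) {
            switch (work.verb) {
                case ADD:
                    ops.add("add " + work.documentId);
                    break;
                case DELETE:
                    ops.add("delete " + work.documentId);
                    break;
                case UPDATE:
                    // The decision to fall back to delete+add is made here,
                    // in the backend, not by whoever produced the work list.
                    ops.add("delete " + work.documentId);
                    ops.add("add " + work.documentId);
                    break;
            }
        }
        return ops;
    }
}
```

The point of the design is only that the producer of the work list no
longer decides how an update is executed.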

>> - DirectoryProvider
>>  -- make a "DirectoryManager" instead, which is able to provide
>> factories for both IndexReaders and IndexWriters
>>  -- add utility methods like "getName()", wish I had that in some
>> cases to provide better error messages. This leads me to think that
>> instead of trying to foresee all needed methods, the extension point
>> should not be the DirectoryManager interface directly, but have people
>> plug in different aspects.
>
> That might be better also since it reduces the scope, it's easier to design the contract.
>
>> -- this is needed to support both Instantiated indexes and to make
>> good use of all new so called "Near-Real-Time" Lucene improvements.
>>
>> - ReaderProvider
>> -- (assuming such a thing would still exist): I think it would be
>> very nice if the responsibility of such a provider would be to provide
>> the IndexReader for a single index. Currently it has to provide a
>> "multiReader" on each different index, making some implementations
>> very tricky (seems I got it right in SharingBufferReaderProvider, but
>> I recently had some other interesting ideas which proved quite
>> daunting after a draft: take responsibility for the FieldCache expiry
>> directly, so that we can plug in different cache implementations,
>> control the lifecycle and be much smarter).
>
> ok, we might be able to do that in a 4.1 if need be.

Right, no need to do the new FieldCache integration now, but we'd need
to change the ReaderProvider API to work on a single index.
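
As a minimal sketch of that split, assuming a per-index extension
point and a shared aggregator: the names SingleIndexReaderProvider and
ReaderAggregator are made up for illustration, and the type parameter R
merely stands in for Lucene's IndexReader so the example compiles
without Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: R stands in for Lucene's IndexReader.
class SingleIndexReaderSketch {
    public static void main(String[] args) {
        ReaderAggregator<String> aggregator = new ReaderAggregator<>();
        aggregator.register("books", () -> "reader(books)");
        aggregator.register("authors", () -> "reader(authors)");
        System.out.println(aggregator.openAll(Arrays.asList("books", "authors")));
    }
}

// The extension point only ever knows about one index.
interface SingleIndexReaderProvider<R> {
    R openReader();
}

// Combining readers for a multi-index query lives in one shared place,
// so individual provider implementations stay simple.
final class ReaderAggregator<R> {
    private final Map<String, SingleIndexReaderProvider<R>> providers = new LinkedHashMap<>();

    void register(String indexName, SingleIndexReaderProvider<R> provider) {
        providers.put(indexName, provider);
    }

    List<R> openAll(List<String> indexNames) {
        List<R> readers = new ArrayList<>();
        for (String name : indexNames) {
            SingleIndexReaderProvider<R> provider = providers.get(name);
            if (provider == null) {
                throw new IllegalArgumentException("No reader provider for index: " + name);
            }
            readers.add(provider.openReader());
        }
        return readers;
    }
}
```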

>
>>
>> - backends and workers
>>  -- I'd like to make it possible to configure different backends per
>> index. Currently a backend is global, while in some (extreme) cases it
>> would have been handy to configure even single shards with different
>> backends. So really a backend should be something coupled to the
>> "DirectoryManager" mentioned before. Question is, at what level is
>> sharding going to work, it could work as a multiplexing
>> DirectoryManager.
>
> Can you remind us of the use case behind heterogeneous backends? There was one but I forgot.

It's mostly about performance details: the possibility to have
different entities configured with different requirements. For
example, one entity might have large indexes and use the rsync copy
algorithm via the master/slave directory providers, configured to sync
the index once per hour or day using async JMS as backend, while
another entity requires transactional synchronization over the cluster
and so might need a sync JMS backend with the Infinispan
DirectoryProvider. Currently people having such a requirement need to
configure everything as sync.
There was also a case in which people wanted to use a sharding
strategy on top of this, to have some shards in high priority for the
same entity; one corner use case even wanted to have a shard policy
including a blackhole backend as one of the shards.
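
To illustrate the lookup order this implies (shard, then index, then
global default), here is a small sketch. BackendResolver and the key
format "Index.shardNumber" are assumptions made up for this example,
and the backend names are just labels; nothing here is an existing
configuration contract.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-index / per-shard backend selection.
class PerIndexBackendSketch {
    public static void main(String[] args) {
        BackendResolver resolver = new BackendResolver("lucene");
        resolver.configure("Book", "jms-async");     // large index, lazy sync
        resolver.configure("Account", "jms-sync");   // needs sync propagation
        resolver.configure("Account.2", "blackhole"); // one shard is discarded
        System.out.println(resolver.backendFor("Account", 2));
        System.out.println(resolver.backendFor("Account", 0));
        System.out.println(resolver.backendFor("Author", 0));
    }
}

final class BackendResolver {
    private final String defaultBackend;
    private final Map<String, String> overrides = new HashMap<>();

    BackendResolver(String defaultBackend) {
        this.defaultBackend = defaultBackend;
    }

    // key is either "IndexName" or "IndexName.shardNumber"
    void configure(String key, String backend) {
        overrides.put(key, backend);
    }

    // Most specific setting wins: shard, then index, then global default.
    String backendFor(String indexName, int shard) {
        String perShard = overrides.get(indexName + "." + shard);
        if (perShard != null) {
            return perShard;
        }
        return overrides.getOrDefault(indexName, defaultBackend);
    }
}
```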

>
>>
>> -- defaults to change:
>> - remove the notions of transactional / batch IndexWriter setting,
>> was deprecated since long enough.
>
> ok easy
>
>> - make the FullTextIndexEventListener final (people still extend and replace
>> it to better control when an entity is to be indexed, but I hope we
>> can solve that as well)
>
> Well it will be in a private package anyways
>
>> - default to NumericField for numeric properties
>> - set exclusive_index_use=true by default, benefits are far too high
>> and some optimizations I was thinking of are impossible if this is
>> disabled.
>
> I'm not sure I agree with that. It seems that such a default would bite a non careful user too easily.

How would it bite? It's not going to disable the index locking. And
the Near-Real-Time features of the latest Lucene require the
IndexWriter to be always open; this feature is so great for the way
Hibernate Search uses Lucene that it's sad we don't support it yet.

>
>>
>> -- bridges
>> - It happened many times that we couldn't do X or optimize Y as "user
>> bridge might read/write any field"; I think we should stop exposing
>> the o.a.lucene.Document - especially since we change the format of
>> messages to the backend - and make sure to expose something as good
>> and as flexible. Need some thinking on this: we can't expose Document
>> but we want to make sure people won't ever miss advanced features for
>> which such a bridge was a nice "advanced api". Or we split the
>> concepts, having a less-powerful API and a more advanced one, which
>> could be named, and operate on the Document itself but inside the
>> backend rather than in the DocumentBuilder (so the name could be used
>> in the message to the backend to point to some transformer to apply
>> for final touches - it could be a customization of the implementation
>> which applies the message in our own format to the
>> o.a.lucene.Document)
>
> I don't think I follow you, can you expand on what you think.
> BTW I'm a bit concerned about the "serializability" of what would be needed to be passed around if you move FieldBridge operations in the backend.

These are really two different aspects:
1) Let people keep the flexibility of custom bridges. Because we
won't expose the Document directly, we'll need to expose something
that is a good replacement for it, both because of the serialization
issues and to be able to better "inspect" what bridges do. I have no
specific idea right now, but I'm sure we'll be able to play some
trick at this level.

2) No need to define the API now, but it might be useful for special
cases to still customize the "add to Document" aspect. About
serializability of these components, I'd see a good fit in doing as
you did with analyzers: give them a name, rebuild the component on
the other side of the wire and refer to them by name. I don't think
this is a priority, but it's what we should do in case approach 1)
doesn't prove flexible enough for some use case I'm not aware of now.
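
A minimal sketch of the "refer to them by name" idea in 2), under the
assumption that the message carries only plain field data plus a
transformer name, and that the receiving side holds a registry. The
names WireMessage and TransformerRegistry are invented for this
example, and a Map of strings stands in for the real document.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: only a name crosses the wire, never the component.
class NamedTransformerSketch {
    public static void main(String[] args) {
        TransformerRegistry registry = new TransformerRegistry();
        registry.register("markReviewed", in -> {
            Map<String, String> copy = new HashMap<>(in);
            copy.put("reviewed", "true");
            return copy;
        });
        Map<String, String> fields = new HashMap<>();
        fields.put("title", "some title");
        System.out.println(registry.applyTo(new WireMessage(fields, "markReviewed")));
    }
}

// What gets serialized: field data and the name of a transformer.
final class WireMessage implements Serializable {
    private static final long serialVersionUID = 1L;
    final HashMap<String, String> fields;
    final String transformerName; // null means "no final touches"

    WireMessage(Map<String, String> fields, String transformerName) {
        this.fields = new HashMap<>(fields);
        this.transformerName = transformerName;
    }
}

// Lives on the receiving side; components are rebuilt there and looked up by name.
final class TransformerRegistry {
    private final Map<String, UnaryOperator<Map<String, String>>> byName = new HashMap<>();

    void register(String name, UnaryOperator<Map<String, String>> transformer) {
        byName.put(name, transformer);
    }

    Map<String, String> applyTo(WireMessage message) {
        if (message.transformerName == null) {
            return message.fields;
        }
        UnaryOperator<Map<String, String>> transformer = byName.get(message.transformerName);
        if (transformer == null) {
            throw new IllegalStateException("Unknown transformer: " + message.transformerName);
        }
        return transformer.apply(message.fields);
    }
}
```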

>> - at some point, we'll need to track also which entity properties are
>> being "read" by a custom ClassBridge/DynamicBoost, to better check for
>> index dirtiness. Might be done by proxying the entity, or just having
>> the implementation declare by which properties it's affected: in this
>> case, an API change is needed but this can possibly be postponed.
>
> Proxying does not solve all use cases. If a user has a transient getter that reads data from two other getters, you don't get that info via proxying.

Right; explicit user declarations then, at least optionally.
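
One possible shape for such a declaration, as a sketch only: the
interface DeclaresReadProperties and the classes FullNameBridge and
DirtyChecker are hypothetical names, not a proposed API. The bridge
states which entity properties it reads, and reindexing is needed only
if one of them is dirty.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of explicit "which properties do I read" declarations.
class DeclaredDependenciesSketch {
    public static void main(String[] args) {
        FullNameBridge bridge = new FullNameBridge();
        Set<String> dirty = new HashSet<>(Arrays.asList("lastName"));
        System.out.println(DirtyChecker.needsReindex(bridge, dirty));
    }
}

// Implemented (optionally) by a custom class bridge or dynamic boost.
interface DeclaresReadProperties {
    Set<String> readProperties();
}

// Stands in for a bridge backed by a transient getter that combines
// two other getters, the case proxying cannot detect.
final class FullNameBridge implements DeclaresReadProperties {
    @Override
    public Set<String> readProperties() {
        return new HashSet<>(Arrays.asList("firstName", "lastName"));
    }
}

final class DirtyChecker {
    // Reindex only if a declared property is among the dirty ones.
    static boolean needsReindex(DeclaresReadProperties bridge, Set<String> dirtyProperties) {
        return !Collections.disjoint(bridge.readProperties(), dirtyProperties);
    }
}
```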

updating the wiki now.

Sanne

>>
>> this is just off the top of my head, I'm sure I forgot to break some
>> interface ;)
>> I'll give you some time to think about it, then I'll insert the
>> proposals which survived in the wiki & JIRA.
>> (needless to say, no objections on your proposals)
>>
>> Cheers,
>> Sanne
>>
>>
>> 2011/4/20 Emmanuel Bernard <emmanuel at hibernate.org>:
>>> Hi,
>>>
>>> We have had in our road map a Hibernate Search 3.5 before Hibernate 4. Hibernate 4 is the release where the following should happen:
>>>  - split packages into API, SPI and private packages
>>>  - use JBoss Logging
>>>  - be compliant with Core 4
>>>  - break whatever contract we need to break to open up the future
>>>  - split dependency between the core of Hibernate Search and Hibernate Core
>>>
>>> Do you see more tasks for 4?
>>>
>>> Since Hibernate Core 4 seems to be doing alright and the time pressure to get Hibernate Search aligned will be strong, I propose to skip 3.5 entirely and focus on 4. We did not have that many new features planned anyway for 3.5; it was more of a consolidation release.
>>>
>>> Even with skipping 3.5, the 4 release will be a lot of work. We should start early. Any objection or comment?
>>>
>>> Changing contracts
>>> We have had a few contracts that we wanted to change to make way for future improvements:
>>>  - should a bridge know about the field it changes (make the optimization more efficient)
>>>  - rework the backend to let IndexReader and IndexWriter communicate
>>>  - rework the backend to support instantiated IndexReaders
>>>
>>> Can you help collect the list of changes you would like to see happening?
>>>
>>> I would like to get this work started ASAP; this is really the unknown quantity and we tend to be slow to converge on these things.
>>>
>>> Split packages in API/SPI/private packages
>>> Hibernate 4 is the ideal time to properly split stuff into API, SPI, private. Moving classes to private packages is the least impacting move for users as these should not be used. The API / SPI split is sometimes difficult to do so if you have a doubt in an area, ask on the ML or on IRC and we can discuss it together. If you need an example, check out the query engine. It is relatively clean now.
>>>
>>> We might have to break a few user APIs which is fine but I don't expect too many will be necessary:
>>>  - make sure to discuss it when you plan to do one
>>>  - list them in the migration guide
>>>
>>> I'd say that the package splitting should be done when you have a change and when you work in a specific area. It's more a background task.
>>>
>>> Be compliant with Core 4
>>> We can do this one a bit later in the cycle to give time for core to mature.
>>>
>>> Split dependency between Hibernate Search and Hibernate Core
>>> I think in practice we are not too far. This work should be done in parallel to the package splitting. If you look at the query engine, we do have specific hibernate packages. We also have a HibernateHelper class for all low-level Hibernate contracts like unproxying, initializing etc. We should use that class everywhere instead of relying on the direct Hibernate Core contracts. That will help us to turn this class into an implementable contract.
>>> The next step potentially is to actually move Hibernate Core specific code into a separate package.
>>>
>>> I don't have much opinion on this but we should definitely discuss it.
>>>
>>> Use JBoss Logging
>>> I tend to think we should do this migration late in the game. WDYT?
>>>
>>> New features
>>> Do you want any new feature per se? I think this would be a great time to get the community involved to back new features and fix bugs while we do the grunt work for 4. So if you know some shy people motivated or if you are one of them, stand up :)
>>>
>>> Note: I have created a vague copy of this email in http://community.jboss.org/wiki/PlansforHibernateSearch4
>>> We can discuss via email but be sure to add the feedback or list of todos in the wiki as well for posterity.
>>> _______________________________________________
>>> hibernate-dev mailing list
>>> hibernate-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/hibernate-dev
>>>
>
>