[hibernate-dev] HSEARCH State to transfer

Emmanuel Bernard emmanuel at hibernate.org
Thu Aug 4 15:39:12 EDT 2011


>>> 1) Remember all operations are implicitly scoped to a single index, so
>>> for example you don't need to make a difference between
>>> Optimize(classType) and OptimizeAll, they will do the same: optimize
>>> the index.
>> 
>> That's a good point. Is that your last word though? Will we want to limit the number of messages passing through by grouping several backends under one message?
> 
> I'm 99% confident on it; do you foresee a good reason to re-shard /
> split again the index?
> [...]
> advantage of it. You're designing an upgradeable protocol right :P ?

I've removed Optimize.

>> So far I'm looking at MessagePack without much success. Documentation is sparse and they don't seem to support reading the version before the rest of the message.
> 
> Since MessagePack is "JSON-like", as far as I understood it should be
> able to always succeed in parsing, so we would need the version number
> only *after* it parsed the message to see how we can interpret it.
> Though I'm inclined to think that Proto Buffers is a better fit, if
> you're considering external libraries, as it helps with the problem of
> adding/removing data in different releases.

I gave up on MessagePack; the documentation is simply too sparse. But from my trials I don't think you are correct. The only way to make it read a byte[] was to have a corresponding object: you can't store the version first and then store the rest. Nor is the format self-documenting the way JSON is. I might be wrong; if someone wants to take another look, feel free.

Protocol Buffers requires a schema and class generation (i.e. it is not as flexible as JSON).

BERT might be a good candidate for what you want to achieve (http://bert-rpc.org/), but I don't think there is a Java implementation.

I went for Apache Avro, which I think will do what we want (though it's not a JSON-like model).
Avro requires a schema but has an API to dynamically read and use the data (based on a given schema), so we don't need to generate classes. Avro also has well-defined rules for a reader at version n receiving a message written by a writer at version m. For example:

- you can add enum values; as long as the messages don't use the new value, the reader will be able to parse and process them for n < m (and always for n > m)
- you can add new HSEARCH operations (a new element in a union); as long as the messages don't use the new operation, the reader will be able to parse and process them for n < m (and always for n > m)

So we could have soft forward compatibility.
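
To make the "no generated classes" point concrete, here is a rough sketch of the generic read/write path in Avro. The AddWork record and its fields are invented for the example, not the actual HSEARCH schema:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class AvroGenericSketch {

        // Made-up record for illustration only, not the actual HSEARCH schema.
        private static final String SCHEMA_V1 =
            "{\"type\":\"record\",\"name\":\"AddWork\",\"namespace\":\"hsearch\","
          + " \"fields\":["
          + "   {\"name\":\"entityClassName\",\"type\":\"string\"},"
          + "   {\"name\":\"id\",\"type\":\"string\"}"
          + " ]}";

        public static void main(String[] args) throws IOException {
            Schema writerSchema = new Schema.Parser().parse(SCHEMA_V1);

            // Build the record dynamically: no generated classes involved.
            GenericRecord work = new GenericData.Record(writerSchema);
            work.put("entityClassName", "org.example.Book");
            work.put("id", "42");

            // Serialize to a byte[].
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writerSchema).write(work, encoder);
            encoder.flush();
            byte[] bytes = out.toByteArray();

            // Deserialize: the reader takes both the writer's schema and its own,
            // and Avro resolves the two (this is where the n/m rules above apply).
            GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<GenericRecord>(writerSchema, writerSchema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            GenericRecord decoded = reader.read(null, decoder);
            System.out.println(decoded.get("entityClassName") + " / " + decoded.get("id"));
        }
    }

The interesting bit is that the reader is constructed with the writer's schema and the reader's schema, so the evolution rules listed above are applied at decode time without any generated code.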

For stronger breaks, we will need to add a version *before* the byte[] of serialized Avro data, i.e. `<version><avro bytes>`, so that we can tell whether we know how to read the schema or not. By keeping older versions of the schema around, we can read old but incompatible data and do our best.
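
A minimal sketch of that framing (the single version byte and the helper names are just illustrative, not the actual implementation):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    // Illustrative only: one leading version byte in front of the Avro payload,
    // so the receiver can pick the matching schema (or reject the message)
    // before trying to decode the Avro bytes.
    public final class VersionedPayload {

        public static byte[] wrap(int protocolVersion, byte[] avroBytes) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream(avroBytes.length + 1);
            out.write(protocolVersion); // one byte is plenty for a handful of protocol versions
            out.write(avroBytes);
            return out.toByteArray();
        }

        public static int readVersion(byte[] message) {
            return message[0] & 0xFF;
        }

        public static byte[] readAvroBytes(byte[] message) {
            byte[] payload = new byte[message.length - 1];
            System.arraycopy(message, 1, payload, 0, payload.length);
            return payload;
        }
    }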

You can have a look at http://github.com/emmanuelbernard/hibernate-search/tree/745, especially AvroTest (and in test/resource).
Note that I have not wired LuceneWork and Avro together yet, though all the structure is there.

Emmanuel




