[Search] Serialization issues of Lucene Work

Friday, 29 July 2011

####  background ####

Search used to be split in two main components:

  engine ---> indexing backend

- there was a contract (public API) between the two so that indexing
backends could be replaced

- cluster configuration needed the "-->" invocation to be replaced by
an RPC, therefore forcing the parameters required by this public API
to be Serializable.

These parameters are essentially a List<LuceneWork>, where a
LuceneWork instance contains some simple primitives and an instance of
org.apache.lucene.document.Document.
This Document *used* to be Serializable, and so this worked fine, with
the minor inconvenience that we could not add more properties to
LuceneWork without introducing some painful class incompatibilities in
clustered deployments.

The Lucene project decided that maintaining the guarantees of
implementing Serializable is too much of a burden, and in fact the
NumericField has never been Serializable, hence this bug is open on
Search since we introduced NumericField support:

HSEARCH-681 - NotSerializableException when NumericField gets
serialized in JMSBackendQueueProcessor

#### new architecture ####

In Hibernate Search 4 there is an additional level of indirection in
the actual communication; it looks like this:

  engine ---> index manager --> backend

and both components are replaceable; in fact you could plug in an
IndexManager which deals with backends in a totally different way, so
the RPC channel can use a different format which is not mandated by
the API (the second indirection still defines an interface, and by
using that you can reuse a larger set of provided components, but you
don't have to).

Example: an Infinispan IndexManager would not use a standard backend
but rather use Infinispan's own communication channels to send write
operations to the index. It could still use a JMS backend by
assembling the existing components.

#### the problem ####

So we still have to find a way to serialize the Document instances, tracked by

HSEARCH-757 - Explicitly control binary format of communication with
the backend

I started this mail from the architecture to clarify that we don't
need to replace the API making use of LuceneWork instances, which is
doing a pretty good job (and is not necessarily the final API for v.
4.0).
We also don't need to mandate a specific binary format, as this could
be a detail left to different backends; but certainly all
implementations would need to deal with this, so we need a helper
service which could be reused by JMS backends, JGroups, Infinispan,
possibly others.
As soon as we have such a toy, implementing a new Infinispan
IndexManager is going to be pretty easy, so I'm looking forward to
this as a great means to simplify configuration (and have it working
with NumericFields); it's also possible that other fields in the
Lucene implementation might drop Serializable soon.

# Solution option A)

Code a new utility from scratch which provides this bi-directional
transformation:
List<LuceneWork> <--> byte[]

Pros:
 - flexible, lovely do-it-yourself with no dependencies.

Cons:
 - since Lucene doesn't want to care about Serializable, it's possible
that they will sneak in new fields / different fields without notice
in minor releases. This is going to need excellent tests as it
requires manual code inspection and will become a maintenance overhead
(more than usual).
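
To make option A concrete, here is a minimal sketch of the
List<LuceneWork> <--> byte[] transformation. It does not touch Lucene at
all: SimpleWork and WorkListCodec are hypothetical names, and the
name/value pairs stand in for whatever would really be extracted from a
Document. The leading version number is also an assumption, added so
that a receiver can reject a stream written in a format it does not know.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for LuceneWork: a work type, an entity id and
// name/value pairs in place of the fields of a Lucene Document.
final class SimpleWork {
    final String type;
    final String entityId;
    final List<String[]> fields; // each entry is {name, value}

    SimpleWork(String type, String entityId, List<String[]> fields) {
        this.type = type;
        this.entityId = entityId;
        this.fields = fields;
    }
}

final class WorkListCodec {
    // Written first so a reader can reject a stream it does not understand.
    static final int FORMAT_VERSION = 1;

    static byte[] toBytes(List<SimpleWork> works) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(FORMAT_VERSION);
            out.writeInt(works.size());
            for (SimpleWork work : works) {
                out.writeUTF(work.type);
                out.writeUTF(work.entityId);
                out.writeInt(work.fields.size());
                for (String[] field : work.fields) {
                    out.writeUTF(field[0]); // field name
                    out.writeUTF(field[1]); // field value
                }
            }
        }
        return buffer.toByteArray();
    }

    static List<SimpleWork> fromBytes(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != FORMAT_VERSION) {
                throw new IOException("Unsupported format version: " + version);
            }
            int workCount = in.readInt();
            List<SimpleWork> works = new ArrayList<>();
            for (int i = 0; i < workCount; i++) {
                String type = in.readUTF();
                String entityId = in.readUTF();
                int fieldCount = in.readInt();
                List<String[]> fields = new ArrayList<>();
                for (int j = 0; j < fieldCount; j++) {
                    String name = in.readUTF();
                    String value = in.readUTF();
                    fields.add(new String[] {name, value});
                }
                works.add(new SimpleWork(type, entityId, fields));
            }
            return works;
        }
    }
}
```

Note that writeUTF caps each string at 65535 encoded bytes, so a real
implementation would need a different encoding for large field values.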

# Solution option B)

Use JBoss Marshaller to implement the same. We will likely still need
to write the details of how to externalize specific Lucene classes,
but it's supposed to provide many high performance helpers.

Pros:
 - via Infinispan we already depend on this, but this applies only to
the hibernate-search-infinispan module.
 - when Lucene changes class format, it will help to deal with it as
it adapts to the class definition (we might notice better).

Cons:
 - will add more dependencies to hibernate-search-core, or we split
out all the support for clustering in sub modules.
 - while it adapts to the class format, produced byte[] streams will
be incompatible; we can deal with this by storing example streams in
constants and using them in tests.

# Solution option C)

Don't serialize the Document at all, but send over only the metadata
we need encoded in a different ad hoc structure.
## Solution option C+JBM)
Even doing so, we could optionally introduce JBoss Marshaller to avoid
slow Java serialization.

Pros:
 - better isolation from Lucene changes

Cons:
 - slower "time to market" to expose new Lucene features: until we add
it, people won't be able to use it.
 - We might forget some use case / make wrong assumptions on the data,
making it impossible for people to work around it unless they plug in a
different backend implementation.
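
As an illustration of option C, the ad hoc structure could be a couple
of plain Serializable classes carrying only what the receiving side
needs to rebuild each field. This is a sketch under assumptions:
FieldData, DocumentData and AdHocTransport are hypothetical names, and
the set of properties (name, text or numeric value, stored flag) is just
an example of the metadata we might pick. Plain Java serialization is
used for the transport here; under C+JBM that part would be swapped out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

// Hypothetical description of one field, with no reference to Lucene classes.
final class FieldData implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    final String text;    // set for text fields, null otherwise
    final Number number;  // set for numeric fields, null otherwise
    final boolean stored;

    private FieldData(String name, String text, Number number, boolean stored) {
        this.name = name;
        this.text = text;
        this.number = number;
        this.stored = stored;
    }

    static FieldData ofText(String name, String value, boolean stored) {
        return new FieldData(name, value, null, stored);
    }

    static FieldData ofNumber(String name, Number value, boolean stored) {
        return new FieldData(name, null, value, stored);
    }
}

// Hypothetical replacement for the Document carried by a LuceneWork.
final class DocumentData implements Serializable {
    private static final long serialVersionUID = 1L;

    final String entityId;
    final ArrayList<FieldData> fields = new ArrayList<>();

    DocumentData(String entityId) {
        this.entityId = entityId;
    }
}

final class AdHocTransport {
    static byte[] write(DocumentData doc) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(doc);
        }
        return buffer.toByteArray();
    }

    static DocumentData read(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (DocumentData) in.readObject();
        }
    }
}
```

Since numeric values travel as a plain java.lang.Number, this would
sidestep the NumericField problem, at the price of having to extend
FieldData for every Lucene feature we want to expose.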

####
WDYT?

[Davide, you're in CC as we were considering upgrading your
contributor status from beginner, to do some more hardcore stuff... how
would you feel about getting this one assigned?]

Cheers,
Sanne
