[infinispan-dev] Clustered queries and custom indexes

Mon Dec 15 07:41:31 EST 2014

Hi Jiri, David,
I only briefly skimmed through the attachments to get an idea of the
content, but it looks great at first sight!
I'll read it in depth during the upcoming holidays.

But I think I can answer some of your questions already:

1) Yes it's certainly doable, if you have enough time for it :)
But you got our attention, we're certainly interested to help.

2) It's probably good value for Infinispan to work on an abstraction
from the specific indexing engine, although a poorly implemented
abstraction would cost us in terms of performance so we should get
that right. Configuration complexity for users is also a frequent
concern, so let's try to keep that in mind too.
Once we have a proper separation from the current indexing/query
engine we can certainly add this as an alternative implementation;
this can live as an experimental module for a while and be integrated
depending on how far we get and how people like the additional
features.
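
To make the idea of such a separation a bit more concrete, here is a
minimal sketch in plain Java. Every name in it (IndexBackend, Hit,
WordCountBackend, ClusteredSearchSketch) is hypothetical and made up for
illustration; none of it is existing Infinispan or Hibernate Search API.
It assumes each node keeps an index of its own entries only, and that a
clustered query runs on every node and merges the partial results, as
described in the quoted message below.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical SPI: an assumption for illustration, not an existing Infinispan interface.
interface IndexBackend<K, V, Q> {
    void index(K key, V value);            // called when an entry is written
    void remove(K key);                    // called when an entry is removed
    List<Hit<K>> search(Q query, int max); // local entries only, best hits first
}

// A scored result; a higher score means a better match.
final class Hit<K> {
    final K key;
    final double score;
    Hit(K key, double score) { this.key = key; this.score = score; }
}

// Toy backend: the score is the number of words in the value equal to the query term.
final class WordCountBackend implements IndexBackend<String, String, String> {
    private final Map<String, String> docs = new HashMap<>();

    public void index(String key, String value) { docs.put(key, value); }

    public void remove(String key) { docs.remove(key); }

    public List<Hit<String>> search(String term, int max) {
        List<Hit<String>> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            int count = 0;
            for (String word : e.getValue().toLowerCase().split("\\s+")) {
                if (word.equals(term.toLowerCase())) count++;
            }
            if (count > 0) hits.add(new Hit<>(e.getKey(), count));
        }
        hits.sort(Comparator.comparingDouble((Hit<String> h) -> -h.score));
        return new ArrayList<>(hits.subList(0, Math.min(max, hits.size())));
    }
}

final class ClusteredSearchSketch {
    // Runs the query against every node's local backend, then merges and trims the results.
    static <K, Q> List<Hit<K>> query(List<? extends IndexBackend<K, ?, Q>> nodes, Q query, int max) {
        List<Hit<K>> merged = new ArrayList<>();
        for (IndexBackend<K, ?, Q> node : nodes) {
            merged.addAll(node.search(query, max));
        }
        merged.sort(Comparator.comparingDouble((Hit<K> h) -> -h.score));
        return new ArrayList<>(merged.subList(0, Math.min(max, merged.size())));
    }

    public static void main(String[] args) {
        WordCountBackend node1 = new WordCountBackend();
        WordCountBackend node2 = new WordCountBackend();
        node1.index("a", "cache cache grid");
        node2.index("b", "cache");
        node2.index("c", "grid");
        List<WordCountBackend> nodes = new ArrayList<>();
        nodes.add(node1);
        nodes.add(node2);
        for (Hit<String> hit : query(nodes, "cache", 2)) {
            System.out.println(hit.key + " " + hit.score);
        }
    }
}
```

In this sketch the Lucene-based engine and a custom similarity index
would simply be two implementations of the same interface; whether an
interface this small can be made to perform well is exactly the open
question raised above.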

3) In terms of design, I should probably read those papers in depth
first, but these are my early doubts:

# to Lucene / not to Lucene

I see in the presentation that Lucene is referred to as a good
solution for full-text search, and while that's true, it is actually
just an encoder/decoder/query engine for a vector space model. People
have built more than just text based Similarity on top of it.
Would this implementation be possible to run on top of Lucene indexes,
or is it required to use a completely different index management
solution?
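
As a rough illustration of what "vector space model" means here: each
object is reduced to a numeric vector, and a query is answered by ranking
the stored vectors by their similarity to the query vector. The sketch
below uses cosine similarity, which is one common choice. It is plain
Java with hypothetical names (VectorSpaceSketch, cosine, topK) and made-up
feature vectors; it does not use Lucene's API and is not the index
described in the papers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not Lucene's API and not the index from the papers.
final class VectorSpaceSketch {

    // Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the keys of the k stored vectors most similar to the query, best first.
    static List<String> topK(Map<String, double[]> index, double[] query, int k) {
        List<String> keys = new ArrayList<>(index.keySet());
        keys.sort(Comparator.comparingDouble((String key) -> -cosine(index.get(key), query)));
        return new ArrayList<>(keys.subList(0, Math.min(k, keys.size())));
    }

    public static void main(String[] args) {
        // Made-up three-dimensional feature vectors standing in for images.
        Map<String, double[]> index = new LinkedHashMap<>();
        index.put("img-1", new double[] {1.0, 0.0, 0.0});
        index.put("img-2", new double[] {0.9, 0.1, 0.0});
        index.put("img-3", new double[] {0.0, 0.0, 1.0});
        System.out.println(topK(index, new double[] {1.0, 0.0, 0.0}, 2));
    }
}
```

This is a linear scan, so it only shows the ranking step; a real engine
adds the index structures that avoid comparing the query against every
stored vector.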

# to Hibernate Search / not to Hibernate Search

Most of the current indexing/query code in Infinispan is based on
Hibernate Search, which handles the complexity of Lucene's resource
management, Query execution, and makes it easier for developers to map
their Domain model.
We're working on Hibernate Search to improve its flexibility on
dynamic models (more suited to Infinispan users), and also to not
necessarily work on Lucene in embedded mode but to delegate to "Lucene
like" services. That means it will probably always assume there is some
form of Similarity-capable, vector space model based engine to delegate
the hard work to, but not necessarily the Lucene project; we're
looking at alternatives like Apache Solr and ElasticSearch for now -
so essentially still Lucene based but typically running on a separate
dedicated cluster node(s).

You could think of integrating the index handling code into Hibernate
Search, whose functionality is automatically inherited by Infinispan,
or bypass Hibernate Search and integrate with Infinispan directly.
Depending on the "Lucene" question, be aware that Hibernate Search is
already able to provide functionality like Spatial queries and
indexing of PDF/Office files; although this last one is text based,
the Spatial integration works on numeric distance; the benefit is that
we can combine distance criteria with text criteria. I don't think it
would be hard to extend this model to support other implementations of
Similarity like the mentioned images and songs, in fact that would
probably be a relatively easy task if you already know which
Similarity implementation you want to use.
The benefit of integrating with Hibernate Search is that you would
address the needs of a much larger user base: the same functionality
is usable by Hibernate users (Java developers using relational
databases: we provide indexing and Similarity based queries on your
database stored data).

I'm just listing some options but don't intend to recommend any
without further details. While I'm leading the Hibernate Search
project, I see good value in a proper abstraction from Infinispan to a
pluggable (alternative) query strategy, although considering how many
details it takes to get right I doubt we'll ever be able to make an
effective competitor for the current one; so to answer the two points
we'd need a better understanding of what exactly you would need to
store in the "index" and how you think this can be maintained in sync
with the data.

Generally speaking I think all newcomers will be tempted to avoid both
Lucene and Hibernate Search so as not to have to learn too much, but
let's keep in mind that, since we don't have unlimited manpower, we need
to be smart, and these two engines do a lot of heavy work and are constantly
evolving in terms of performance. So unless the requirements don't fit
at all, I'd rather help to see what could be reused from these.
I haven't done much advanced research using Lucene myself, but I've
heard that several researchers use it as a "toolbox" to experiment
with new kinds of vector space based analytics, so I expect it should
be useful to keep around even in an alternative implementation.

Thanks,
Sanne

On 15 December 2014 at 11:35, Jiri Holusa <jholusa at redhat.com> wrote:
> Hi,
>
> there is some interesting research on similarity search at my university, led by David Novák (CC-ed). If anyone is interested, see [1][2][3].
>
> In short: they basically achieved similarity search on any data (images, songs, etc.) by creating a custom index that stores a "similarity vector" for each object in the database. This index can answer queries like "give me the most similar images to this example". So why am I posting this here?
>
> The architecture is designed on top of Infinispan and they want to use it to speed it up. Basically, they would like to distribute the entries across the cluster, with each node holding the similarity index of its own entries. Then, when a query comes in, it would be distributed to all the nodes, a custom search would be performed on each node's index, and the results returned. This is approximately what Index.LOCAL and ClusteredQuery could do.
>
> The difference is that the indexing and searching mechanism must be custom. So I wanted to ask what you think about implementing such a feature in Infinispan. I was thinking about somehow extracting a general API for indexing/searching, so that e.g. our Lucene search would become one implementation of it.
>
> I would be happy to take this on as a contribution, since I find this an extremely interesting topic and would also like to write a diploma thesis on it.
> So here are some questions:
> 1) Is it doable?
> 2) Do we want this feature?
> 3) How to design it/where to start?
>
> Any input is more than welcome :)
>
> Cheers,
> Jiri
>
> [1] https://drive.google.com/file/d/0B4sztQSfpi3rRlJBQjJHMkR2LXc/view
> [2] https://drive.google.com/file/d/0B4sztQSfpi3rU2p2MV9jRE9iTUk/view
> [3] https://drive.google.com/file/d/0B4sztQSfpi3rZUpld24ydzJNclk/view