[infinispan-dev] data interoperability and remote querying

Manik Surtani msurtani at redhat.com
Thu Apr 11 06:21:33 EDT 2013


Good points here (and in Emmanuel's follow-up).  I hadn't considered re-indexing, which is very important.  The point you make below on multiple and inconsistent clients also makes a lot of sense, and as a design philosophy it is generally a good thing to push the responsibility for metadata extraction, index creation and management to the same place where the indexes are stored, i.e. on the server.

OK, so that means we have much simpler clients.  Great.  But it also means we absolutely need a transparent and portable serialisation protocol.

More on this in a separate response.

On 10 Apr 2013, at 18:57, Sanne Grinovero <sanne at infinispan.org> wrote:

> Let's make it more complex ;-)
> 
> # Rebuilding the index
> 
> If the server is unable to extract the metadata from the (binary) value,
> it won't be possible for it to rebuild the index. Indexes might need
> to be rebuilt for various reasons:
> - rolling upgrade: the index encoding changed in a new version
> - the index was corrupted and no backup is available (we don't really
> have a "dump index for backup" option anyway)
> - requirements on which parts of the data need to be indexed changed
> - requirements on HOW to index changed
> 
> # Indexing schema options
> 
> A common misconception is that we just need to know which properties
> you want indexed. There are actually many options related to how the
> indexing needs to be performed.
> Let's look at an example:
> 
> class Person {
>   String surname;  // Do you want case-insensitive matches? Should we
>                    // support Arabic characters?
>   int age;         // Are you going to need sort capabilities on this
>                    // field? Range queries maybe? Do you know the exact
>                    // min/max boundaries?
>   Date bornDate;   // Do you need millisecond precision? Minutes? Just
>                    // the day? (Let's even ignore time zones.)
>   String notes;    // Which language is this expected to be in? Will you
>                    // need auto-completion, synonym matches,
>                    // More-Like-This functionality, ... ?
>   ...
> }
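> 
> Just to make those choices concrete: purely as an illustration (not a
> committed design), they would typically surface as Hibernate Search
> mapping options, something like:
> 
> import org.hibernate.search.annotations.*;
> 
> @Indexed
> class Person {
>   @Field(analyze = Analyze.NO)              // exact, case-sensitive matches only
>   String surname;
> 
>   @Field @NumericField                      // enables range queries and sorting
>   int age;
> 
>   @Field(analyze = Analyze.NO)
>   @DateBridge(resolution = Resolution.DAY)  // index to day precision only
>   Date bornDate;
> 
>   @Field(analyze = Analyze.YES)             // tokenised for full-text matching
>   @Analyzer(definition = "english")         // "english" is an assumed @AnalyzerDef name
>   String notes;
> }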
> 
> The key problem is not that you can't encode all the answers to my
> questions above in metadata from the client side, but what to do with
> the data already in the grid when the requirements change: for example,
> you didn't initially need a range query on the age property, but then
> the application evolves and needs one. It would not be nice in such a
> case to have to clear() the grid and have the client re-dump all the
> state.
> 
> # Multiple clients / Inconsistent clients
> 
> One client might be uploading Person instances and generally need only
> exact matches on "surname", but another client might need full-text
> queries on the "notes" field. Databases are a common point of
> information exchange between different applications (clients), and it
> must be possible to upgrade one external application without requiring
> all the other applications connected to the same grid to be updated as
> well.
> 
> Sanne
> 
> 
> On 10 April 2013 17:45, Manik Surtani <msurtani at redhat.com> wrote:
>> Yes.  We haven't quite designed how remote querying will work, but we have a few ideas.  First, let me explain how in-VM indexing works.  An object's fields are appropriately annotated so that, when the object is stored in Infinispan with a put(), Hibernate Search can extract the fields and values, flatten them into a Lucene-friendly "document", and associate that document with the entry's key for searching later.
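>> 
>> As a rough sketch of that embedded flow (assuming an indexing-enabled cache, a Person mapped with Hibernate Search annotations, and the embedded Infinispan Query API; this is illustration only, not the remote design):
>> 
>> import java.util.List;
>> import org.apache.lucene.search.Query;
>> import org.infinispan.Cache;
>> import org.infinispan.query.CacheQuery;
>> import org.infinispan.query.Search;
>> import org.infinispan.query.SearchManager;
>> 
>> Cache<String, Person> cache = ...;                 // an indexing-enabled cache
>> cache.put("p1", person);                           // fields are extracted and indexed on put()
>> 
>> SearchManager sm = Search.getSearchManager(cache);
>> Query luceneQuery = sm.buildQueryBuilderForClass(Person.class).get()
>>       .keyword().onField("surname").matching("Markus").createQuery();
>> CacheQuery q = sm.getQuery(luceneQuery, Person.class);
>> List<Object> matches = q.list();                   // entries whose surname matched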
>> 
>> Now, one approach to doing this when storing objects remotely is to use a serialisation format that can be parsed on the server side for easy indexing.  An example of this could be JSON (an appropriate transformation would need to exist on the server side to strip out irrelevant fields before indexing).  This would be completely platform-independent and would also support the interop you described below.  The drawback?  Slow JSON serialisation and deserialisation, and a very verbose data stream.
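>> 
>> For example, a Person value sent as JSON might look something like this on the wire (the "_class" hint is just an assumption, to show where a server-side transformation could find the indexable fields):
>> 
>> { "_class": "org.example.Person", "surname": "Markus", "age": 30, "notes": "..." }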
>> 
>> Another approach may be to perform the field extraction on the client side, so that the data sent to the server would be key=XXX (binary), value=YYY (binary), indexing_metadata=ZZZ (JSON).  This way the server does not need to be able to parse the value for indexing, since the field data it needs is already provided in a platform-independent manner (JSON).  The benefit here is that keys and values can still be binary and can use an efficient marshaller.  The drawback is that field extraction needs to happen on the client.  This is not hard for the Java client (bits of Hibernate Search could be reused), but for non-Java clients it may increase complexity quite a bit (it is much easier for dynamic-language clients such as Python or Ruby).  This approach does *not* solve your problem below, because for interop you will still need a platform-independent serialisation mechanism like Avro or ProtoBufs for the object <--> blob <--> object conversion.
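>> 
>> Purely as an illustration of what that could look like from a Java client (none of this API exists today; the three-argument put() and the toJson() helper are hypothetical):
>> 
>> Map<String, Object> indexingMetadata = new LinkedHashMap<String, Object>();
>> indexingMetadata.put("surname", person.getSurname());
>> indexingMetadata.put("age", person.getAge());
>> 
>> byte[] key = marshaller.objectToByteBuffer(personKey);   // binary key via the client's efficient marshaller
>> byte[] value = marshaller.objectToByteBuffer(person);    // binary value, opaque to the server
>> remoteCache.put(key, value, toJson(indexingMetadata));   // hypothetical overload carrying the index metadata as JSON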
>> 
>> Personally, I prefer the second approach: it separates concerns (portable indexes vs. portable values) and would, IMO, lead to a better-performing implementation.  I'd love to hear others' thoughts, though.
>> 
>> Cheers
>> Manik
>> 
>> On 10 Apr 2013, at 17:11, Mircea Markus <mmarkus at redhat.com> wrote:
>> 
>>> That is, write the Person object in Java and read a Person object in C#; assume a Hot Rod client for simplicity.
>>> Now at some point we'll have to run a query over the same Hot Rod connection, something like "give me all the Persons named Mircea".
>>> At this stage, the server side needs to be aware of the Person object in order to run the query and select the relevant Persons: it needs a schema. Instead of suggesting Avro as a data interoperability protocol, we might want to define and use this schema instead: we'd need it anyway for remote querying, and we wouldn't end up with two ways of doing the same thing.
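>>> For instance (purely illustrative, not an agreed format), such a schema could be a ProtoBuf-style definition shared by the Java client, the C# client and the server:
>>> 
>>> message Person {
>>>   required string surname  = 1;
>>>   optional int32  age      = 2;
>>>   optional int64  bornDate = 3;  // epoch millis; precision to be agreed
>>>   optional string notes    = 4;
>>> }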
>>> Thoughts?
>>> 
>>> Cheers,
>>> --
>>> Mircea Markus
>>> Infinispan lead (www.infinispan.org)
>>> 
>> 
>> --
>> Manik Surtani
>> manik at jboss.org
>> twitter.com/maniksurtani
>> 
>> Platform Architect, JBoss Data Grid
>> http://red.ht/data-grid
>> 
>> 
> 

--
Manik Surtani
manik at jboss.org
twitter.com/maniksurtani

Platform Architect, JBoss Data Grid
http://red.ht/data-grid



