data interoperability and remote querying

Mircea Markus

Wednesday, 10 April 2013

11:11 a.m.

That is, write the Person object in Java and read a Person object in C#; assume a hotrod client for simplicity.

Now at some point we'll have to run a query over the same hotrod, something like "give me all the Persons named Mircea". At this stage, the server side needs to be aware of the Person object in order to be able to run the query and select the relevant Persons. It needs a schema.

Instead of suggesting Avro as a data interoperability protocol, we might want to define and use this schema instead: we'd need it anyway for remote querying and we won't have two ways of doing the same thing.

Thoughts?

Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
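To make the schema point concrete, here is a minimal sketch in Python (not Infinispan code). It assumes a hypothetical schema held by the server and uses JSON purely as a stand-in wire format; the names PERSON_SCHEMA, encode and query, and the sample ages, are invented for illustration.

```python
import json

# Hypothetical schema the server would hold; field names are illustrative.
PERSON_SCHEMA = {"type": "Person", "fields": {"name": "string", "age": "int"}}


def encode(schema, obj):
    """Client side (any language): serialise only the fields the schema declares."""
    payload = {name: obj[name] for name in schema["fields"]}
    return json.dumps({"type": schema["type"], "data": payload}).encode("utf-8")


def query(schema, stored_values, field, expected):
    """Server side: decode each value through the schema and filter on one field."""
    if field not in schema["fields"]:
        raise KeyError(f"{field!r} is not declared in schema {schema['type']}")
    matches = []
    for raw in stored_values:
        record = json.loads(raw.decode("utf-8"))
        if record["type"] == schema["type"] and record["data"][field] == expected:
            matches.append(record["data"])
    return matches


# Made-up sample data: two Persons written by some client.
store = [
    encode(PERSON_SCHEMA, {"name": "Mircea", "age": 30}),
    encode(PERSON_SCHEMA, {"name": "Manik", "age": 31}),
]
result = query(PERSON_SCHEMA, store, "name", "Mircea")
```

The server never needs the Java or C# Person class here; the schema alone is enough both to read the value and to answer the query, which is the "one way of doing the same thing" argued for above.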

Manik Surtani

Wednesday, 10 April

11:45 a.m.

...

-- Manik Surtani manik(a)jboss.org twitter.com/maniksurtani Platform Architect, JBoss Data Grid http://red.ht/data-grid

Emmanuel Bernard

12:18 p.m.

I favor the first option for a few reasons:

- much easier client side implementations
  Frankly, rewriting the analyzer logic of Lucene in every language is not a piece of cake, and you are out of luck for custom analyzers
- more robust client implementation: if we change how indexing is done, clients don't have to change
- reindexing: if there is a need to rebuild the index, or if the user decides to reindex data differently, you must be able to read the data on the server side
- validation: if you want to implement (cross entry) validation, the server needs to be able to read the data
- async: validation and indexing can be done in an async way on the server and avoid perceived latency from a client request to the result

I'm not sure JSON should be the format though. As you said it's quite verbose and string is not exactly the most efficient way to process data.

Emmanuel

On Wed 2013-04-10 17:45, Manik Surtani wrote:

...

Yes. We haven't quite designed how remote querying will work, but we have a few ideas.

First, let me explain how in-VM indexing works. An object's fields are appropriately annotated so that when it is stored in Infinispan with a put(), Hibernate Search can extract the fields and values, flatten it into a Lucene-friendly "document", and associate it with the entry's key for searching later.

Now one approach to doing this when storing objects remotely is the serialisation format: a format that can be parsed on the server side for easy indexing. An example of this could be JSON (an appropriate transformation will need to exist on the server side to strip out irrelevant fields before indexing). This would be completely platform-independent, and also support the interop you described below. The drawback? Slow JSON serialisation and deserialisation, and a very verbose data stream.

Another approach may be to perform the field extraction on the client side, so that the data sent to the server would be key=XXX (binary), value=YYY (binary), indexing_metadata=ZZZ (JSON). This way the server does not need to be able to parse the value for indexing, since the field data it needs is already provided in a platform-independent manner (JSON). The benefit here is that keys and values can still be binary, and can use an efficient marshaller. The drawback is that field extraction needs to happen on the client. Not hard for the Java client (bits of Hibernate Search could be reused), but for non-Java clients this may increase complexity of those clients quite a bit (much easier for dynamic language clients - python/ruby). This approach does *not* solve your problem below, because for interop you will still need a platform-independent serialisation mechanism like Avro or ProtoBufs for the object <--> blob <--> object conversion.

Personally, I prefer the second approach since it separates concerns (portable indexes vs. portable values) plus would lead to (IMO) a better-performing implementation. I'd love to hear others' thoughts though.

Cheers
Manik
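The second approach (binary key, binary value, JSON indexing metadata) can be sketched roughly as follows. This is Python for brevity and only an illustration under assumptions: marshal is a toy stand-in for an efficient binary marshaller, and INDEXED_FIELDS, build_put, server_index and the sample data are hypothetical names and values, not part of Hot Rod or Infinispan.

```python
import json
import struct

INDEXED_FIELDS = ("name", "city")  # assumption: the client decides what is indexed


def marshal(*parts):
    """Stand-in for an efficient binary marshaller: length-prefixed UTF-8 strings."""
    out = b""
    for part in parts:
        data = part.encode("utf-8")
        out += struct.pack(">I", len(data)) + data
    return out


def build_put(key, person):
    """Client side: opaque key and value, plus the extracted fields as JSON."""
    metadata = {field: person[field] for field in INDEXED_FIELDS}
    return {
        "key": marshal(key),
        "value": marshal(person["name"], person["city"], person["notes"]),
        "indexing_metadata": json.dumps(metadata),
    }


def server_index(index, request):
    """Server side: index from the JSON alone; the value blob is never parsed."""
    for field, term in json.loads(request["indexing_metadata"]).items():
        index.setdefault((field, term), set()).add(request["key"])


index = {}
person = {"name": "Mircea", "city": "Exampleville", "notes": "not indexed"}
server_index(index, build_put("person-1", person))
hits = index[("name", "Mircea")]  # set holding the binary key of the match
```

In this sketch the server has no way to start indexing notes later without the client's help, because it cannot read the value blob.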

Mircea Markus

12:30 p.m.

On 10 Apr 2013, at 18:18, Emmanuel Bernard wrote:

...

+1 to all the points above. Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)

Manik Surtani

12:55 p.m.

On 10 Apr 2013, at 18:18, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

I'm not suggesting all the analyser logic. Just the extraction of indexed fields into name/value pairs, to be sent alongside the blob value.

...

- more robust client implementation: if we change how indexing is done clients don't have to change
- reindexing: if there is a need to rebuild the index, or if the user decides to reindex data differently, you must be able to read the data on the server side
- validation: if you want to implement (cross entry) validation, the server needs to be able to read the data.
- async, validation and indexing can be done in an async way on the server and avoid perceived latency from a client request to the result

Valid points above though.

...

I'm not sure JSON should be the format though. As you said it's quite verbose and string is not exactly the most efficient way to process data.

What would that format be, then?

...

Emmanuel Bernard

1:46 p.m.

On Wed 2013-04-10 18:55, Manik Surtani wrote:

...

On 10 Apr 2013, at 18:18, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
> I favor the first option for a few reasons:
>
> - much easier client side implementations
>   Frankly rewriting the analyzer logic of Lucene in every language is
>   not a piece of cake and you are out of luck for custom analysers

I'm not suggesting all the analyser logic. Just the extraction of indexed fields into name/value pairs, to be sent alongside the blob value.

Which means you make a selection already and possibly already reduce your precision for a given field. Which makes reindexing impossible.

...

> - more robust client implementation: if we change how indexing is done
>   clients don't have to change
> - reindexing: if there is a need to rebuild the index, or if the user
>   decides to reindex data differently, you must be able to read the data
>   on the server side
> - validation: if you want to implement (cross entry) validation, the
>   server needs to be able to read the data.
> - async, validation and indexing can be done in an async way on the
>   server and avoid perceived latency from a client request to the
>   result

Valid points above though.

> I'm not sure JSON should be the format though. As you said it's quite
> verbose and string is not exactly the most efficient way to process
> data.

What would that format be, then?

Good question :) BSON is not necessarily smaller than JSON; it is meant to be more parseable, afair. I did use Avro in Hibernate Search as I find Protocol Buffers and the others too rigid for my needs to pass arbitrary datasets. But if we have a schema and expect a given object type, then we can start saving a lot of space. In other words, no idea; that needs to be investigated.
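The remark that a known schema saves space can be illustrated with a small Python sketch. It assumes a hypothetical fixed schema (name: string, age: 32-bit int, active: bool) agreed by both sides; the layout and sample values are invented for illustration and are not Avro, Protocol Buffers or any Infinispan format.

```python
import json
import struct

# Made-up sample value.
person = {"name": "Mircea", "age": 30, "active": True}

# Self-describing encoding: field names travel with every value.
as_json = json.dumps(person).encode("utf-8")

# Schema-driven encoding: both sides already know the field order and types,
# so only a length prefix and the raw values are written.
name = person["name"].encode("utf-8")
as_binary = struct.pack(
    f">H{len(name)}sI?", len(name), name, person["age"], person["active"]
)


def decode(blob):
    """Rebuild the record using the agreed schema (name, age, active)."""
    (length,) = struct.unpack_from(">H", blob, 0)
    raw_name, age, active = struct.unpack_from(f">{length}sI?", blob, 2)
    return {"name": raw_name.decode("utf-8"), "age": age, "active": active}


sizes = {"json": len(as_json), "schema_binary": len(as_binary)}
```

The binary form is smaller because the field names and punctuation are implied by the schema rather than repeated in each value. The price is that a reader without the schema cannot interpret the bytes at all.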

Randall Hauch

2:51 p.m.

On Apr 10, 2013, at 1:46 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

>> I'm not sure JSON should be the format though. As you said it's quite
>> verbose and string is not exactly the most efficient way to process
>> data.
>
> What would that format be, then?

Good question :) BSON is not necessarily smaller than JSON; it is meant to be more parseable, afair. I did use Avro in Hibernate Search as I find Protocol Buffers and the others too rigid for my needs to pass arbitrary datasets. But if we have a schema and expect a given object type, then we can start saving a lot of space.

Actually, I would suspect that the JSON compresses much smaller than the size of the BSON. The advantage of BSON, however, is the additional types that are supported, including binary, timestamps, etc.

Sanne Grinovero

2:57 p.m.

...

On Wed 2013-04-10 18:55, Manik Surtani wrote:
> On 10 Apr 2013, at 18:18, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
>
> > I favor the first option for a few reasons:
> >
> > - much easier client side implementations
> >   Frankly rewriting the analyzer logic of Lucene in every language is
> >   not a piece of cake and you are out of luck for custom analysers
>
> I'm not suggesting all the analyser logic. Just the extraction of indexed fields into name/value pairs, to be sent alongside the blob value.

Which means you make a selection already and possibly already reduce your precision for a given field. Which makes reindexing impossible.

+1 It also adds larger payloads, and complexity and overhead to the clients, while the user might not be able to scale the client compute capability as it can with the data grid.

...

Right, let's keep this to collecting requirements:

- being able to upgrade the server without losing data
- being able to change the (soft) schema on the server
- read/write fields from different languages
- deal with multi-version control of values (i.e. being able to read an older value through an evolved schema, doing comparisons of same value even if it was stored using different schema generations)

Sanne
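The multi-version requirement in the list above could look something like the following Python sketch. It assumes each stored value carries the schema generation it was written with, and that fields added later declare a default; SCHEMAS, write, read and the country field are hypothetical names chosen for illustration only.

```python
# Hypothetical schema generations; generation 2 adds a field with a default.
SCHEMAS = {
    1: {"fields": ["name"]},
    2: {"fields": ["name", "country"], "defaults": {"country": "unknown"}},
}
LATEST = 2


def write(version, record):
    """Store a value tagged with the schema generation used to write it."""
    fields = SCHEMAS[version]["fields"]
    return {"schema_version": version, "data": {f: record[f] for f in fields}}


def read(stored):
    """Project a stored value onto the latest schema, filling in defaults."""
    latest = SCHEMAS[LATEST]
    defaults = latest.get("defaults", {})
    data = stored["data"]
    return {f: data.get(f, defaults.get(f)) for f in latest["fields"]}


old_value = write(1, {"name": "Mircea"})
new_value = write(2, {"name": "Mircea", "country": "unknown"})
same = read(old_value) == read(new_value)  # compared through the latest schema
```

Comparing values only after projecting both onto the same generation is one possible answer to "comparisons of same value even if it was stored using different schema generations"; other strategies, such as rewriting old values on upgrade, are equally conceivable.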

Randall Hauch

3:42 p.m.

Although I think generally the indexing functionality should be transparent to clients, ModeShape does need more control over how the indexable information is extracted from the cached values. Therefore, it would be great if there were a way for clients to specify the actual "metadata" representation (perhaps another POJO) that could be processed as discussed earlier.

The simple reason why ModeShape needs something like this is that the value objects that ModeShape puts into the Infinispan cache are DeltaAware objects that each wrap a single JSON/BSON document, and there's no POJO with annotations that Hibernate Search can directly understand. Also, the fields within the JSON/BSON documents contain namespaced values, and ModeShape's namespace registry can change at any time, so any "bridge" object created by Infinispan would need a reference to the ModeShape repository instance.

On Apr 10, 2013, at 2:57 PM, Sanne Grinovero <sanne(a)infinispan.org> wrote:

...

Weird, when I wrote my previous reply there were no other answers, and the rest of the thread appeared to me just now. Good to see that Emmanuel had replied highlighting the same problems. We can continue from there on this topic; just read mine to understand that there are a lot of options that need to be defined for each field: specifying it's a "varchar" is not enough. Some more thoughts inline:

Manik Surtani

Thursday, 11 April

5:30 a.m.

On 10 Apr 2013, at 20:57, Sanne Grinovero <sanne(a)infinispan.org> wrote:

...

Right, let's keep this to collecting requirements:

+1. Ok, so it seems we're all pretty much in agreement that metadata extraction and indexing should happen on the server side and not on the client. As I said before, this is good. Simple clients, support for re-indexing, support for changes in indexing characteristics, and the ability to save the world from AIDS. This puts a requirement on an efficient and portable serialisation format. Again, +1 to starting with defining what we need. Good start below, Sanne.

...

- being able to upgrade the server without losing data
- being able to change the (soft) schema on the server
- read/write fields from different languages
- deal with multi-version control of values (i.e. being able to read an older value through an evolved schema, doing comparisons of same value even if it was stored using different schema generations)

I'd add:

* Support for fast and easy translation to/from object model in high level language of choice (i.e., not manual parsing! Maybe some form of tooling, like a Maven plugin, to generate "IDL"-esque format)
* Serialisation efficiency (size and speed) should be considered

And in addition, I'd also list out existing technologies that fulfil some or all of these requirements that we can consider, look at extending, etc.

- Manik

Dan Berindei

6:35 a.m.

On Thu, Apr 11, 2013 at 1:30 PM, Manik Surtani <msurtani(a)redhat.com> wrote:

...

Besides the serialization format, how do we want to define the indexes on the server? Relying on Java classes with Lucene annotations on them doesn't sound like it would support indexing changes very well, because each node would index whatever annotations it had loaded at the moment. So I guess we need a separate indexing configuration, modifiable at runtime, and with annotations as a backup.
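A separate indexing configuration with annotations as a fallback might be shaped like this Python sketch. It is an assumption-laden illustration: IndexConfig and its methods are hypothetical, the annotated_defaults dict merely stands in for whatever the annotated classes declare, and distributing a change consistently to every node is deliberately left out.

```python
class IndexConfig:
    """Runtime-modifiable index definitions with class-level defaults as backup."""

    def __init__(self, annotated_defaults):
        # Stand-in for what annotations on the loaded classes would declare.
        self._defaults = dict(annotated_defaults)
        # Definitions changed at runtime; these win over the defaults.
        self._overrides = {}

    def set_indexed_fields(self, type_name, fields):
        """Change which fields are indexed for a type without redeploying classes."""
        self._overrides[type_name] = tuple(fields)

    def indexed_fields(self, type_name):
        """Runtime configuration first, annotation-derived defaults second."""
        if type_name in self._overrides:
            return self._overrides[type_name]
        return self._defaults.get(type_name, ())


config = IndexConfig({"Person": ("name",)})
before = config.indexed_fields("Person")
config.set_indexed_fields("Person", ["name", "dateOfBirth"])
after = config.indexed_fields("Person")
```

Because the lookup goes through one configuration object rather than through whatever annotations a node happened to load, every node that shares the same configuration state indexes the same fields.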

...

I'd add support for random access for reads. If the user only needs to index a Person's date of birth, it would be nice if we could read only the dateOfBirth field and index that.
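Random access for reads could be supported by a layout that keeps a small field table in front of the values. The Python sketch below assumes such a layout (field count, header length, then name/offset/length entries, then the value bytes); the layout, the function names and the sample date are invented for illustration and do not describe any existing format.

```python
import struct


def encode(fields):
    """Layout: count (1 byte), header length (4 bytes), field table, values."""
    header, body = b"", b""
    for name, value in fields.items():
        raw_name, raw_value = name.encode("utf-8"), value.encode("utf-8")
        header += struct.pack(">B", len(raw_name)) + raw_name
        header += struct.pack(">II", len(body), len(raw_value))  # offset, length
        body += raw_value
    return struct.pack(">BI", len(fields), len(header)) + header + body


def read_field(blob, wanted):
    """Walk only the field table, then slice a single value out of the body."""
    count, header_len = struct.unpack_from(">BI", blob, 0)
    pos, body_start = 5, 5 + header_len
    for _ in range(count):
        (name_len,) = struct.unpack_from(">B", blob, pos)
        name = blob[pos + 1 : pos + 1 + name_len].decode("utf-8")
        offset, length = struct.unpack_from(">II", blob, pos + 1 + name_len)
        if name == wanted:
            start = body_start + offset
            return blob[start : start + length].decode("utf-8")
        pos += 1 + name_len + 8
    raise KeyError(wanted)


# Made-up sample: a large unindexed field next to the one we care about.
blob = encode({"name": "Mircea", "dateOfBirth": "1970-01-01", "bio": "x" * 1000})
date_of_birth = read_field(blob, "dateOfBirth")
```

Only the field table and the requested slice are decoded; the large bio value is skipped entirely, which is the saving the requirement is after.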

...

Sanne Grinovero

7:30 a.m.

On 11 April 2013 12:35, Dan Berindei <dan.berindei(a)gmail.com> wrote:

...

You're right. Something like this: https://github.com/hibernate/hibernate-search/blob/master/hibernate-searc...

...

Mircea Markus

Wednesday, 10 April

12:29 p.m.

On 10 Apr 2013, at 17:45, Manik Surtani wrote:

...

Yes. We haven't quite designed how remote querying will work, but we have a few ideas.

Thanks for sharing :-)

...

First, let me explain how in-VM indexing works. An object's fields are appropriately annotated so that when it is stored in Infinispan with a put(), Hibernate Search can extract the fields and values, flatten it into a Lucene-friendly "document", and associate it with the entry's key for searching later. Now one approach to doing this when storing objects remotely is the serialisation format. A format that can be parsed on the server side for easy indexing. An example of this could be JSON (an appropriate transformation will need to exist on the server side to strip out irrelevant fields before indexing). This would be completely platform-independent, and also support the interop you described below. The drawback? Slow JSON serialisation and deserialization, and a very verbose data stream.

What about using our own object definition, based on a fixed number of supported types: e.g. int, long, bigdecimal, String, Date and some more. Each client object would need to implement the logic to serialize and deserialize itself into this format, using some StreamWriters, a bit like our serializers today. The StreamWriters would be provided by us, for every supported programming language, and would have methods like writeInt, writeLong etc. Another nice thing we can add to this object scheme is versioning, which is useful for rolling upgrades.

The server side would then index the known types using Lucene. The client should be able to define queries based on these objects and supported types (the query semantic to be defined).

Disclaimer: not an original idea, there is already a similar approach used by other data grid providers.
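A rough Python sketch of the StreamWriter idea, for illustration only. The writer and reader classes, their method names (modelled on the writeInt/writeLong mentioned above), the version check and the id_number field are all assumptions, not an existing Infinispan API.

```python
import struct


class StreamWriter:
    """Hypothetical writer supporting a fixed set of portable types."""

    def __init__(self):
        self._chunks = []

    def write_int(self, value):
        self._chunks.append(struct.pack(">i", value))  # 32-bit big-endian

    def write_long(self, value):
        self._chunks.append(struct.pack(">q", value))  # 64-bit big-endian

    def write_string(self, value):
        data = value.encode("utf-8")
        self.write_int(len(data))
        self._chunks.append(data)

    def to_bytes(self):
        return b"".join(self._chunks)


class StreamReader:
    """Mirror of StreamWriter; reads values back in the order they were written."""

    def __init__(self, data):
        self._data, self._pos = data, 0

    def _take(self, fmt):
        (value,) = struct.unpack_from(fmt, self._data, self._pos)
        self._pos += struct.calcsize(fmt)
        return value

    def read_int(self):
        return self._take(">i")

    def read_long(self):
        return self._take(">q")

    def read_string(self):
        length = self.read_int()
        raw = self._data[self._pos : self._pos + length]
        self._pos += length
        return raw.decode("utf-8")


class Person:
    VERSION = 1  # written first so readers can detect older or newer layouts

    def __init__(self, name, id_number):
        self.name, self.id_number = name, id_number

    def write_to(self, writer):
        writer.write_int(self.VERSION)
        writer.write_string(self.name)
        writer.write_long(self.id_number)

    @classmethod
    def read_from(cls, reader):
        version = reader.read_int()
        if version != cls.VERSION:
            raise ValueError(f"unsupported Person version {version}")
        return cls(reader.read_string(), reader.read_long())


writer = StreamWriter()
Person("Mircea", 42).write_to(writer)
copy = Person.read_from(StreamReader(writer.to_bytes()))
```

A C# or Java client would need an equivalent pair of classes producing the same bytes; the wire layout, not the classes, is what makes the value portable.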

...

Another approach may be to perform the field extraction on the client side, so that the data sent to the server would be key=XXX (binary), value=YYY (binary), indexing_metadata=ZZZ (JSON). This way the server does not need to be able to parse the value for indexing, since the field data it needs is already provided in a platform-independent manner (JSON). The benefit here is that keys and values can still be binary, and can use an efficient marshaller. The drawback, is that field extraction needs to happen on the client. Not hard for the Java client (bits of Hibernate Search could be reused), but for non-Java clients this may increase complexity of those clients quite a bit (much easier for dynamic language clients - python/ruby).

The client would need to build a Lucene index itself and send it to the server; I guess Sanne/Emmanuel can comment more on the complexity involved here. Here are some limitations I see to this approach:

- cannot define an index at runtime. If we want to do that, the client would need to storm all the data in the system and re-index it.
- cannot run a query for data that is not indexed. I think this is a pretty common requirement as well.

...

This approach does *not* solve your problem below, because for interop you will still need a platform-independent serialisation mechanism like Avro or ProtoBufs for the object <--> blob <--> object conversion.

Indeed. I think we should decide what approach we take and if we go for the former, not even suggest Apache Avro but implement our own scheme.

...

Personally, I prefer the second approach since it separates concerns (portable indexes vs. portable values) plus would lead to (IMO) a better-performing implementation. I'd love to hear others' thoughts though.

I don't like the first approach because of the marshalling overhead. The second seems complex, doesn't scale (requires the implementation of indexing for every programming language) and is limiting (indexes need to be defined a priori, cannot query for non-indexed data).

...

Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)

Manik Surtani

Thursday, 11 April

5:24 a.m.

On 10 Apr 2013, at 18:29, Mircea Markus <mmarkus(a)redhat.com> wrote:

...

Sounds a LOT like ProtoBufs. Or - yuck - CORBA. But generally, wheel-reinvention? Why can't we use an existing library that provides this?
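To make the proposal being discussed concrete, here is a minimal sketch of an object that serializes itself through typed write calls (writeInt, writeLong and so on) and carries a schema version. Everything in it is an assumption for illustration: `PortablePerson`, `SCHEMA_VERSION` and the field layout are hypothetical names, not part of Infinispan or Hot Rod, and the JDK's `DataOutputStream` merely stands in for the stream writers Mircea describes. Note that `writeUTF` uses Java's modified UTF-8, so a real cross-language format would have to define its own string encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical example type; not an Infinispan or Hot Rod class.
final class PortablePerson {
    // Written first so a reader can detect the layout (the versioning idea for rolling upgrades).
    static final int SCHEMA_VERSION = 1;

    final String surname;
    final int age;
    final long bornMillis;

    PortablePerson(String surname, int age, long bornMillis) {
        this.surname = surname;
        this.age = age;
        this.bornMillis = bornMillis;
    }

    // Writes the fields in a fixed order using only a small set of supported types.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(SCHEMA_VERSION);
            out.writeUTF(surname);
            out.writeInt(age);
            out.writeLong(bornMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the fields back in the same order, rejecting unknown versions.
    static PortablePerson fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int version = in.readInt();
            if (version != SCHEMA_VERSION) {
                throw new IllegalArgumentException("Unsupported schema version: " + version);
            }
            String surname = in.readUTF();
            int age = in.readInt();
            long bornMillis = in.readLong();
            return new PortablePerson(surname, age, bornMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PortablePerson copy = fromBytes(new PortablePerson("Markus", 30, 0L).toBytes());
        System.out.println(copy.surname + " " + copy.age);
    }
}
```

The sketch also shows why the objection above has weight: the field order, type tags and version handling written by hand here are what an existing library such as Protocol Buffers already generates from a schema.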

...

> Another approach may be to perform the field extraction on the client side, so that the data sent to the server would be key=XXX (binary), value=YYY (binary), indexing_metadata=ZZZ (JSON). This way the server does not need to be able to parse the value for indexing, since the field data it needs is already provided in a platform-independent manner (JSON). The benefit here is that keys and values can still be binary, and can use an efficient marshaller. The drawback is that field extraction needs to happen on the client. Not hard for the Java client (bits of Hibernate Search could be reused), but for non-Java clients this may increase complexity of those clients quite a bit (much easier for dynamic language clients - python/ruby).

The client would need to build a Lucene index itself and send it to the server; I guess Sanne/Emmanuel can comment more on the complexity involved here. Here are some limitations I see to this approach:
- cannot define an index at runtime. If we want to do that, the client would need to storm all the data in the system and re-index it.
- cannot run a query for data that is not indexed. I think this is a pretty common requirement as well.

> This approach does *not* solve your problem below, because for interop you will still need a platform-independent serialisation mechanism like Avro or ProtoBufs for the object <--> blob <--> object conversion.

Indeed. I think we should decide what approach we take and, if we go for the former, not even suggest Apache Avro but implement our own scheme.

See above. Why implement our own? Portable and efficient object serialisation is an entire sub-field of computer science in itself; do we _really_ want to commit to building and maintaining our own?
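For comparison, the second approach quoted above (opaque binary key and value plus JSON indexing metadata extracted on the client) could look roughly like the sketch below. All names here (`IndexedEntry`, `toJson`, `escape`) are hypothetical and exist only for this illustration; no such types are part of any Hot Rod client. The JSON writer is deliberately tiny: it handles strings, Integer and Long only, and escapes just quotes and backslashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "key, value, indexing_metadata" triple; not a Hot Rod API.
final class IndexedEntry {
    final byte[] key;              // opaque to the server
    final byte[] value;            // opaque to the server, any efficient marshaller
    final String indexingMetadata; // JSON the server can read without parsing the value

    IndexedEntry(byte[] key, byte[] value, Map<String, Object> indexedFields) {
        this.key = key;
        this.value = value;
        this.indexingMetadata = toJson(indexedFields);
    }

    // Minimal JSON object writer: Integer and Long are written bare, everything else as a string.
    static String toJson(Map<String, Object> fields) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            if (!first) {
                json.append(',');
            }
            first = false;
            json.append('"').append(escape(field.getKey())).append("\":");
            Object v = field.getValue();
            if (v instanceof Integer || v instanceof Long) {
                json.append(v);
            } else {
                json.append('"').append(escape(String.valueOf(v))).append('"');
            }
        }
        return json.append('}').toString();
    }

    // Escapes only backslashes and double quotes; control characters are not handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("surname", "Markus");
        fields.put("age", 30);
        byte[] key = "person-1".getBytes(StandardCharsets.UTF_8);
        byte[] value = new byte[] {1, 2, 3}; // stands in for the marshalled Person
        System.out.println(new IndexedEntry(key, value, fields).indexingMetadata);
    }
}
```

Because the server only ever sees the metadata chosen at write time, a field that was not extracted when the entry was stored cannot be indexed later without the client resending the data, which is the limitation raised in the quoted text.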

...

> Personally, I prefer the second approach since it separates concerns (portable indexes vs. portable values) plus would lead to (IMO) a better-performing implementation. I'd love to hear others' thoughts though.

I don't like the first approach because of the marshalling overhead. The former

You mean the latter?

...

seems complex, doesn't scale (requires the implementation of indexing for every programming language) and is limiting (indexes need to be defined a priori, cannot query for non-indexed data).

...

> Cheers
> Manik
>
> On 10 Apr 2013, at 17:11, Mircea Markus <mmarkus(a)redhat.com> wrote:
>
>> That is write the Person object in Java and read a Person object in C#, assume a hotrod client for simplicity.
>> Now at some point we'll have to run a query over the same hotrod, something like "give me all the Persons named Mircea".
>> At this stage, the server side needs to be aware of the Person object in order to be able to run the query and select the relevant Persons. It needs a schema. Instead of suggesting Avro as an data interoperability protocol, we might want to define and use this schema instead: we'd need it anyway for remote querying and we won't have two ways of doing the same thing.
>> Thoughts?
>>
>> Cheers,
>> --
>> Mircea Markus
>> Infinispan lead (www.infinispan.org)

Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)

_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev

-- Manik Surtani manik(a)jboss.org twitter.com/maniksurtani Platform Architect, JBoss Data Grid http://red.ht/data-grid

Sanne Grinovero

Wednesday, 10 April

12:57 p.m.

...

Manik Surtani

Thursday, 11 April

5:21 a.m.

Good points here (and in Emmanuel's follow-up). I didn't consider re-indexing, which is very important. The point you made below on multiple and inconsistent clients makes a lot of sense as well, and it is generally good design philosophy to push the responsibility for metadata extraction, index creation and management to the same place where the indexes are stored, i.e. on the server.

Ok, so then that means we have much simpler clients. Great. But then it means we absolutely need a transparent and portable serialisation protocol. More on this in a separate response.

On 10 Apr 2013, at 18:57, Sanne Grinovero <sanne(a)infinispan.org> wrote:

...

Let's make it more complex ;-)

# Rebuilding the index

If the server is unable to extract the metadata from the (binary) value, it won't be possible for it to rebuild the index. Indexes might need to be rebuilt for various reasons:
- rolling upgrade: the index encoding changed in a new version
- the index was corrupted and no backup is available (we don't really have a "dump index for backup" option anyway)
- requirements on which parts of the data need to be indexed changed
- requirements on HOW to index changed

# Indexing schema options

A common misconception is that we just need to know the property you want to be indexed. There are actually many options related to how this encoding needs to be performed. Let's make an example:

class Person {
    String surname;  <-- Do you want case insensitive matches? Should we support Arabic characters?
    int age;         <-- Are you going to need sort capabilities on this field? Range queries maybe? Do you know the exact min/max boundaries?
    Date bornDate;   <-- Do you need milliseconds precision / minutes? Just day? (Let's even ignore timezone)
    String notes;    <-- Which language is this expected to be? Will you need auto-completion, synonym matches, More-Like-This functionality .... ... .. ? ..
}

The key problem is not that you can't encode all answers to my questions above in the metadata from the client side, but what to do with the existing data which is in the grid when the requirements change: for example you didn't initially need a RangeQuery on the age property, but then the application evolves and now needs one. It would not be nice in such a case to need to clear() the grid and have the client re-dump all the state.

# Multiple clients / Inconsistent clients

One client might be uploading Person instances and generally need only exact matches on "surname", but then another client might need full text query on the "notes" field. Databases are a common point of information exchange between different applications (clients) and it must be possible to upgrade one external application (client) without requiring an update of all other applications connected to the same grid.

Sanne

On 10 April 2013 17:45, Manik Surtani <msurtani(a)redhat.com> wrote:

> Yes. We haven't quite designed how remote querying will work, but we have a few ideas. First, let me explain how in-VM indexing works. An object's fields are appropriately annotated so that when it is stored in Infinispan with a put(), Hibernate Search can extract the fields and values, flatten it into a Lucene-friendly "document", and associate it with the entry's key for searching later.
>
> Now one approach to doing this when storing objects remotely is the serialisation format. A format that can be parsed on the server side for easy indexing. An example of this could be JSON (an appropriate transformation will need to exist on the server side to strip out irrelevant fields before indexing). This would be completely platform-independent, and also support the interop you described below. The drawback? Slow JSON serialisation and deserialization, and a very verbose data stream.
>
> Another approach may be to perform the field extraction on the client side, so that the data sent to the server would be key=XXX (binary), value=YYY (binary), indexing_metadata=ZZZ (JSON). This way the server does not need to be able to parse the value for indexing, since the field data it needs is already provided in a platform-independent manner (JSON). The benefit here is that keys and values can still be binary, and can use an efficient marshaller. The drawback is that field extraction needs to happen on the client. Not hard for the Java client (bits of Hibernate Search could be reused), but for non-Java clients this may increase complexity of those clients quite a bit (much easier for dynamic language clients - python/ruby).
>
> This approach does *not* solve your problem below, because for interop you will still need a platform-independent serialisation mechanism like Avro or ProtoBufs for the object <--> blob <--> object conversion.
>
> Personally, I prefer the second approach since it separates concerns (portable indexes vs. portable values) plus would lead to (IMO) a better-performing implementation. I'd love to hear others' thoughts though.
>
> Cheers
> Manik
>
> On 10 Apr 2013, at 17:11, Mircea Markus <mmarkus(a)redhat.com> wrote:
>
>> That is write the Person object in Java and read a Person object in C#, assume a hotrod client for simplicity.
>> Now at some point we'll have to run a query over the same hotrod, something like "give me all the Persons named Mircea".
>> At this stage, the server side needs to be aware of the Person object in order to be able to run the query and select the relevant Persons. It needs a schema. Instead of suggesting Avro as an data interoperability protocol, we might want to define and use this schema instead: we'd need it anyway for remote querying and we won't have two ways of doing the same thing.
>> Thoughts?
>>
>> Cheers,
>> --
>> Mircea Markus
>> Infinispan lead (www.infinispan.org)
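Sanne's point about changing requirements can be illustrated with a toy model: if the server can read the stored values, switching an indexing option only requires rebuilding the index from what is already stored, with no client re-upload. The sketch below is an assumption-laden illustration, not Infinispan, Hibernate Search or Lucene code; `ReindexableStore`, `setCaseInsensitive` and `findBySurname` are hypothetical names, values are modelled as plain field maps, and overwriting an existing key is not handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy server; real indexing would be done by Lucene, not a HashMap.
final class ReindexableStore {
    private final Map<String, Map<String, String>> entries = new HashMap<>(); // key -> readable fields
    private final Map<String, Set<String>> surnameIndex = new HashMap<>();    // term -> matching keys
    private boolean caseInsensitive; // one example of a "HOW to index" option

    // Stores the entry and indexes it with the current options (overwrites are not handled).
    void put(String key, Map<String, String> fields) {
        entries.put(key, fields);
        index(key, fields);
    }

    // Changing the option rebuilds the index purely from the values the server already holds.
    void setCaseInsensitive(boolean value) {
        caseInsensitive = value;
        surnameIndex.clear();
        entries.forEach(this::index);
    }

    Set<String> findBySurname(String surname) {
        return surnameIndex.getOrDefault(normalize(surname), new TreeSet<>());
    }

    private void index(String key, Map<String, String> fields) {
        String surname = fields.get("surname");
        if (surname != null) {
            surnameIndex.computeIfAbsent(normalize(surname), term -> new TreeSet<>()).add(key);
        }
    }

    private String normalize(String term) {
        return caseInsensitive ? term.toLowerCase(Locale.ROOT) : term;
    }

    public static void main(String[] args) {
        ReindexableStore store = new ReindexableStore();
        Map<String, String> person = new HashMap<>();
        person.put("surname", "Markus");
        store.put("k1", person);
        System.out.println(store.findBySurname("markus")); // exact-match index: no hit expected
        store.setCaseInsensitive(true);                     // requirement changes, server reindexes
        System.out.println(store.findBySurname("markus"));
    }
}
```

With client-side extraction, the equivalent of `setCaseInsensitive` would require every client to re-send its entries, which is the clear()-and-re-dump scenario the message warns about.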

-- Manik Surtani manik(a)jboss.org twitter.com/maniksurtani Platform Architect, JBoss Data Grid http://red.ht/data-grid


infinispan-dev@lists.jboss.org


participants (6)

Dan Berindei
Emmanuel Bernard
Manik Surtani
Mircea Markus
Randall Hauch
Sanne Grinovero
