On 10 Apr 2013, at 18:29, Mircea Markus <mmarkus(a)redhat.com> wrote:
On 10 Apr 2013, at 17:45, Manik Surtani wrote:
> Yes. We haven't quite designed how remote querying will work, but we have a few
ideas.
Thanks for sharing :-)
> First, let me explain how in-VM indexing works. An object's fields are
appropriately annotated so that when it is stored in Infinispan with a put(), Hibernate
Search can extract the fields and values, flatten it into a Lucene-friendly
"document", and associate it with the entry's key for searching later.
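To make the flattening step concrete, here is a minimal sketch in Python rather than the actual Hibernate Search code. The `indexed()` marker, the `Person`/`Address` classes and the dict-based `index` are all hypothetical stand-ins for the real annotations and the Lucene index.

```python
from dataclasses import dataclass, field, fields, is_dataclass


def indexed(**kwargs):
    # Hypothetical marker standing in for a Hibernate Search style annotation.
    return field(metadata={"indexed": True}, **kwargs)


@dataclass
class Address:
    city: str = indexed()
    postcode: str = ""        # not marked, so it never reaches the index


@dataclass
class Person:
    name: str = indexed()
    address: Address = indexed()
    password_hash: str = ""   # not marked, so it never reaches the index


def flatten(obj, prefix=""):
    """Return a flat {dotted.field.name: value} dict of the marked fields."""
    doc = {}
    for f in fields(obj):
        if not f.metadata.get("indexed"):
            continue
        value = getattr(obj, f.name)
        if is_dataclass(value):
            # Nested objects are flattened into dotted field names.
            doc.update(flatten(value, prefix + f.name + "."))
        else:
            doc[prefix + f.name] = value
    return doc


def index_entry(index, key, obj):
    # Associate the flattened "document" with the entry's key for later search.
    index[key] = flatten(obj)
```

The point of the sketch is only that indexing needs access to the object's fields at put() time, which is what becomes hard once the value arrives at the server as an opaque blob.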
>
> Now one approach to doing this when storing objects remotely is to choose a serialisation
format that can be parsed on the server side for easy indexing. An example of
this could be JSON (an appropriate transformation will need to exist on the server side to
strip out irrelevant fields before indexing). This would be completely
platform-independent, and also support the interop you described below. The drawback?
Slow JSON serialisation and deserialization, and a very verbose data stream.
What about using our own object definition, based on a fixed number of supported types:
e.g. int, long, bigdecimal, String, Date and some more. Each client object would need to
implement the logic to serialize and deserialize itself into this format, using some
StreamWriters, a bit like our serializers today.
The StreamWriters would be provided by us, for every supported programming
language, and would have methods like writeInt, writeLong etc.
Another nice thing we can add to this object scheme is versioning, which is useful for
rolling upgrades.
The server side would then index the known types using lucene. The client should be able
to define queries based on these objects and supported types (the query semantic to be
defined).
Disclaimer: this is not an original idea; a similar approach is already used by other
data grid providers.
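As a rough illustration of the proposal above (not an agreed design), a writer with writeInt/writeLong-style methods and a version tag might look like the following Python sketch. The class names, the big-endian wire layout and the rule "version 2 added a long id" are all assumptions made up for the example.

```python
import struct
from io import BytesIO


class StreamWriter:
    """Hypothetical writer supporting a fixed set of primitive types."""

    def __init__(self):
        self._buf = BytesIO()

    def write_int(self, v):
        self._buf.write(struct.pack(">i", v))   # 4-byte big-endian int

    def write_long(self, v):
        self._buf.write(struct.pack(">q", v))   # 8-byte big-endian long

    def write_string(self, v):
        data = v.encode("utf-8")
        self.write_int(len(data))               # length prefix, then bytes
        self._buf.write(data)

    def to_bytes(self):
        return self._buf.getvalue()


class StreamReader:
    def __init__(self, data):
        self._buf = BytesIO(data)

    def read_int(self):
        return struct.unpack(">i", self._buf.read(4))[0]

    def read_long(self):
        return struct.unpack(">q", self._buf.read(8))[0]

    def read_string(self):
        return self._buf.read(self.read_int()).decode("utf-8")


class Person:
    SCHEMA_VERSION = 2  # assumed: version 2 added person_id

    def __init__(self, name, age, person_id=0):
        self.name, self.age, self.person_id = name, age, person_id

    def write_to(self, out):
        out.write_int(self.SCHEMA_VERSION)
        out.write_string(self.name)
        out.write_int(self.age)
        out.write_long(self.person_id)

    @classmethod
    def read_from(cls, inp):
        version = inp.read_int()
        name, age = inp.read_string(), inp.read_int()
        # Older payloads lack the field, so fall back to a default.
        person_id = inp.read_long() if version >= 2 else 0
        return cls(name, age, person_id)
```

The version check in `read_from` is what would let old and new clients coexist during a rolling upgrade; a server that knows the same layout could index `name` and `age` without any Java class for `Person`.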
Sounds a LOT like ProtoBufs. Or - yuck - CORBA. But generally, wheel-reinvention? Why
can't we use an existing library that provides this?
>
> Another approach may be to perform the field extraction on the client side, so that
the data sent to the server would be key=XXX (binary), value=YYY (binary),
indexing_metadata=ZZZ (JSON). This way the server does not need to be able to parse the
value for indexing, since the field data it needs is already provided in a
platform-independent manner (JSON). The benefit here is that keys and values can still be
binary, and can use an efficient marshaller. The drawback is that field extraction needs
to happen on the client. Not hard for the Java client (bits of Hibernate Search could be
reused), but for non-Java clients this may increase complexity of those clients quite a
bit (much easier for dynamic language clients - python/ruby).
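A minimal sketch of this second approach, assuming a request made of three parts as described: binary key, binary value, and JSON indexing metadata. Here `pickle` is only a stand-in for whatever efficient marshaller a client uses, and `build_put_request`, `server_index` and the dict-based index are hypothetical names, not Hot Rod API.

```python
import json
import pickle


def build_put_request(key, value, indexed_fields):
    """Client side: extract the indexed fields, marshal key and value as blobs."""
    metadata = {name: value[name] for name in indexed_fields}
    return {
        "key": pickle.dumps(key),          # opaque binary to the server
        "value": pickle.dumps(value),      # opaque binary to the server
        "indexing_metadata": json.dumps(metadata, sort_keys=True),
    }


def server_index(index, request):
    """Server side: index from the JSON metadata only, never parsing the value."""
    metadata = json.loads(request["indexing_metadata"])
    for field_name, field_value in metadata.items():
        # Scalar field values are assumed so they can be used as index terms.
        index.setdefault((field_name, field_value), set()).add(request["key"])


def server_query(index, field_name, field_value):
    # Returns the binary keys of all entries whose metadata matched.
    return index.get((field_name, field_value), set())
```

Note how the sketch also shows the limitations raised in the reply that follows: only fields the client chose to send are searchable, and adding a new indexed field later means every client has to re-send its entries.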
The client would need to build a Lucene index itself and send it to the server. I guess
Sanne/Emmanuel can comment more on the complexity involved here.
Here are some limitations I see to this approach:
- cannot define an index at runtime. If we want to do that, the client would need to
storm all the data in the system and re-index it.
- cannot run a query for data that is not indexed. I think this is a pretty common
requirement as well.
> This approach does *not* solve your problem below, because for interop you will still
need a platform-independent serialisation mechanism like Avro or ProtoBufs for the object
<--> blob <--> object conversion.
Indeed. I think we should decide what approach we take and if we go for the former, not
even suggest Apache Avro but implement our own scheme.
See above. Why implement our own? Portable and efficient object serialisation is an
entire sub-field of computer science in itself; do we _really_ want to commit to building
and maintaining our own?
> Personally, I prefer the second approach since it separates
concerns (portable indexes vs. portable values) plus would lead to (IMO) a
better-performing implementation. I'd love to hear others' thoughts though.
I don't like the first approach because of the marshalling overhead. The former
You mean the latter?
seems complex, doesn't scale (requires the implementation of
indexing for every programming language) and limiting (indexes need to be defined a
priori, cannot query for non-indexed data).
>
> Cheers
> Manik
>
> On 10 Apr 2013, at 17:11, Mircea Markus <mmarkus(a)redhat.com> wrote:
>
>> That is, write the Person object in Java and read a Person object in C#, assuming a
hotrod client for simplicity.
>> Now at some point we'll have to run a query over the same hotrod, something
like "give me all the Persons named Mircea".
>> At this stage, the server side needs to be aware of the Person object in order to
be able to run the query and select the relevant Persons. It needs a schema. Instead of
suggesting Avro as a data interoperability protocol, we might want to define and use this
schema instead: we'd need it anyway for remote querying and we won't have two ways
of doing the same thing.
>> Thoughts?
>>
>> Cheers,
>> --
>> Mircea Markus
>> Infinispan lead (
www.infinispan.org)
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev(a)lists.jboss.org
>>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Manik Surtani
> manik(a)jboss.org
>
twitter.com/maniksurtani
>
> Platform Architect, JBoss Data Grid
>
http://red.ht/data-grid
>
>
Cheers,
--
Mircea Markus
Infinispan lead (
www.infinispan.org)
--
Manik Surtani
manik(a)jboss.org
twitter.com/maniksurtani
Platform Architect, JBoss Data Grid
http://red.ht/data-grid