[infinispan-dev] Remote Query improvements

Sanne Grinovero sanne at infinispan.org
Tue Feb 11 08:53:42 EST 2014


In my experience, people express some confusion on their first
encounter with our query/indexing technology, as there is a strong
conceptual difference compared to the more familiar relational
databases.

The primary WTF effect is usually that when a field included in a
query is not indexed, the query is not just "slower": it won't work at
all. We have plans to compensate for that in the scope of the
simplified DSL (and remote queries) by falling back to an ad-hoc
map/reduce task which essentially implements a table scan, but I'm now
thinking we should take it a step further and do better.
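Reduced to its essence, the fallback amounts to visiting every entry and evaluating the predicate against each value. A plain-Java stand-in (not the actual Map/Reduce API, just the idea) of that table scan:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Stand-in sketch for the map/reduce fallback: when no index covers a
// queried field, every cache entry is visited and the predicate is
// applied to each value, i.e. a full table scan.
public class TableScanFallback {
    public static <K, V> List<V> scan(Map<K, V> cache, Predicate<V> predicate) {
        return cache.values().stream()
                .filter(predicate)
                .collect(Collectors.toList());
    }
}
```

The point of the proposal is precisely to need this path as rarely as possible, since its cost grows linearly with the number of entries.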

Another source of trouble is that fields not only need to be indexed,
but need to be indexed with the correct attributes for the kind of
query you mean to run. In practice this means people need a very clear
idea of which queries they will be running, and over the lifecycle of
a complex application this becomes complex to maintain: to keep peak
performance you need to regularly clean up indexing flags related to
queries no longer in use.

Nowadays we do some validation of queries to catch cases which can't
possibly match the metadata we have about indexed fields, but this
validation needs to be quite permissive so as not to reject rare and
unusual advanced queries which are technically valid, even though they
are strong candidates for a misunderstanding.

This all leads to a single clean solution: if we start from a
declarative set of query definitions, in which each query carries the
specific extra metadata needed about its runtime execution (e.g.
using a specific Analyzer on a specific field, query-time boosting,
hints about good candidates for filters), then we can actually get rid
of the need to define the indexing attributes at the schema level.
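To make this concrete, a registered query definition could bundle the query string with its execution metadata. A minimal sketch in Java; `QueryDefinition` and all its fields are hypothetical names for illustration, not an existing Infinispan API:

```java
import java.util.Map;

// Hypothetical: a declarative query definition carrying its own
// execution metadata, so indexing attributes can be derived from the
// registered queries instead of being declared on the schema.
public record QueryDefinition(
        String name,                        // unique handle for register/unregister
        String queryString,                 // e.g. "from Book where author = :author"
        Map<String, String> fieldAnalyzers, // field -> analyzer to apply for this query
        Map<String, Float> fieldBoosts) {   // field -> query-time boost factor

    public static void main(String[] args) {
        QueryDefinition q = new QueryDefinition(
                "booksByAuthor",
                "from Book where author = :author",
                Map.of("author", "keyword"),
                Map.of());
        System.out.println(q.name() + " -> " + q.queryString());
    }
}
```

From a set of such definitions the server could compute the union of required index attributes per field, which is exactly the information we currently ask users to declare by hand.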

It would still be useful to keep the current explicit control of the
indexing process: for example you might be building an index which is
consumed by a different application, or you might rely on an advanced
data mining feature built on a custom Query/Filtering/Collector which
bypasses our helpful but constraining query definition strategy.

Following this proposal, we wouldn't need to bother with extending the
document metadata with indexing annotations (annotations in a non-Java
sense); instead we'd need to focus on a way to pre-declare all the
queries users intend to use.

I admit that this might sound limiting, but consider:
 - serialization of queries and all their potential advanced options
(not many in the remote case so far) needs to be done anyway, and has
to be language agnostic in any case.
 - we'd be able to better validate complex query structures
 - when a user registers/unregisters "query definitions" from the
server we have a better opportunity to:
 -- cache parsing
 -- cache execution plans
 -- track metrics to improve on the execution plans
 -- adapt the indexes automatically (immediately, or warn that this
needs to be done before the query is runnable)
 -- I suspect it would be easier to match queries with security ACLs,
both in terms of execution permission and in terms of scoping on a
subset of the visible data (essentially I'm thinking the execution
plans could be more advanced: they could prepare/hint about filter
caching and even adapt the indexing structure to better match the
security constraints).
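As a sketch of the registration lifecycle (a hypothetical API, not existing Infinispan code): the server could parse each definition once at registration time and cache the result, with deregistration as the natural hook for cleanup of plans and unused index attributes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical server-side registry: registering a query definition
// parses it once and caches the result; deregistering evicts it,
// giving the server a natural point to adapt indexes and plans.
public class QueryRegistry {

    // Stand-in for a real parsed/planned query representation.
    record ParsedQuery(String entity, String normalized) {}

    private final Map<String, ParsedQuery> cache = new ConcurrentHashMap<>();

    public ParsedQuery register(String name, String queryString) {
        // Parse once at registration time, not on every execution.
        return cache.computeIfAbsent(name, n -> parse(queryString));
    }

    public void unregister(String name) {
        cache.remove(name); // also the hook to clean up unused indexing flags
    }

    public ParsedQuery lookup(String name) {
        return cache.get(name);
    }

    // Trivial placeholder "parser": extracts the entity name after "from".
    private static ParsedQuery parse(String queryString) {
        String trimmed = queryString.trim();
        String[] tokens = trimmed.split("\\s+");
        String entity = tokens.length > 1 ? tokens[1] : "?";
        return new ParsedQuery(entity, trimmed.toLowerCase());
    }
}
```

The same registration point is where execution-plan caching, metrics tracking, and ACL matching from the list above would naturally hook in.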

# Essentially

We need to expose a standard, cross-language and declarative form of
the queries the user intends to run remotely, and provide a way to
register these queries on the server, where
registration/deregistration triggers certain actions.

This would not be mandatory, as you could still run ad-hoc queries,
but those would only take advantage of indexes which happen to exist
because of some registered query, or of no index at all.

I'm proposing that the format be, initially supporting only the
simple functions exposed by the remote DSL, a simple query String:
essentially the HQL we already use, but obviously limited to the base
constraints we need. This language will probably evolve in the future
*if* we ever want to expose fulltext over it as well.
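For illustration, pre-registered query strings in that HQL subset could look like the following; the entity, field, and parameter names are made up:

```java
// Hypothetical examples of query strings to pre-register, in the
// proposed HQL-subset format with named parameters bound at run time.
public class RegisteredQueries {
    static final String USERS_BY_NAME = "from sample.User where name = :name";
    static final String USERS_BY_AGE  = "from sample.User where age > :minAge and age < :maxAge";

    public static void main(String[] args) {
        System.out.println(USERS_BY_NAME);
        System.out.println(USERS_BY_AGE);
    }
}
```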

For the embedded query world (less of a priority) we could start
experimenting with richer and typesafe query definitions, to also
provide the benefits listed above.

-- Sanne

On 11 February 2014 09:18, Mircea Markus <mmarkus at redhat.com> wrote:
> I guess I put the solution before the problem, but basically where I want to get to is to allow people to write protostream marshallers without requiring them to write the proto file. This would mean the same effort for Java users to write either JBMAR marshallers or proto marshallers. If that's possible and protostream is as fast as JBMAR (do you have any perf numbers on that, BTW?) then we can suggest people use proto marshallers by default.
>
> On Feb 10, 2014, at 6:43 PM, Adrian Nistor <anistor at redhat.com> wrote:
>
>> The idea of auto-generating protobuf schemas based on the marshaller
>> code was briefly mentioned last time we met in Palma. I would not
>> qualify it as impossible to implement, but it would certainly be hacky
>> and lead to more trouble than it's worth.
>>
>> A lot of info is missing from the marshaller code (API calls) precisely
>> because it is not normally needed, being provided by the schema already.
>> Now trying to go backwards means we'll have to 'invent' that metadata
>> using some common sense (examples: which field is required vs optional,
>> which field is indexable, indexing options, etc). Too many options. I
>> bet the notion of 'common sense' would quickly need to be configured
>> somehow, for uncommon use cases :). But that's what we have protobuf
>> schemas for. Plus, to run a marshaller to infer the schema you'll
>> first need a prototypical instance of your entity. Where from? So no,
>> -1, now I have serious concerns about this, even though I initially
>> nodded in approval.
>>
>> And that would work only for Java anyway, because the marshaller and the
>> schema-inferring process need to run on the server side.
>>
>>
>> On 02/10/2014 07:34 PM, Mircea Markus wrote:
>>> On Feb 10, 2014, at 4:54 PM, Tristan Tarrant <ttarrant at redhat.com> wrote:
>>>
>>>> - since remote query is already imbued with JPA in some form, an
>>>> interesting project would be to implement a JPA annotation processor
>>>> which can produce a set of ProtoBuf schemas from JPA-annotated classes.
>>>> - on top of the above, a ProtoBuf marshaller/unmarshaller which can use
>>>> the JPA entities directly.
>>> I think it would be even more useful to infer the protobuf schema from the protostream marshaller: the marshaller is required in order to serialize objects into the proto format, and it has the advantage that it works even without Java.
>>>
>>> Cheers,
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> Cheers,
> --
> Mircea Markus
> Infinispan lead (www.infinispan.org)
>


