[infinispan-dev] [infinispan-internal] Continuous Queries

Mircea Markus mmarkus at redhat.com
Fri Oct 18 08:36:22 EDT 2013


On Oct 18, 2013, at 1:06 PM, Sanne Grinovero <sanne at infinispan.org> wrote:

> On 18 October 2013 12:12, Mircea Markus <mmarkus at redhat.com> wrote:
>> 
>> On Oct 17, 2013, at 11:29 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
>> 
>>> On 17 October 2013 20:19, Mircea Markus <mmarkus at redhat.com> wrote:
>>>> let's keep this on -dev.
>>> 
>>> +1
>>> 
>>>> On Oct 17, 2013, at 6:24 PM, Sanne Grinovero <sanne at redhat.com> wrote:
>>>>> ----- Original Message -----
>>>>>> 
>>>>>> On Oct 17, 2013, at 2:28 PM, Sanne Grinovero <sanne at redhat.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>>> On Oct 17, 2013, at 1:31 PM, Sanne Grinovero <sanne at redhat.com> wrote:
>>>>>>>> 
>>>>>>>>> With some custom coding it's certainly possible to define an event
>>>>>>>>> listener
>>>>>>>>> which triggers when an entry is inserted/removed which matches a certain
>>>>>>>>> Query.
>>>>>>>> 
>>>>>>>> where would you hold the query result? a cache perhaps?
>>>>>>> 
>>>>>>> Why do you need to hold on to the query result?
>>>>>>> I was thinking to just send an event "newly stored X matches query Q1".
>>>>>> 
>>>>>> You don't have a single process receiving all the notifications then, but
>>>>>> multiple processes in the cluster. It's up to the user to aggregate these
>>>>>> results (that's why I mentioned a cache) but without aggregation this
>>>>>> feature is pretty limiting.
>>>>> 
>>>>> I have no idea if it's limiting. For the use case I understood, that's pretty decent.
>>>> 
>>>> Here's my understanding of CQ[1]: a user queries a cache, say, 10,000,000 times per second (add as many zeros as you like).
>>>> Instead of executing the query every time (very resource consuming), the system caches the query result, updates it when the underlying data gets modified, and returns it to the user on every invocation. Optionally you can register a listener on the query result, but that's just API sugar.
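The caching-and-incremental-update behaviour described above can be sketched in plain Java. This is a minimal illustration, not Infinispan API; the class and method names are all invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Minimal sketch of the idea above: materialize the query result once,
// then keep it up to date on every write instead of re-executing the
// query per read. All names are invented; this is not Infinispan API.
public class ContinuousQueryCache<K, V> {
    private final Map<K, V> store = new HashMap<>();
    private final Predicate<V> query;                   // the registered filter
    private final Map<K, V> matches = new HashMap<>();  // cached query result

    public ContinuousQueryCache(Predicate<V> query) {
        this.query = query;
    }

    public void put(K key, V value) {
        store.put(key, value);
        if (query.test(value)) {
            matches.put(key, value);  // entry enters (or updates) the result
        } else {
            matches.remove(key);      // updated entry no longer matches
        }
    }

    public void remove(K key) {
        store.remove(key);
        matches.remove(key);
    }

    // Reads return a cheap snapshot; no query is executed here.
    public List<V> result() {
        return new ArrayList<>(matches.values());
    }
}
```

The listener-style notifications discussed in the thread could be layered on top by firing a callback wherever `matches` is mutated.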
>>> 
>>> That's an implementation detail, I need a use case.
>>> 
>>> Assuming you store a good amount of entries (maybe so many that you
>>> actually need a data grid instead of a simple HashMap or a USB
>>> stick), as a Query user I don't think I would always want to fetch
>>> all the data locally, when all I need is maybe to sound an alarm bell.
>>> 
>>> A use case could be that I'm interested in some stock, specifically I
>>> want to be notified ASAP of price changes for the stock traded on
>>> market "Neverland", so I register a continuous query "from stock where
>>> stock.market = 'Neverland' ".
>>> Let's also assume that Neverland trades approximately 5,000 titles.
>>> 
>>> My application starts and fetches all current values with a one-off
>>> full query (using that same query), so I fetch all 5,000 locally. Next
>>> step, I want to be notified ASAP when one of these change value, so
>>> that I can react on it.
>>> Then I get my first notification! Cool, my nice List API provides me
>>> with the new values for all 5,000 titles.. which one changed? Let me
>>> find out: I can scan my previous results and compare..
>>> (Note that I'm not even getting into the detail of how we got all
>>> those titles locally: using deltas or not is irrelevant).
>>> 
>>> That's certainly doable, but what if you have more than 5,000 titles?
>>> It degenerates. Of course you could wrap this "resultset" in some
>>> more syntactic sugar, but essentially what you need to implement the
>>> client-side API is to receive the single events.
>>> 
>>> I'm not focusing on the client side sugar because of Divya's original question:
>>> "a feasible path to achieve this functionality via some custom
>>> coding, even though it is not the most efficient path (because
>>> Continuous Queries are not available out of the box)."
>>> 
>>> From a very different perspective, look at it in terms of a scalable
>>> architecture: when dealing with large amounts of data, the List
>>> interface is conceptually not cutting it; I would expect you to ban
>>> it, not to encourage it.
>>> Assuming the client is also designed as a properly scalable system,
>>> if you were to provide it with a List this would likely need to
>>> iterate on it to forward each single element as a task to some
>>> parallel executor. It's much simpler if you push them one by one: it
>>> could still wrap each in a task, but you cut the latency which you
>>> would otherwise introduce by collecting all single items, and you can
>>> allow users to insert a load balancer between your crazy scalable
>>> event generator and the target of these notifications.
>>> 
>>> (Because really, if you set up such a feature on a large grid, it
>>> will become a crazy scalable event generator)
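The push model argued for above can be sketched as follows; the class name, the pool size, and the handler shape are assumptions for illustration, not an Infinispan API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Sketch of the push model: each matching event is handed off as it
// happens, so the client dispatches work per event instead of scanning
// a full List to find what changed. Illustrative names only.
public class EventPusher<E> {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger processed = new AtomicInteger(); // observability
    private final Consumer<E> handler;

    public EventPusher(Consumer<E> handler) {
        this.handler = handler;
    }

    // Called once per matching entry; no aggregation, no waiting for a
    // complete result set before reacting.
    public void onMatch(E event) {
        pool.submit(() -> {
            handler.accept(event);
            processed.incrementAndGet();
        });
    }

    public int processedCount() {
        return processed.get();
    }

    public void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A load balancer or message bus could sit between `onMatch` and the handler without changing this shape.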
>>> 
>>>>>>> You could register multiple such listeners, getting the effect of "newly
>>>>>>> stored entry X matches Query set {Q1, Q3, Q7}"
>>>>>> 
>>>>>> The listeners would not be collocated.
>>>>> 
>>>>> I'm not going to implement distributed listeners; I indeed expect you to register such a listener on each node.
>>>> 
>>>> If I run a query, continuous or not, I'd expect to be able to get the whole result set of that query on the process on which I invoke it. Call me old-fashioned :-)
>>>> 
>>>>> 
>>>>> I can show how to build Continuous Queries on top of the Query API to accomplish this.
>>>> 
>>>> I wouldn't name the problem your solution solves "Continuous Query" :-)
>>>> 
>>>>> Anything else is out of scope for me :-) Technically I think it's out of scope for Infinispan too; it should delegate to a message bus.
>>>> 
>>>> -1, for the reasons mentioned above.
>>>> 
>>>> [1] http://coherence.oracle.com/display/COH31UG/Continuous+Query
>>> 
>>> Do you realize this page is confirming that a List is fundamentally
>>> wrong :-) It lists a bunch of fallacies to explain common errors,
>>> which all boil down to an attempt to iterate over the entries, and
>>> then states:
>>> 
>>> "The solution is to provide the listener during construction, and it
>>> will receive one event for each item that is in the Continuous Query
>>> Cache, whether it was there to begin with (because it was in the
>>> query) or if it got added during or after the construction of the
>>> cache"
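The construction-time-listener pattern quoted above can be sketched like this. The names are hypothetical; the point is that replayed and live events share one code path, so no update falls into a gap between the initial query and listener registration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Sketch: the listener is supplied at construction time and receives one
// event per entry that already matches, then one event per later matching
// insert. Names are invented; this is not Coherence or Infinispan API.
public class ListenerFirstCQ<K, V> {
    private final Map<K, V> matches = new HashMap<>();
    private final Predicate<V> query;
    private final BiConsumer<K, V> listener;

    public ListenerFirstCQ(Map<K, V> existing, Predicate<V> query,
                           BiConsumer<K, V> listener) {
        this.query = query;
        this.listener = listener;
        // Replay: one event for each pre-existing entry matching the query.
        existing.forEach((k, v) -> {
            if (query.test(v)) {
                add(k, v);
            }
        });
    }

    // Called for entries stored after construction (removals omitted for brevity).
    public void onInsert(K key, V value) {
        if (query.test(value)) {
            add(key, value);
        }
    }

    private void add(K key, V value) {
        matches.put(key, value);
        listener.accept(key, value);  // push the single event, never a List
    }
}
```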
>>> 
>>> Finally, a consistency consideration on how to create such a list: if
>>> you get multiple events in a short time, you'll never know which one
>>> is correct because of the interleaving of the notifications. There is
>>> no way to iterate (search) a list of results in Infinispan in a
>>> consistent transactional view, unless you want me to lock all entries
>>> and repeat the query to confirm.
>> 
>> For many users, getting a snapshot result is good enough. After all, this is how relational databases are queried.
>> 
>>> By NOT providing a List access, you avoid the
>>> problem of consistency and don't introduce contentions points like
>>> "aggregating it all in one placeholder".
>> 
>> Well, Coherence supports both a List (the CQ Cache itself) and an event-based approach, events being the preferred way when you don't want to miss any updates to the result set.
>> Also very important: the mechanism you described doesn't offer this consistency guarantee either (e.g. between the time the user runs the query and the time he registers the listeners, things might change).
> 
> That's what I said: you can't make a List in that time, but the event
> happened so it's fair to notify about it.
> 
>> Another (fundamental IMO) limitation of the approach we can offer is the locality of the notifications: the initial query executes on node A, but future notifications of other elements matching the query criteria arrive on nodes B, C etc.
>> 
>>> Also interesting from Coherence's wiki: they have their results
>>> implement InvocableMap, essentially a representation of a conceptual
>>> data partition on which you can then invoke operations, by moving
>>> execution to the data. I think that's brilliant, and makes it quite
>>> clear that no such list is sent to the client.
>> 
>> Not really, the cache itself is the list :-)
> 
> That sounds very confusing to me, the cache is definitely not a list.
> If you mean to point out that it "represents" a local view of all
> data,

yes :-)

> that's fishy as it either contains a copy of all data (not nice
> when it's large)

Not if you only keep the set of keys locally and fetch the values (you might not even need them) on demand.
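That keys-only view could look roughly like this; a `Function` stands in for the remote cache `get()` RPC, and all names are invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Sketch of the keys-only variant: the local view stores only the keys of
// entries matching the query, and fetches a value on demand through a
// lookup function standing in for the remote cache get(). Illustrative only.
public class KeyOnlyView<K, V> {
    private final Set<K> matchingKeys = new HashSet<>();
    private final Function<K, V> remoteLookup;  // stand-in for an RPC

    public KeyOnlyView(Function<K, V> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    public void keyMatched(K key) {
        matchingKeys.add(key);
    }

    public void keyUnmatched(K key) {
        matchingKeys.remove(key);
    }

    // One remote lookup per value, and only when the value is needed.
    public V value(K key) {
        return matchingKeys.contains(key) ? remoteLookup.apply(key) : null;
    }

    public int size() {
        return matchingKeys.size();
    }
}
```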

> or it's a proxy which will be extremely slow by
> "lazy-loading" each entry.

Indeed you might need to fetch the value for a key with an RPC. I wouldn't call that extremely slow; after all, it's just a cache lookup.

> The InvocableMap approach sounds far more
> interesting in terms of locality.

It's still something that will go remote on every invocation. If you need to do that very often (a few thousand times a second), it's better to cache the results locally.

> 
>> 
>> I don't think that with what we currently have we're that close to CQ caches as the industry "defines" them. If this listener-based distributed notification mechanism proves useful, then very good. But I would refrain from marketing it as CQ support, as that would create false expectations.
> 
> Happy to not do it!

I don't think the query API extension you mention is critical here, as the filtering logic can be expressed directly in Java (which might actually be more convenient/flexible).
Looking around, the CQ functionality that's missing in ISPN is:
- offer a way to receive all the notifications in the same VM
- offer a way to cache the result (might be keys only) in order to avoid executing the same query very often
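Both missing pieces could be sketched together, under the assumption that each node forwards its local match/unmatch notifications to one chosen VM; here a `BlockingQueue` stands in for that cluster channel, and all names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the two missing pieces: remote nodes publish match/unmatch
// events to one aggregating VM (the queue stands in for the cluster
// transport), which maintains the cached key set so the query itself is
// never re-executed. Names invented; not an Infinispan API.
public class MatchAggregator {
    private static final class Event {
        final String key;
        final boolean matches;
        Event(String key, boolean matches) {
            this.key = key;
            this.matches = matches;
        }
    }

    private final BlockingQueue<Event> channel = new LinkedBlockingQueue<>();
    private final Set<String> resultKeys = ConcurrentHashMap.newKeySet();

    // Called on any node when an entry starts or stops matching the query.
    public void publish(String key, boolean matches) {
        channel.add(new Event(key, matches));
    }

    // Runs on the aggregating VM: drain pending events into the key set.
    public void drain() {
        Event e;
        while ((e = channel.poll()) != null) {
            if (e.matches) {
                resultKeys.add(e.key);
            } else {
                resultKeys.remove(e.key);
            }
        }
    }

    public Set<String> resultKeys() {
        return resultKeys;
    }
}
```

Values could then be fetched on demand from the owning node, as discussed earlier in the thread.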

Let's continue our chat on this next week ;)
 
Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)

