My impression is that this thread is boiling down to an identity
problem of the project.
It is a key/value store with properties which make it a great fit for a cache. I don't think there is any doubt that relational queries are useful for solving a certain category of problems, and no doubt someone would like to run them on Infinispan, but it should be stated clearly that this is beyond the scope of the project: we're not in the business of relational queries, because the fundamental building blocks are not fit for it. So this should be a clear no-go, as it should be for many other features which will be requested over time: just because we could implement some part of it doesn't make the project any better.
That said, it is possible for us to provide some additional features
which are a good fit to build on top of these fundamental building
blocks, and also happen to make the fundamental API much more useful.
I'm referring to the simple queries we have today, the various remoting protocols, management extensions, better eviction policies, and so on. So I'm not questioning the many nice features we have today, but it becomes a dangerous game if the limits of what the project can or cannot do start to blur.
To give an example: four years ago I suggested that Infinispan should not expose the full Map interface without actually implementing it in its full glory, and people regularly bang their heads against it. The limitations are spelled out in the javadoc, but the type is still compatible with Map, so sometimes the client code wasn't written with Infinispan in mind. It's just one example, but it is dangerous to expose a contract which we can't honour in all its intricate, complex forms.
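To make that concrete, here is a minimal sketch (the Account type and the totalling helper are hypothetical, not code from this thread). Written against plain java.util.Map it looks perfectly fine, but handed a distributed Cache of the versions discussed here, values() only exposed the local node's entries, so the total is silently wrong:

import java.util.Map;

// Sketch only: a client utility written against plain java.util.Map,
// with no knowledge that the Map passed in may be an Infinispan Cache.
public class BalanceReport {

    static final class Account {
        final long balance;
        Account(long balance) { this.balance = balance; }
    }

    // Fine for a HashMap; surprising for a distributed Cache, where (in the
    // Infinispan versions discussed in this thread) values() and size() only
    // reflected the local node's data, as the javadoc of the time warned.
    static long totalBalance(Map<String, Account> accounts) {
        long total = 0;
        for (Account a : accounts.values()) {
            total += a.balance;
        }
        return total;
    }
}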
To quote Emmanuel: "So back to the root of that discussion now, what was the use case(s) that led you to believe we need cross-cache query support?"
My original email had several bullet points, and none of them are related to use cases; they are rather internal efficiency concerns about sharing Search services, bootstrap time, and configuration simplifications. The configuration one used to be quite painful but has improved recently, and for most of these points we can find alternative or even better solutions.
I even have additional reasons not to do it: as previously discussed, Map/Reduce tasks should be able to output their results to a different cache. That's reasonable, as you need to "park" a potentially large output somewhere, but it doesn't imply cross-cache M/R queries, which would be a requirement for this feature to be consistent. If anything, updates on different Caches - and therefore their content - should be considered highly unrelated: they might be using a different distribution, a different cache store, a different Transaction. If any kind of soft-join were "implied" across caches, we would never be able to maintain consistency, a causality order (in the future), a consistent backup/snapshot, etc.: a nightmare for users, and for us to firefight "bugs".
So my conclusion is that it's dangerous, confusing, and also complex
to achieve: we won't implement it.
Sanne
On 4 March 2014 18:02, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:
On 28 Feb 2014, at 22:14, Mircea Markus <mmarkus(a)redhat.com> wrote:
On Feb 26, 2014, at 5:14 PM, Emmanuel Bernard <emmanuel(a)hibernate.org>
wrote:
On 25 Feb 2014, at 16:08, Mircea Markus <mmarkus(a)redhat.com> wrote:
On Feb 25, 2014, at 9:28 AM, Emmanuel Bernard <emmanuel(a)hibernate.org>
wrote:
On 24 févr. 2014, at 17:39, Mircea Markus <mmarkus(a)redhat.com> wrote:
On Feb 17, 2014, at 10:13 PM, Emmanuel Bernard <emmanuel(a)hibernate.org>
wrote:
By the way, Mircea, Sanne and I had quite a long discussion about this one
and the idea of one cache per entity. It turns out that the right (as in
easy) solution does involve a higher level programming model like OGM
provides. You can simulate it yourself using the Infinispan APIs but it is
just cumbersome.
Curious to hear the whole story :-)
We cannot mandate that all users use OGM though, one of the reasons being that OGM is not platform independent (Hot Rod).
Then solve all the issues I have raised with a magic wand and come back to
me when you have done it, I'm interested.
People are going to use Infinispan with one cache per entity, because it makes sense:
- different config (repl/dist | persistent/non-persistent) for different data types
- have map/reduce tasks running only on the Person entries, not on Dog as well, when you want to select (Person) where age > 18 (see the sketch below)
I don't see a reason to forbid this, on the contrary. The way I see it, the relation is (OGM, ISPN) <=> (Hibernate, JDBC). Indeed OGM would be a better abstraction and should be recommended as such for Java clients, but ultimately we're a general purpose storage engine that is available to different platforms as well.
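As an illustration (not committed API or code from this thread), a minimal sketch of that "select (Person) where age > 18" task against the MapReduceTask API of the time, assuming a dedicated "people" cache holding only Person values:

import java.io.Serializable;
import java.util.Iterator;
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.distexec.mapreduce.Collector;
import org.infinispan.distexec.mapreduce.MapReduceTask;
import org.infinispan.distexec.mapreduce.Mapper;
import org.infinispan.distexec.mapreduce.Reducer;

public class AdultsReport {

    public static class Person implements Serializable {
        public final String name;
        public final int age;
        public Person(String name, int age) { this.name = name; this.age = age; }
    }

    // Because this cache holds only Person entries, the task never sees Dog
    // (or any other type) and needs no knowledge of them.
    public static Map<String, String> adults(Cache<String, Person> people) {
        return new MapReduceTask<String, Person, String, String>(people)
            .mappedWith(new AdultMapper())
            .reducedWith(new FirstValueReducer())
            .execute();
    }

    static class AdultMapper implements Mapper<String, Person, String, String> {
        @Override
        public void map(String key, Person p, Collector<String, String> c) {
            if (p.age > 18) {          // select (Person) where age > 18
                c.emit(key, p.name);
            }
        }
    }

    static class FirstValueReducer implements Reducer<String, String> {
        @Override
        public String reduce(String key, Iterator<String> names) {
            return names.next();       // one value per key; nothing to combine
        }
    }
}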
I do disagree with your assessment.
I did write a whole essay on why I think your view is problematic - I was
getting tired of repeating myself ;P
https://github.com/infinispan/infinispan/wiki/A-continuum-of-data-structu...
Thanks for writing this up, it is a good taxonomy of data storage schemes
and querying.
To anecdotally answer your specific example: yes, different configs for different entities are an interesting benefit, but it has to outweigh the drawbacks.
Using a single cache for all the types is not practical at all :-) Just to expand my idea, people prefer using different caches for many reasons (see the configuration sketch after this list):
- security: the Account cache has different security requirements than the News cache
- data consistency: News is a non-transactional cache, Accounts require pessimistic XA transactions
- expiry: expire last year's news from the system. Not the same for Accounts
- availability: I want the Accounts cache to be backed up to another site. I don't want that for the News cache
- logical data grouping: mixing Accounts with News doesn't make sense. I might want to know which account appeared in the news, though.
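For illustration, a minimal sketch of what such per-entity configurations could look like with the embedded ConfigurationBuilder API; the cache names and settings are hypothetical and only mirror the bullets above, and the cross-site backup and security pieces are omitted:

import java.util.concurrent.TimeUnit;

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.transaction.LockingMode;
import org.infinispan.transaction.TransactionMode;

public class PerEntityCaches {

    public static DefaultCacheManager start() {
        DefaultCacheManager manager =
            new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // Accounts: distributed, pessimistic transactions, no expiry.
        ConfigurationBuilder accounts = new ConfigurationBuilder();
        accounts.clustering().cacheMode(CacheMode.DIST_SYNC)
                .transaction().transactionMode(TransactionMode.TRANSACTIONAL)
                              .lockingMode(LockingMode.PESSIMISTIC);

        // News: replicated, non-transactional, entries expire after 30 days.
        ConfigurationBuilder news = new ConfigurationBuilder();
        news.clustering().cacheMode(CacheMode.REPL_SYNC)
            .expiration().lifespan(30, TimeUnit.DAYS);

        manager.defineConfiguration("accounts", accounts.build());
        manager.defineConfiguration("news", news.build());
        return manager;
    }
}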
These kinds of reasons remind me of why, in the RDBMS world, people use different databases. In fact, I have had experience where News literally was a different database than Accounts. But again, in this model, within one database you have many tables.
If you have to do a map/reduce for tasks as simple as age > 18, I think your system had better be prepared to run gazillions of M/R jobs.
I want to run a simple M/R job in the evening to determine who turns 18 tomorrow, to congratulate them. Once a day, not gazillions of times, and I don't need to index the age field just for that. Also, when it comes to Map/Reduce, the drawback of holding all the data in a single cache is twofold:
- performance: you iterate over data that is not related to your query.
If the data are never related (query-wise), then we are in the database split category, which is fine. But if some of your queries are related, what do you do? Deny the user the ability to do them?
- programming model: the Map/Reduce implementation has a dependency on both Dog and Person. If I add Cats to the cache, I'll need to update the M/R code to be aware of them as well; same if I rename or remove Dog. Not nice (see the sketch below).
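A hedged sketch of that drawback, assuming the hypothetical Person and Dog classes from this thread share a single cache: the mapper's value type degrades to Object, every unrelated entry still gets iterated, and the filter has to know about (and skip) every stored type.

import java.io.Serializable;

import org.infinispan.distexec.mapreduce.Collector;
import org.infinispan.distexec.mapreduce.Mapper;

class Person implements Serializable {
    final String name;
    final int age;
    Person(String name, int age) { this.name = name; this.age = age; }
}

// The value type is Object because the cache mixes Person, Dog, ... entries.
class MixedCacheAdultMapper implements Mapper<String, Object, String, String> {

    @Override
    public void map(String key, Object value, Collector<String, String> c) {
        // Every unrelated entry is still deserialized and inspected here, and
        // adding Cat (or renaming/removing Dog) means revisiting this code.
        if (value instanceof Person) {
            Person p = (Person) value;
            if (p.age > 18) {
                c.emit(key, p.name);
            }
        }
        // Dog (and anything else) is silently skipped.
    }
}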
Well it’s called type safety, some people find it good ;)
By the way, OGM does abstract a class from its representation in the datastore (including its name). But that’s another story ;)
I think that Dogs, and domestic animals in general, are fundamentally related to humans - Person in your case. So queries involving both will be required - a cross-cache M/R is not doable today AFAIK, and even if it were, it’s still M/R with all its drawbacks.
To me, the Cache API and Hot Rod are well suited for what I call a self-contained object graph (i.e. where Dog would be an embedded object of Person and not a separate Entity). In that situation, there is a single cache.
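A minimal sketch of that self-contained shape, with Dog as an embedded value inside Person so the whole graph lives in one cache entry (the class names and key are the hypothetical ones from this thread):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.infinispan.Cache;

public class PersonGraph {

    // Dog is a plain embedded value, not a separate entity: it has no cache
    // key of its own and is only reachable through its owner.
    public static class Dog implements Serializable {
        public final String name;
        public Dog(String name) { this.name = name; }
    }

    public static class Person implements Serializable {
        public final String name;
        public final int age;
        public final List<Dog> dogs = new ArrayList<Dog>();
        public Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void store(Cache<String, Person> people) {
        Person anna = new Person("Anna", 32);
        anna.dogs.add(new Dog("Rex"));

        // The whole graph is one cache entry; there is nothing to join.
        people.put("person:anna", anna);
    }
}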
I see where you're coming from, but I don't think requiring people to use a single cache for all the entities is an option. Besides the natural logical separation, different data has different storage requirements: security, access patterns, consistency, durability, availability, etc. For most non-trivial use cases, using a single cache just won't do.
Let me rephrase and sum up my position.
If you are storing unrelated data, use different caches if you want, that’s fine.
If you are storing related data, store it as one root entity plus embeddable objects (i.e. one cache entry for the whole graph); you can have one root entity per cache, that’s fine.
If you are storing related entities and want to do queries on them: you are more or less screwed today with Infinispan and need a higher level abstraction.
So _recommending_ one entity = one cache to me is wrong. It’s more one entity graph = one cache, which is vastly different and has deep consequences (see my wiki page).
One cache per entity does make sense for APIs that do support what I call connected entities - Hibernate OGM specifically.
OGM does a great job covering this, but it is very specific: Java only and OOP. Our C/S mode, Hot Rod specifically, is language independent and not OOP.
Sure. Outside of Java, consider that connected objects are not a supported feature of Infinispan. By "not supported", I mean:
- no inter-object queries
- no automatically handled relations between objects
Also I would like to comment on the following statements:
"I believe a cache API and Hot Rod are well suited to address up to the self
contained object graph use case with a couple of relations maintained
manually by the application but that cannot be queried. For the connected
entities use case, only a high level paradigm is suited like JPA."
I don't think storing object graphs should be under scrutiny here: Infinispan C/S mode (and that's where most of the client focus is, BTW) has a schema (Protobuf) that does not support object graphs. I also think expecting people to use multiple caches for multiple data types is a solid assumption to start from. And here's me speculating: these data types have logical relations between them, so people will ask for querying. In order to query multiple data types, you can either merge them together (your suggestion) or support some sort of new cross-cache indexing/querying/API. Cross-cache querying is more flexible and less restrictive than merging data, but from what I understand from you it has certain implementation challenges. There's no pressure to take a decision now around supporting queries spanning multiple caches - just something to keep an eye on when dealing with use cases/users. ATM merging data is the only solution available, so let's wait and see if people ask for more.
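To make the "relations maintained manually by the application" situation concrete over Hot Rod, here is a hedged sketch with two remote caches holding plain String values; the cache names, keys, and the idea of storing the related key as an extra entry are all hypothetical, and nothing here can be queried across the two caches:

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

public class ManualRelation {

    public static void main(String[] args) {
        RemoteCacheManager rcm = new RemoteCacheManager(
            new ConfigurationBuilder().addServer().host("127.0.0.1").port(11222).build());

        // One remote cache per data type, as discussed above.
        RemoteCache<String, String> accounts = rcm.getCache("accounts");
        RemoteCache<String, String> news = rcm.getCache("news");

        // The "relation" is just a key stored by the application ...
        accounts.put("account:42", "Anna");
        news.put("news:7", "Local hero saves dog");
        news.put("news:7:subject", "account:42");   // manual link to the accounts cache

        // ... and the "join" is two client-side lookups; there is no
        // cross-cache query that could resolve this on the server.
        String subjectKey = news.get("news:7:subject");
        String subjectName = accounts.get(subjectKey);
        System.out.println(subjectName + ": " + news.get("news:7"));

        rcm.stop();
    }
}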
That’s fine I guess. My problem is that nowhere in your documentation do you guys list what can and cannot be done in that regard. And when you call a data structure an entity, it comes with implicit baggage: relations between entities, polymorphism, etc.
So this needs to be clarified.
Now circling back to the main topic. Doing cross-cache queries on non-related data (as they are in different caches) might be useful in some situations, but is generally not interesting as we specifically don’t support joins (they are not related).
So back to the root of that discussion now, what was the use case(s) that led you to believe we need cross-cache query support?
Emmanuel