[infinispan-dev] Infinispan and change data capture

Randall Hauch rhauch at redhat.com
Wed Dec 14 12:21:58 EST 2016


> On Dec 14, 2016, at 7:58 AM, Sanne Grinovero <sanne at infinispan.org> wrote:
> 
> On 12 December 2016 at 17:56, Gustavo Fernandes <gustavo at infinispan.org> wrote:
>> On Mon, Dec 12, 2016 at 3:13 PM, Sanne Grinovero <sanne at infinispan.org>
>> wrote:
>> 
>>> 
>>> In short, what's the ultimate goal? I see two main but distinct
>>> options intertwined:
>>> - allow synchronizing the *final state* of a replica
>> 
>> 
>> I'm assuming this case is already covered by remote listeners with
>> includeCurrentState=true, and that we are discussing how to improve it,
>> as described in the proposal on the wiki and in the 5th email of this
>> thread.
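
For reference, this is roughly how such a remote listener looks on the Hot
Rod Java client. It's only a minimal sketch to ground the discussion: the
cache name "prices" is made up, and marshalling, failover, and error
handling are omitted.

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.annotation.ClientCacheEntryCreated;
import org.infinispan.client.hotrod.annotation.ClientCacheEntryModified;
import org.infinispan.client.hotrod.annotation.ClientCacheEntryRemoved;
import org.infinispan.client.hotrod.annotation.ClientListener;
import org.infinispan.client.hotrod.event.ClientCacheEntryCreatedEvent;
import org.infinispan.client.hotrod.event.ClientCacheEntryModifiedEvent;
import org.infinispan.client.hotrod.event.ClientCacheEntryRemovedEvent;

public class ChangeCaptureExample {

   // includeCurrentState=true asks the server to replay the existing entries
   // as "created" events before it starts streaming live changes.
   @ClientListener(includeCurrentState = true)
   public static class ChangeListener {

      @ClientCacheEntryCreated
      public void created(ClientCacheEntryCreatedEvent<String> e) {
         System.out.println("created: " + e.getKey());
      }

      @ClientCacheEntryModified
      public void modified(ClientCacheEntryModifiedEvent<String> e) {
         System.out.println("modified: " + e.getKey());
      }

      @ClientCacheEntryRemoved
      public void removed(ClientCacheEntryRemovedEvent<String> e) {
         System.out.println("removed: " + e.getKey());
      }
   }

   public static void main(String[] args) throws Exception {
      try (RemoteCacheManager rcm = new RemoteCacheManager()) {
         RemoteCache<String, String> cache = rcm.getCache("prices");
         // Current state is delivered first, then live events.
         cache.addClientListener(new ChangeListener());
         System.in.read(); // keep the client alive so events can arrive
      }
   }
}
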
>> 
>>> 
>>> - inspect specific changes
>>> 
>>> For the first case, it would be enough for us to be able to provide a
>>> "squashed history" (as in Git squash), but we'd need to keep versioned
>>> snapshots around, and someone needs to tell you which ones can be
>>> garbage collected.
>>> For example, when a key is written, updated, updated, and deleted since
>>> the snapshot, we'll send only "deleted", as the intermediate states are
>>> irrelevant.
>>> For the second case, say the goal is to inspect fluctuations in the
>>> price of some item; then the intermediate states do matter.
>>> 
>>> Which one do we want to solve? Both?
>> 
>> 
>> 
>> Looking at http://debezium.io/, it implies the second case.
> 
> That's exactly what I'm asking; it needs to be clarified.
> 
> If it's the second case, then while I appreciate the value of such a
> system I don't see it as a good fit for Infinispan.

If Infinispan were to allow a client to consume (within a reasonable amount of time) an event for every change, then Debezium could certainly capture those events into a stream that is persisted for a much longer period of time.

OTOH, I think it’s reasonable for Infinispan to squash history, as long as doing so doesn’t reorder changes and at least the last change is kept. Debezium can still work with that.
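
To be concrete about what I mean by squashing, here's a minimal, purely
illustrative sketch (plain Java, nothing Infinispan-specific): collapse a
change log to the most recent change per key, keeping the surviving changes
in the order in which they last occurred.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SquashSketch {

   // A change event: a key plus its new value; a null value means a delete.
   record Change(String key, String value) {}

   static List<Change> squash(List<Change> history) {
      Map<String, Change> lastPerKey = new LinkedHashMap<>();
      for (Change c : history) {
         // Remove-then-put moves the key to the tail, so the survivors stay
         // ordered by the time of their *last* change (no reordering).
         lastPerKey.remove(c.key());
         lastPerKey.put(c.key(), c);
      }
      return new ArrayList<>(lastPerKey.values());
   }

   public static void main(String[] args) {
      List<Change> history = List.of(
            new Change("a", "1"),    // a written
            new Change("a", "2"),    // a updated
            new Change("b", "9"),    // b written
            new Change("a", null));  // a deleted
      // Prints only the surviving changes: b's write followed by a's delete.
      System.out.println(squash(history));
   }
}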

> 
>> 
>> "[...] Start it up, point it at your databases, and your apps can start
>> responding to all of the inserts, updates,
>> and deletes that other apps commit to your databases. [...] your apps can
>> respond quickly and never miss an event,
>> even when things go wrong."
>> 
>> IMO the choice between squashed and full history, and even the retention
>> time, is highly application specific. Deletes might not even be involved;
>> one may be interested in answering "what is the peak value of a certain
>> key during the day?"
> 
> Absolutely. And Infinispan might need to draw a line and clarify which
> problems it is meant to solve, and which problems are better solved
> with a different solution.

+1. Just be clear about what the listeners will and won’t see.

And I guess we need to clarify what “never miss an event” means for Debezium: we capture every event that a source system exposes to us and will not lose any of them, but if Kafka log compaction is enabled, then when replaying you’re only guaranteed to see at least the most recent change for every key.
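
In Kafka terms that trade-off is just topic configuration. A rough sketch
using the Kafka AdminClient follows; the topic name, partition count, and
replication factor are made up for illustration.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {

   public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

      try (AdminClient admin = AdminClient.create(props)) {
         // cleanup.policy=compact: Kafka may discard older records for a key,
         // but always retains at least the most recent record per key.
         NewTopic topic = new NewTopic("infinispan.changes", 3, (short) 1)
               .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                               TopicConfig.CLEANUP_POLICY_COMPACT));
         admin.createTopics(Collections.singleton(topic)).all().get();
      }
   }
}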

> 
> 
>>> Personally, attempting to solve the second one seems like a huge
>>> pivot of the project; the current data structures and storage are not
>>> designed for this.
>> 
>> 
>> +1, as I wrote earlier about ditching the idea of event cache storage in
>> favor of Lucene.
> 
> Yes, that's a great idea, but I'd first like to discuss where we want to get.
> 
>>> I see the value of such capabilities, but maybe
>>> Infinispan is not the right tool for this kind of problem.
>>> 
>>> I'd prefer to focus on the benefits of the squashed history, and have
>>> versioned entries soon, but even in that case we need to define which
>>> versions need to be kept around, and how garbage collection /
>>> vacuuming is handled.
>> 
>> 
>> Is that proposal written/recorded somewhere? It'd be interesting to know
>> how a client interested in data changes would consume those multi-versioned
>> entries (push or pull with an offset? sorted or unsorted? tracked per
>> client, per key, or per version?), as it seems there is some storage
>> impedance as well.
>> 
>>> 
>>> 
>>> In short, I'd like to see an agreement that analyzing e.g.
>>> fluctuations in stock prices would be a non-goal, if these are stored
>>> as {"stock name", value} key/value pairs. One could still implement
>>> such a thing by using a more sophisticated model; just don't expect to
>>> be able to see every intermediate value each entry has ever had since
>>> the key was first used.
>> 
>> 
>> 
>> Continuous Queries listen to key/value data using a query; shouldn't they
>> be expected to see all the intermediate values when changes on the server
>> cause an entry to start or stop matching the query?
> 
> That's exactly the doubt I'm raising: I'm not sure we set that
> expectation, and if we did then I don't agree with that choice; I
> remember voicing concerns about the feasibility of such aspects of CQ
> during the early design.
> I might be in the minority, but whatever the decision was, I don't think
> it is currently clear or properly documented.
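
For reference, the remote continuous query API is roughly the sketch below.
The entity, cache name, and query string are made up, and a real deployment
would also need a registered ProtoStream schema and indexing configuration,
which I'm omitting. The point is just the three callbacks: a listener is told
when an entry starts matching, keeps matching with a new value, or stops
matching.

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.Search;
import org.infinispan.query.api.continuous.ContinuousQuery;
import org.infinispan.query.api.continuous.ContinuousQueryListener;
import org.infinispan.query.dsl.Query;
import org.infinispan.query.dsl.QueryFactory;

public class PriceWatchSketch {

   // Made-up value type; a real remote setup also needs marshalling/schema.
   public static class Stock {
      public String name;
      public double price;
   }

   public static void main(String[] args) throws Exception {
      try (RemoteCacheManager rcm = new RemoteCacheManager()) {
         RemoteCache<String, Stock> stocks = rcm.getCache("stocks");

         QueryFactory qf = Search.getQueryFactory(stocks);
         Query query = qf.create("FROM sample.Stock WHERE price > 100");

         ContinuousQueryListener<String, Stock> listener =
               new ContinuousQueryListener<String, Stock>() {
                  @Override
                  public void resultJoining(String key, Stock value) {
                     // entry started matching the query
                  }

                  @Override
                  public void resultUpdated(String key, Stock value) {
                     // entry changed but still matches
                  }

                  @Override
                  public void resultLeaving(String key) {
                     // entry stopped matching (or was removed)
                  }
               };

         ContinuousQuery<String, Stock> cq = Search.getContinuousQuery(stocks);
         cq.addContinuousQueryListener(query, listener);

         System.in.read(); // keep the client alive so callbacks can arrive
      }
   }
}
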
> 
> If one needs to store a significant sequence of values, organised by
> "keys" (aka partitions), that pretty much suggests the need for Kafka
> itself, rather than an integration with Kafka, or perhaps, depending on
> the use case, a time-series database.
> 
> Kafka is more evolved in this area, and yet even in that case I'm
> confident that an unbounded history would not be a reasonable
> expectation; Kafka, however, treats the management of such boundaries -
> and history compaction policies - as first-class concepts, both in its
> APIs and in its integration / extension points.
> 
> That's not to say we don't need any form of history; we discussed
> loads of improved protocols over the years which would benefit from
> versioned entries and tombstones, but we've always assumed that
> "history boundaries" and compaction strategies would be managed as
> internal implementation details - at most to help define the ordering
> of operations, but never promising to expose a fine-grained
> representation of all the versions an entry had within a range.
> 
> BTW I'm not at all against integrating with Debezium; that looks like
> a very good idea. I'm just checking whether we can agree on the
> limitations this should have, so we can clearly describe this feature:
> when it's useful, and when it's not.
> 
> Thanks,
> Sanne