[infinispan-dev] Infinispan and change data capture

Sanne Grinovero sanne at infinispan.org
Thu Dec 15 10:09:07 EST 2016


Thanks Randall,
those clarifications have been great.

Emmanuel: some of your statements conflict with Randall's
clarifications and with the feasibility concerns I've been raising.
You say "collect *all* changes". My point has been that Infinispan
cannot keep *all* changes around for a given single key; I understand
we'd allow clients to retrieve streams of changes persisted into
Kafka, but we need to be clear that we won't be feeding *all* changes
to Kafka (nor to Debezium), so the magic these tools can do is
somewhat limited. They can certainly expand on the capabilities that
Infinispan would provide on its own, but some of the use cases
Gustavo mentioned would not be suitable.

I don't think this is a big problem in practice though. Take
monitoring the fluctuations in value of some stock symbol: it
wouldn't be possible to compute derived figures from those
fluctuations given only the key/value pair "stock name" / "value".
However, people can store such events differently, for example by
using a composite key of "stock name" + "timestamp" (sketched below).
People just need clarity on how this works, and we need to model the
storage appropriately.
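
As an illustration, here is a hypothetical composite key (the class
and values are made up, not something Infinispan ships); the point is
simply that each price observation becomes its own immutable entry:

    import java.io.Serializable;
    import java.util.Objects;

    // Keyed by (symbol, timestamp) instead of overwriting a single
    // "stock name" -> value entry on every fluctuation.
    final class StockTick implements Serializable {
       final String symbol;
       final long timestampMillis;

       StockTick(String symbol, long timestampMillis) {
          this.symbol = symbol;
          this.timestampMillis = timestampMillis;
       }

       @Override
       public boolean equals(Object o) {
          if (this == o) return true;
          if (!(o instanceof StockTick)) return false;
          StockTick that = (StockTick) o;
          return timestampMillis == that.timestampMillis
                && symbol.equals(that.symbol);
       }

       @Override
       public int hashCode() {
          return Objects.hash(symbol, timestampMillis);
       }
    }

    // Usage with a plain org.infinispan.Cache<StockTick, Double>:
    //    cache.put(new StockTick("ACME", System.currentTimeMillis()), 68.42);
    // Every fluctuation is a distinct entry, so nobody needs Infinispan
    // to replay the intermediary values of a single key.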

Thanks,
Sanne


On 15 December 2016 at 09:54, Emmanuel Bernard <emmanuel at hibernate.org> wrote:
> The goal is as follows: allow all changes to be collected so they can be pushed to Debezium and thus Kafka.
>
> This does not require Infinispan to remember all changes since the beginning of time. Just enough to:
> - let Kafka catch up, assuming it is the bottleneck
> - let us not lose a change that happened in Infinispan before it reaches Kafka (coordinator, owner or replicas dying)
>
> The ability to read back history would then be handled by the Debezium / Kafka tail, not by Infinispan itself.
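>
> To make "just enough" concrete, here is a rough, hypothetical sketch
> (not an existing Infinispan API) of a bounded change journal that
> only retains entries the Kafka side has not yet acknowledged:
>
>     import java.util.ArrayDeque;
>     import java.util.ArrayList;
>     import java.util.Deque;
>     import java.util.List;
>
>     final class ChangeJournal<K, V> {
>
>        static final class Change<K, V> {
>           final long offset;
>           final K key;
>           final V value;   // null marks a removal (tombstone)
>           Change(long offset, K key, V value) {
>              this.offset = offset;
>              this.key = key;
>              this.value = value;
>           }
>        }
>
>        private final Deque<Change<K, V>> pending = new ArrayDeque<>();
>        private long nextOffset;
>
>        // Recorded for every committed write or removal.
>        synchronized long append(K key, V value) {
>           Change<K, V> c = new Change<>(nextOffset++, key, value);
>           pending.addLast(c);
>           return c.offset;
>        }
>
>        // Lets the connector catch up from the last offset it saw.
>        synchronized List<Change<K, V>> readFrom(long fromOffset) {
>           List<Change<K, V>> out = new ArrayList<>();
>           for (Change<K, V> c : pending) {
>              if (c.offset >= fromOffset) {
>                 out.add(c);
>              }
>           }
>           return out;
>        }
>
>        // Called once Kafka has durably stored everything up to
>        // ackOffset; only then may Infinispan forget older entries.
>        synchronized void ackUpTo(long ackOffset) {
>           while (!pending.isEmpty()
>                 && pending.peekFirst().offset <= ackOffset) {
>              pending.removeFirst();
>           }
>        }
>     }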
>
> Check my email on this thread from Dec 9th.
>
>> On 12 Dec 2016, at 16:13, Sanne Grinovero <sanne at infinispan.org> wrote:
>>
>> I'm reading many clever suggestions for various aspects of such a
>> system, but I fail to see a clear definition of the goal.
>>
>> From Randall's opening email I understand how MySQL does this, but
>> it's just one example, and I'm not sure which aspects are
>> implementation details of how MySQL happens to accomplish this and
>> which are requirements for the Infinispan enhancement proposals.
>>
>> I remember a meeting with Manik Surtani, Jonathan Halliday and Mark
>> Little, whose outcome was a general agreement that Infinispan would
>> eventually need both tombstones and versioned entries, not just for
>> change data capture but to improve several other aspects;
>> unfortunately that was in December 2010 and never became a priority,
>> but the benefits are clear.
>> The complexities which have put off such plans lie in the "garbage
>> collection", i.e. the need to keep the history from growing without
>> bounds, which means having to drop or compact it.
>>
>> So I'm definitely sold on the need to add a certain amount of history,
>> but we need to define how much of this history is expected to be held.
>>
>> In short, what's the ultimate goal? I see two main but distinct
>> options intertwined:
>> - synchronizing the *final state* of a replica
>> - inspecting specific changes
>>
>> For the first case, it would be enough for us to be able to provide
>> a "squashed history" (as in a Git squash), but we'd need to keep
>> versioned snapshots around, and someone needs to tell us which ones
>> can be garbage collected.
>> For example, when a key is written, updated, updated and then
>> deleted since the snapshot, we'll send only the "deleted" event, as
>> the intermediary states are irrelevant.
>> For the second case, say the goal is to inspect the fluctuations in
>> price of some item, then the intermediary states are very much
>> relevant.
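>>
>> A tiny sketch of what I mean by squashing (hypothetical types, just
>> to illustrate the idea): out of an ordered change stream we only
>> keep the last change seen for each key.
>>
>>    import java.util.LinkedHashMap;
>>    import java.util.List;
>>    import java.util.Map;
>>
>>    final class SquashExample {
>>
>>       enum Op { WRITE, UPDATE, DELETE }
>>
>>       static final class Change {
>>          final Op op;
>>          final String key;
>>          final Object value;
>>          Change(Op op, String key, Object value) {
>>             this.op = op;
>>             this.key = key;
>>             this.value = value;
>>          }
>>       }
>>
>>       // write, update, update, delete for the same key collapses
>>       // to just the delete.
>>       static Map<String, Change> squash(List<Change> orderedChanges) {
>>          Map<String, Change> squashed = new LinkedHashMap<>();
>>          for (Change c : orderedChanges) {
>>             squashed.put(c.key, c); // later change replaces earlier
>>          }
>>          return squashed;
>>       }
>>    }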
>>
>> Which one will we want to solve? Both?
>> Personally, attempting to solve the second one seems like a huge
>> pivot for the project; the current data structures and storage are
>> not designed for it. I see the value of such capabilities, but maybe
>> Infinispan is not the right tool for that problem.
>>
>> I'd prefer to focus on the benefits of the squashed history and have
>> versioned entries soon, but even in that case we need to define
>> which versions need to be kept around and how garbage collection /
>> vacuuming is handled.
>> This can be designed to be transparent to the client, handled as an
>> internal implementation detail which we use to improve the
>> performance of Infinispan itself; or it can be exposed to clients to
>> implement change data capture, but in that case we need to track
>> which clients still need an older snapshot, which means clients
>> would have to be registered and has a significant impact on the
>> storage strategies.
>>
>> Within Kafka the log compaction strategies are configurable; I have
>> no experience with Kafka, but the docs seem to suggest that it's
>> most often used to provide the last known value of each key. That
>> would be doable for us, but Kafka also optionally allows broader
>> retention strategies: can we agree that those would not be an option
>> with Infinispan? If not, these goals need to be clarified.
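>>
>> For reference (with the caveat above that I'm no Kafka expert), the
>> topic-level settings involved appear to be along these lines; the
>> values here are purely illustrative:
>>
>>    cleanup.policy=compact          # keep only the last value per key
>>    delete.retention.ms=86400000    # how long delete markers survive compaction
>>    # since Kafka 0.10.1 both policies can reportedly be combined,
>>    # e.g. cleanup.policy=compact,delete with retention.ms=604800000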
>>
>> My main concern is that if we don't limit the scope of the
>> capabilities we want Infinispan to provide, it risks becoming the
>> same thing as Kafka, rather than integrating with it. I don't think
>> we want to pivot our whole storage design towards efficiently
>> handling large-scale logs.
>>
>> In short, I'd like to see agreement that analyzing e.g. fluctuations
>> in stock prices would be a non-goal if these are stored as
>> {"stock name", value} key/value pairs. One could still implement
>> such a thing by using a more sophisticated model; just don't expect
>> to be able to see all the intermediary values each entry has ever
>> had since the key was first used.
>>
>> # Commenting on specific proposals
>>
>> On ID generation: I'd definitely go with IDs per segment rather than
>> IDs per key for the purpose of change data capture. If you go with
>> independent IDs per key, the client would need to keep track of each
>> version of each entry, which implies high overhead and complexity
>> for the clients.
>> On the other hand, we already guarantee that each segment is managed
>> by a single primary owner, so attaching the "segment transaction id"
>> to each internal entry being changed can be implemented efficiently
>> by Infinispan.
>> Segment ownership handoff needs to be highly consistent during
>> cluster topology changes, but that requirement already exists; we'd
>> just need to make sure that this monotonic counter is included when
>> the responsibility as primary owner of a segment is handed off.
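>>
>> To illustrate (a hypothetical sketch, not a proposal for the actual
>> API): the primary owner holds one monotonic counter per segment and
>> exports/imports it as part of the handoff.
>>
>>    import java.util.concurrent.ConcurrentHashMap;
>>    import java.util.concurrent.atomic.AtomicLong;
>>
>>    final class SegmentVersionService {
>>
>>       private final ConcurrentHashMap<Integer, AtomicLong> counters =
>>             new ConcurrentHashMap<>();
>>
>>       // Called by the primary owner for every write landing in the
>>       // given segment.
>>       long nextVersion(int segmentId) {
>>          return counters.computeIfAbsent(segmentId, s -> new AtomicLong())
>>                .incrementAndGet();
>>       }
>>
>>       // Read when handing the segment over to the new primary owner.
>>       long exportCounter(int segmentId) {
>>          AtomicLong c = counters.get(segmentId);
>>          return c == null ? 0L : c.get();
>>       }
>>
>>       // Installed on the new primary owner before it accepts writes
>>       // for the segment, so the sequence stays monotonic across the
>>       // topology change.
>>       void importCounter(int segmentId, long lastValue) {
>>          counters.compute(segmentId, (s, old) ->
>>                new AtomicLong(Math.max(lastValue,
>>                      old == null ? 0L : old.get())));
>>       }
>>    }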
>>
>> Thanks,
>>
>> Sanne
>>
>>
>>
>>
>>
>> On 12 December 2016 at 13:58, Gustavo Fernandes <gustavo at infinispan.org> wrote:
>>>
>>>
>>> On Fri, Dec 9, 2016 at 9:13 AM, Radim Vansa <rvansa at redhat.com> wrote:
>>>>
>>>> But introducing a globally monotonic counter means that
>>>> there will be a single contention point.
>>>
>>>
>>> I wonder if the trade-offs of Flake IDs [1] would be acceptable for
>>> this use case.
>>>
>>> [1] http://yellerapp.com/posts/2015-02-09-flake-ids.html
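>>>
>>> Roughly, a Flake-style generator avoids the single contention point
>>> by building loosely time-ordered IDs from a timestamp, a node
>>> identifier and a local sequence, with no coordination between
>>> nodes. A simplified 64-bit sketch in that spirit (field widths are
>>> illustrative and smaller than real Flake IDs):
>>>
>>>    final class FlakeStyleIdGenerator {
>>>
>>>       private final long nodeId;   // e.g. derived from the node's
>>>                                    // address; 10 bits here
>>>       private long lastTimestamp = -1L;
>>>       private long sequence;       // 12 bits: up to 4096 IDs per
>>>                                    // node per millisecond
>>>
>>>       FlakeStyleIdGenerator(long nodeId) {
>>>          this.nodeId = nodeId & 0x3FF;
>>>       }
>>>
>>>       synchronized long nextId() {
>>>          long now = Math.max(System.currentTimeMillis(), lastTimestamp);
>>>          if (now == lastTimestamp) {
>>>             sequence = (sequence + 1) & 0xFFF;
>>>             if (sequence == 0) {
>>>                // sequence exhausted for this millisecond: wait for
>>>                // the next one
>>>                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
>>>                   // spin
>>>                }
>>>             }
>>>          } else {
>>>             sequence = 0;
>>>          }
>>>          lastTimestamp = now;
>>>          return (now << 22) | (nodeId << 12) | sequence;
>>>       }
>>>    }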
>>>
>>>

