[infinispan-dev] Infinispan and change data capture

Radim Vansa rvansa at redhat.com
Fri Dec 9 16:12:27 EST 2016


On 12/09/2016 05:08 PM, Randall Hauch wrote:
>
>> On Dec 9, 2016, at 3:13 AM, Radim Vansa <rvansa at redhat.com 
>> <mailto:rvansa at redhat.com>> wrote:
>>
>> On 12/08/2016 10:13 AM, Gustavo Fernandes wrote:
>>>
>>> I recently updated a proposal [1] based on several discussions we had
>>> in the past that is essentially about introducing an event storage
>>> mechanism (write ahead log) in order to improve reliability, failover
>>> and "replayability" for the remote listeners, any feedback greatly
>>> appreciated.
>>
>> Hi Gustavo,
>>
>> while I really like the pull-style architecture and reliable events, I
>> see some problematic parts here:
>>
>> 1) 'cache that would persist the events with a monotonically 
>> increasing id'
>>
>> I assume that you mean globally (for all entries) monotonic. How will
>> you obtain such an ID? Currently, commands have unique IDs that are
>> <Address, Long> where the numeric part is monotonic per node. That's
>> easy to achieve. But introducing a globally monotonic counter means that
>> there will be a single contention point. (You can introduce further
>> contention points by adding backups, but this is probably unnecessary, as
>> you can find out the last id from the indexed cache data.) Per-segment
>> monotonic counters would probably be more scalable, though that increases 
>> complexity.
>
> It is complicated, but one way to do this is to have one “primary” 
> node maintain the log and to have the others replicate from it. The cluster 
> does need to use consensus to agree which node is the primary, and to know 
> which secondary becomes the primary if the primary fails. 
> Consensus is not trivial, but JGroups Raft 
> (http://belaban.github.io/jgroups-raft/) may be an option. However, 
> this approach ensures that the replica logs are identical to the 
> primary since they are simply recording the primary’s log as-is. Of 
> course, another challenge is what happens during a failure of the 
> primary log node, and whether any transactions can be performed/completed 
> while the primary is unavailable.

I am not sure whether you propose to store all events in a log on one 
node, to use Raft for the monotonic counter, or just to elect the node 
that will source the ids. In either case, you introduce a 
bottleneck - Raft does not scale performance-wise, just like any solution 
that routes every operation through a single node, no matter how simple 
that operation is.
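
Just to illustrate the per-node ids we already have - a rough sketch, 
not the actual Infinispan command id class (the names are made up):

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only - not the real Infinispan command id class. The point is
// that the counter is node-local: bumping it never needs a cluster round trip,
// unlike a single globally monotonic sequence would.
public final class NodeCommandId {
    // one counter per node, monotonic only within that node
    private static final AtomicLong LOCAL_COUNTER = new AtomicLong();

    private final UUID nodeAddress;   // stands in for the JGroups Address
    private final long sequence;

    private NodeCommandId(UUID nodeAddress, long sequence) {
        this.nodeAddress = nodeAddress;
        this.sequence = sequence;
    }

    public static NodeCommandId next(UUID nodeAddress) {
        return new NodeCommandId(nodeAddress, LOCAL_COUNTER.incrementAndGet());
    }

    @Override
    public String toString() {
        return nodeAddress + ":" + sequence;   // e.g. "<uuid>:42"
    }
}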

>
> Another option is to have each node maintain its own log, and to 
> have an aggregator log that merges/combines the various logs into one. 
> Not sure how feasible it is to merge logs by getting rid of duplicates 
> and determining a total order, but if it is, then it may have better 
> fault tolerance characteristics.
>
> Of course, it is possible to have node-specific monotonic IDs. For 
> example, MySQL Global Transaction IDs (GTIDs) use a unique UUID for 
> each node, and a GTID consists of the node’s UUID plus a 
> monotonically increasing value (e.g., 
> “31fc48cd-ecd4-46ad-b0a9-f515fc9497c4:1001”). The transaction log 
> contains a mix of GTIDs, and MySQL replication uses a “GTID set” to 
> describe the ranges of transactions known by a server (e.g., 
> “u1:1-100,u2:1-10000,u3:3-5” where “u1”, “u2”, and “u3” are actually 
> UUIDs). So, when a MySQL replica connects, it says “I know about this 
> GTID set", and this tells the master where that client wants to start 
> reading.

Yes, a similar node + monotonic id is used both for transactions and for 
non-transactional commands in Infinispan. I would say that in 
complexity it's similar to per-segment counters, but so far we have a 
constant number of segments as opposed to a varying number of nodes.
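
If we went the GTID-set route, the client-side bookkeeping could look 
roughly like this (a sketch with made-up names; a real GTID set also 
tracks gaps such as "u3:3-5", this one only keeps the highest contiguous 
sequence per node):

import java.util.HashMap;
import java.util.Map;

// Sketch with made-up names, not MySQL or Infinispan code.
public final class KnownEventSet {
    private final Map<String, Long> highestSeenPerNode = new HashMap<>();

    // all events from this node up to and including 'sequence' have been consumed
    public void markSeen(String nodeId, long sequence) {
        highestSeenPerNode.merge(nodeId, sequence, Math::max);
    }

    // true if the event was already consumed and can be skipped on replay
    public boolean alreadySeen(String nodeId, long sequence) {
        return highestSeenPerNode.getOrDefault(nodeId, 0L) >= sequence;
    }

    @Override
    public String toString() {
        // rendered in the spirit of the MySQL example, e.g. "u1:1-100,u2:1-10000"
        StringBuilder sb = new StringBuilder();
        highestSeenPerNode.forEach((node, seq) -> {
            if (sb.length() > 0) sb.append(',');
            sb.append(node).append(":1-").append(seq);
        });
        return sb.toString();
    }
}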

Node-specific monotonic ids do not give you a monotonic order of commits, 
just unique ids: if NodeA performs operations 1 and 2, this does not mean 
that 1 will be committed before 2; 2 can finish (and be pushed to the log) 
before 1. But I don't think you really need a monotonic sequence. In 
Infinispan, all the nodes should push the events in the same order, 
though, so the log will know where to start from if a client asks for 
all messages after op 1 - as long as duplicates are properly filtered out.
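
The filtering itself could be as simple as checking the command id 
before appending - again only a sketch, not an existing component; a real 
version would have to bound or index the set of seen ids:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: before appending an event to the log, check whether an event
// with the same command id was already stored, so a retried command does not
// produce a second entry.
public final class DedupAppender {
    private final Set<String> appendedCommandIds = ConcurrentHashMap.newKeySet();

    /** Returns true if the event was appended, false if it was a duplicate retry. */
    public boolean append(String commandId, Runnable writeToLog) {
        if (!appendedCommandIds.add(commandId)) {
            return false;             // already in the log, drop the retry
        }
        writeToLog.run();             // the actual write to the event cache/log
        return true;
    }
}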

>
>>
>> 2) 'The write to the event log would be async in order to not affect
>> normal data writes'
>>
>> Who should write to the cache?
>> a) originator - what if the originator crashes (even though the change has
>> been applied)? Besides, the originator would have to do an (async) RPC to
>> the primary owner (which will be the primary owner of the event, too).
>> b) primary owner - with triangle, the primary does not really know if the
>> change has been written on the backups. Piggybacking that info won't be
>> trivial - we don't want to send another message explicitly. But even if
>> we get the confirmation, since the write to the event cache is async, if the
>> primary owner crashes before replicating the event to the backups, we lose
>> the event.
>> c) all owners, but locally - that will require more complex
>> reconciliation of whether the event really happened on all surviving nodes
>> or not. And the backups could have trouble resolving the order, too.
>>
>> IIUC clustered listeners are called from the primary owner before the change
>> is really confirmed on the backups (@Pedro correct me if I am wrong,
>> please), but for this reliable event cache you need a higher level of
>> consistency.
>
> This could be handled by writing a confirmation or “commit” event to 
> the log when the write is confirmed or the transaction is committed. 
> Then, only those confirmed events/transactions would be exposed to 
> client listeners. This requires some buffering, but this could be done 
> in each HotRod client.

I would put this under "originator". So, if the node that writes the 
"commit" event crashes, the data is changed (and consistent) in the 
cluster, but nobody will be notified about it.
Note that Infinispan does not guarantee that data being written by a 
crashing node will end up consistent on all owners, because it is the 
originator who retries the operation if one of the owners crashed (or 
generally, when the topology changes during the command). So it's not that 
bad a solution after all, if you're okay with missing an effectively 
committed operation on a node crash.
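
The buffering you describe could look roughly like this on the client 
side (a hypothetical sketch, not the HotRod client API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: raw change events are buffered per transaction/command id
// and only handed to the listener once a "commit" confirmation for that id is
// read from the log. If the confirmation never arrives (the writer crashed),
// the buffered events are silently dropped - exactly the caveat above.
public final class CommitBuffer<E> {
    private final Map<String, List<E>> pending = new HashMap<>();
    private final Consumer<E> listener;

    public CommitBuffer(Consumer<E> listener) {
        this.listener = listener;
    }

    public void onEvent(String txId, E event) {
        pending.computeIfAbsent(txId, id -> new ArrayList<>()).add(event);
    }

    public void onCommit(String txId) {
        List<E> events = pending.remove(txId);
        if (events != null) {
            events.forEach(listener); // only confirmed events reach the listener
        }
    }

    public void onRollback(String txId) {
        pending.remove(txId);         // discard events of an aborted transaction
    }
}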

>
>>
>> 3) The log will also have to filter out retried operations (based on the
>> command ID - though this can be indexed, too). Still, I would prefer to
>> see a per-event command-id log to deal with retries properly.
>
> IIUC, a “commit” event would work here, too.
>
>>
>> 4) Clients should pull data, but I would keep push notifications that
>> 'something happened' (throttled on the server). There could be use cases for
>> rarely updated caches, and polling the servers would be excessive there.
>
> IMO the clients should poll, but if the server has nothing to return 
> it blocks until there is something or until a timeout occurs. This 
> makes it easy for clients and actually reduces network traffic 
> compared to constantly polling.

I would say that a client waiting on a blocked connection is a push (maybe 
there's another way to implement push over a TCP connection, but I am 
not aware of it - please forgive my ignorance).
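
What you describe is essentially long polling; a server-side sketch of 
it (illustrative only, not the HotRod protocol):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch only: the handler blocks on a queue until an event arrives or the
// timeout elapses, then returns whatever it has. An empty result means the
// timeout fired and the client should simply poll again.
public final class LongPollHandler<E> {
    private final BlockingQueue<E> events;

    public LongPollHandler(BlockingQueue<E> events) {
        this.events = events;
    }

    public List<E> poll(long timeoutMillis) throws InterruptedException {
        E first = events.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (first == null) {
            return List.of();         // nothing happened within the timeout
        }
        List<E> batch = new ArrayList<>();
        batch.add(first);
        events.drainTo(batch);        // also return anything else already queued
        return batch;
    }
}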

>
> BTW, a lot of this is replicating the functionality of Kafka, which is 
> already quite mature and feature rich. It’s actually possible to 
> *embed* Kafka to simplify operations, but I don’t think that’s 
> recommended. And, it introduces a very complex codebase that would 
> need to be supported.

I wouldn't use a complex third-party project on a similar tier as 
JGroups/Infinispan to implement basic functionality (which remote 
listeners are), but for Debezium it could be a fit. Let's discuss your 
Kafka-based proposal in the follow-up mail thread.

R.

>
>>
>> Radim
>>
>>>
>>>
>>> [1]
>>> https://github.com/infinispan/infinispan/wiki/Remote-Listeners-improvement-proposal
>>>
>>> Thanks,
>>> Gustavo
>>>
>>>
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>>
>> -- 
>> Radim Vansa <rvansa at redhat.com <mailto:rvansa at redhat.com>>
>> JBoss Performance Team
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org <mailto:infinispan-dev at lists.jboss.org>
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


-- 
Radim Vansa <rvansa at redhat.com>
JBoss Performance Team


