On 12/09/2016 05:08 PM, Randall Hauch wrote:
> On Dec 9, 2016, at 3:13 AM, Radim Vansa <rvansa(a)redhat.com
> <mailto:rvansa@redhat.com>> wrote:
>
> On 12/08/2016 10:13 AM, Gustavo Fernandes wrote:
>>
>> I recently updated a proposal [1] based on several discussions we had
>> in the past that is essentially about introducing an event storage
>> mechanism (write ahead log) in order to improve reliability, failover
>> and "replayability" for the remote listeners, any feedback greatly
>> appreciated.
>
> Hi Gustavo,
>
> while I really like the pull-style architecture and reliable events, I
> see some problematic parts here:
>
> 1) 'cache that would persist the events with a monotonically
> increasing id'
>
> I assume that you mean globally (for all entries) monotonous. How will
> you obtain such ID? Currently, commands have unique IDs that are
> <Address, Long> where the number part is monotonous per node. That's
> easy to achieve. But introducing globally monotonous counter means that
> there will be a single contention point. (you can introduce another
> contention points by adding backups, but this is probably unnecessary as
> you can find out the last id from the indexed cache data). Per-segment
> monotonous would be probably more scalabe, though that increases
> complexity.
It is complicated, but one way to do this is to have one “primary”
node maintain the log and to have other replicate from it. The cluster
does need to use consensus to agree which is the primary, and to know
which secondary becomes the primary if the primary is failing.
Consensus is not trivial, but JGroups Raft
(
http://belaban.github.io/jgroups-raft/) may be an option. However,
this approach ensures that the replica logs are identical to the
primary since they are simply recording the primary’s log as-is. Of
course, another challenge is what happens during a failure of the
primary log node, and can any transactions be performed/completed
while the primary is unavailable.
I am not sure here if you propose to store all events in log on one
node, use RAFT for the monotonic counter, or just for some node
selection that will source the ids. In either case, you introduce a
bottleneck - RAFT does not scale performance-wise, as any solution that
uses single node for each operation, no matter how simple that operation is.
Another option is to have each node maintain their own log, and to
have an aggregator log that merges/combines the various logs into one.
Not sure how feasible it is to merge logs by getting rid of duplicates
and determining a total order, but if it is then it may have better
fault tolerance characteristics.
Of course, it is possible to have node-specific monotonic IDs. For
example, MySQL Global Transaction IDs (GTIDs) use a unique UUID for
each node, and then GTIDs consists of the node’s UUID plus a
monotonically-increasing value (e.g.,
“31fc48cd-ecd4-46ad-b0a9-f515fc9497c4:1001”). The transaction log
contains a mix of GTIDs, and MySQL replication uses a “GTID set” to
describe the ranges of transactions known by a server (e.g.,
“u1:1-100,u2:1-10000,u3:3-5” where “u1”, “u2”, and “u3” are actually
UUIDs). So, when a MySQL replica connects, it says “I know about this
GTID set", and this tells the master where that client wants to start
reading.
Yes, similar node + monotonous id is used both for transactions and for
non-transactional commands in Infinispan. I would say that in
complexity, it's similar to per-segment counters, but so far we have a
constant number of segments as opposed to varying number of nodes.
Node-specific monotonic ids do not give you monotonic order of commits,
just unique ids: If a NodeA does operation 1 and 2, this does not say
that 1 will be comitted before 2; 2 can be finished (and pushed to log)
before 1. But I don't think you really need a monotonic sequence. In
Infinispan, all the nodes should push the events in the same order,
though, so the log will know where to start from if a client asks for
all messages after op 1. As long as duplicates are properly filtered out.
>
> 2) 'The write to the event log would be async in order to not affect
> normal data writes'
>
> Who should write to the cache?
> a) originator - what if originator crashes (despite the change has been
> added)? Besides, originator would have to do (async) RPC to primary
> owner (which will be the primary owner of the event, too).
> b) primary owner - with triangle, primary does not really know if the
> change has been written on backup. Piggybacking that info won't be
> trivial - we don't want to send another message explicitly. But even if
> we get the confirmation, since the write to event cache is async, if the
> primary owner crashes before replicating the event to backup, we lost
> the event
> c) all owners, but locally - that will require more complex
> reconciliation if the event did really happen on all surviving nodes or
> not. And backups could have some trouble to resolve order, too.
>
> IIUC clustered listeners are called from primary owner before the change
> is really confirmed on backups (@Pedro correct me if I am wrong,
> please), but for this reliable event cache you need higher level of
> consistency.
This could be handled by writing a confirmation or “commit” event to
the log when the write is confirmed or the transaction is committed.
Then, only those confirmed events/transactions would be exposed to
client listeners. This requires some buffering, but this could be done
in each HotRod client.
I would put this under "originator". So, if the node that writes the
"commit" event crashes, the data is changed (and consistent) in the
cluster but nobody will be notified about that.
Note that Infinispan does not guarantee that data being written by a
crashing node will end up consistent on all owners, because it is the
originator who retries the operation if one of the owners crashed (or
generally, when a topology changes during the command). So it's not that
bad solution after all, if you're okay by missing an effectively
committed operation on node crash.
>
> 3) The log will also have to filter out retried operations (based on
> command ID - though this can be indexed, too). Though, I would prefer to
> see per-event command-id log to deal with retries properly.
IIUC, a “commit” event would work here, too.
>
> 4) Client should pull data, but I would keep push notifications that
> 'something happened' (throttled on server). There could be use case for
> rarely updated caches, and polling the servers would be excessive there.
IMO the clients should poll, but if the server has nothing to return
it blocks until there is something or until a timeout occurs. This
makes it easy for clients and actually reduces network traffic
compared to constantly polling.
I would say that client waiting on a blocked connection is a push (maybe
there's a method to implement push otherwise on TCP connection but I am
not aware of it - please forgive my ignorance).
BTW, a lot of this is replicating the functionality of Kafka, which is
already quite mature and feature rich. It’s actually possible to
*embed* Kafka to simplify operations, but I don’t think that’s
recommended. And, it introduces a very complex codebase that would
need to be supported.
I wouldn't use complex third party project on a similar tier as
JGroups/Infinispan to implement basic functionality (which remote
listeners are), but for Debezium it could be a fit. Let's discuss your
Kafka based proposal in the follow-up mail thread.
R.
>
> Radim
>
>>
>>
>> [1]
>>
https://github.com/infinispan/infinispan/wiki/Remote-Listeners-improvemen...
>>
>> Thanks,
>> Gustavo
>>
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev(a)lists.jboss.org
>>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
> --
> Radim Vansa <rvansa(a)redhat.com <mailto:rvansa@redhat.com>>
> JBoss Performance Team
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org <mailto:infinispan-dev@lists.jboss.org>
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
--
Radim Vansa <rvansa(a)redhat.com>
JBoss Performance Team