On 1 Nov 2010, at 21:36, Paolo Romano wrote:
On 11/1/10 4:36 PM, Manik Surtani wrote:
> On 27 Oct 2010, at 09:43, Paolo Romano wrote:
>
>> Hi Manik,
>>
>> To date, my group has been studying how to fit Atomic Broadcast (AB) based
>> replication mechanisms within the existing 1-phase/2-phase commit schemes of Infinispan,
>> without altering them. In principle this seems possible, though we will only find out for
>> sure as we advance with the development.
> Ok, great. Let us know how you get on, or if you need any help.
Sure!
>> If we find any roadblocks, we'll let you know and try to define a more
>> generic interface that can encapsulate both the current 2PC mechanisms and the
>> alternative replication schemes that we intend to develop. In the meantime, sticking with
>> the current interfaces seems less intrusive and will allow us to get acquainted with the
>> current code base.
>>
>> Specifically, our ideas here are:
>> - for a fully replicated system (no distribution): rather than using 2PC, we could use
>> 1PC, with the commit message being atomically broadcast rather than simply broadcast. This
>> message would carry the set of items written by the current transaction. Upon delivery of
>> the AB, each node would validate the transaction's writeset. In our current systems this is
>> done by timestamping each transaction, as it starts, with an integer that is incremented
>> whenever a write transaction commits. So when a transaction commits, we just check whether
>> any of the items it wrote has been updated by a transaction with a timestamp larger than
>> the one the current transaction had when it started. We took a quick look at Infinispan's
>> MVCC implementation, and got the impression that there currently isn't an analogous
>> mechanism. Is that correct?
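(To make the validation step above concrete, here is a rough sketch, assuming a single
global commit counter and a per-transaction start timestamp; the class and field names are
made up for illustration and are not Infinispan API.)

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only, not Infinispan code.
class WriteSetValidator {

    // Incremented every time a write transaction commits.
    private final AtomicLong lastCommitted = new AtomicLong(0);

    // Commit timestamp of the last write applied to each key.
    private final Map<Object, Long> lastWriteVersion = new ConcurrentHashMap<Object, Long>();

    // Timestamp a transaction records when it starts.
    long onTransactionStart() {
        return lastCommitted.get();
    }

    // Called upon AB delivery of the commit message carrying the writeset: abort if any
    // written item was updated by a transaction that committed after this transaction
    // started, otherwise apply the writes and commit.
    synchronized boolean validateAndCommit(Set<Object> writeSet, long startTimestamp) {
        for (Object key : writeSet) {
            Long v = lastWriteVersion.get(key);
            if (v != null && v > startTimestamp)
                return false;                     // conflict detected: abort
        }
        long commitTs = lastCommitted.incrementAndGet();
        for (Object key : writeSet)
            lastWriteVersion.put(key, commitTs);  // stamp the written items
        return true;                              // commit
    }
}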
> Correct. However, you could extend the CacheEntry to contain a Version field, which would
> be a transient atomic integer updated on each write. This would mean that each entry is
> independently versioned, though. I presume this works for you?
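(Roughly along these lines, just as a sketch; VersionedEntry is not an existing Infinispan
class, the names are illustrative only.)

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: an entry carrying its own version,
// bumped on every committed write to that entry.
class VersionedEntry<V> {

    private final transient AtomicInteger version = new AtomicInteger(0);
    private volatile V value;

    VersionedEntry(V value) {
        this.value = value;
    }

    V getValue() {
        return value;
    }

    int getVersion() {
        return version.get();
    }

    // Apply a committed write and return the new version of this entry.
    int commitWrite(V newValue) {
        this.value = newValue;
        return version.incrementAndGet();
    }
}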
Yes, it would. Actually, we have a few comments on the current MVCC
implementation that we'd like to share with you. On the Cloud-TM
discussion mailing list there have been a number of remarks that it
could be desirable to have stronger consistency guarantees than
repeatable read (snapshot isolation or serializability). But let's
proceed in steps: let's start by integrating Atomic Broadcast based
protocols, adapting them to ensure the current isolation guarantee
(repeatable read) first!
Yes, I owe you some replies on that mailing list. Sorry for the slow response, still
dealing with email backlog here. :-)
>> As a side note, the protocols we presented in Lisbon ensure serializability, so
>> they need to deal with the issue of disseminating transactions' readsets across nodes.
>> Since encoding transaction readsets typically means generating very large messages, we
>> have recently proposed a replication scheme that significantly reduces the amount of
>> information exchanged by encoding the readset in a Bloom filter.
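(Just to make the idea concrete, a toy sketch of the readset encoding, assuming Guava's
BloomFilter is available on the classpath; the library choice and parameters are purely
illustrative.)

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;
import java.util.Set;

// Illustrative sketch only: ship a compact Bloom filter instead of the full readset.
class ReadSetEncoder {

    static BloomFilter<CharSequence> encode(Set<String> readKeys) {
        BloomFilter<CharSequence> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.forName("UTF-8")),
                Math.max(readKeys.size(), 1),     // expected insertions
                0.01);                            // ~1% false positive rate
        for (String key : readKeys)
            filter.put(key);
        return filter;
    }

    // Remote validation: a false positive can only cause a spurious abort,
    // never a missed read-write conflict.
    static boolean mayConflict(BloomFilter<CharSequence> readSet, String writtenKey) {
        return readSet.mightContain(writtenKey);
    }
}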
> Yes, I did make a note of this rather interesting approach.
>
>> On the other hand, by providing repeatable read, and tracking only write-write
>> conflicts, Infinispan avoids this kind of issue a priori.
>> Now, I am not entirely sure whether it would make sense to extend Infinispan within
>> the Cloud-TM project to provide support for serializability. But if we opt to do so, it
>> would be interesting to integrate this technique as well.
> Yes, but I think this should be treated with lower priority than atomic broadcasts.
I agree with you, see above.
>> - for a partially replicated system: this is where 2PC would be used. The
>> simplest scheme one could use here would be the following (we have come up with a
>> new, more complex protocol, but we prefer to advance in small steps and implement a simpler
>> one first). During the first phase the coordinator would perform an Atomic Multicast (AM)
>> to the other participants in the transaction. Upon delivery of the AM by a node "n", the
>> data accessed by the transaction and stored by "n" would be locally validated.
>> Note that all replicas of a data item would deliver the coordinator's message in the same
>> order, so validation would give the same outcome at all replicas. The mechanism would also
>> be deadlock free. Now there are two options, depending on whether we want a
>> decentralized or a centralized scheme:
>> a) each participant multicasts (plain multicast) to all other participants the
>> outcome of its local validation phase. As soon as a negative vote is collected, we can
>> abort straight away. Otherwise, as soon as a node gathers a positive vote from (at least)
>> one replica of each data item accessed by the transaction, it can commit.
>> b) the participants send the outcome of their local validation phase to the
>> coordinator. The coordinator would then behave like in classic 2PC.
>> In case a) the number of exchanged messages would be quadratic in the number of
>> transaction participants, but the commit latency would be that of an AM plus a multicast.
>> In case b) the number of exchanged messages would be linear in the number of transaction
>> participants, but the commit latency would be that of an AM plus 2 communication steps
>> (one to deliver the vote to the coordinator, one for the coordinator to communicate the
>> decision to the participants).
>> Note that in case a) we would skip the second round of the 2PC entirely (unless we
>> are missing something, this should be feasible by handling this protocol as a special case
>> in the interceptor chain).
>> This protocol (variant a) was actually presented in [1], if you want more
>> details.
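(A rough sketch of the decentralized vote collection in variant a), just to make the
message flow concrete; names and structure are purely illustrative.)

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: after AM delivery each participant validates its local
// data, multicasts its vote, and every node decides as soon as it has either seen
// a negative vote or collected positive votes covering every accessed item.
class DecentralizedVoteCollector {

    enum Outcome { UNDECIDED, COMMIT, ABORT }

    private final Set<String> accessedItems;            // items touched by the transaction
    private final Map<String, Set<String>> replicasOf;  // item -> nodes replicating it
    private final Set<String> coveredItems = new HashSet<String>();
    private Outcome outcome = Outcome.UNDECIDED;

    DecentralizedVoteCollector(Set<String> accessedItems,
                               Map<String, Set<String>> replicasOf) {
        this.accessedItems = accessedItems;
        this.replicasOf = replicasOf;
    }

    // Called on receipt of a (plain multicast) vote from 'node'.
    synchronized Outcome onVote(String node, boolean positive) {
        if (outcome != Outcome.UNDECIDED)
            return outcome;
        if (!positive)
            return outcome = Outcome.ABORT;              // first negative vote aborts
        for (String item : accessedItems)
            if (replicasOf.get(item).contains(node))
                coveredItems.add(item);                  // this node covers 'item'
        if (coveredItems.containsAll(accessedItems))
            outcome = Outcome.COMMIT;                    // one positive vote per item seen
        return outcome;
    }
}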
> This sounds interesting. With regard to a centralized approach, I presume you mean
> central in the context of any given tx? E.g., tx1 may be centralized on node A while tx2
> is centralized on node B?
Yes, it's just the vote collection phase that could be either centralized
(i.e. managed by the node that ran the transaction, just as happens with
the coordinator in 2PC) or distributed.
Good, as I hoped.
>
>> Note that both approaches are deadlock-free, as the transaction serialization
>> order is imposed by the order determined by the Atomic Broadcast. The cost of implementing
>> Atomic Broadcast depends on the precise guarantees you want to provide (e.g. upon failure
>> of a node, should the system block until it recovers? Note that this is typically what you
>> get with 2PC), and on the specific protocol that you use. The fastest (in terms of
>> latency) Atomic Broadcast protocols are those based on a process, called the sequencer,
>> whose role is to order messages. In this case, an extra communication step (+1 log on the
>> sequencer side) would be required in order to obtain the sequence number from the
>> sequencer.
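(For reference, the sequencer-based scheme in a nutshell; this is a generic sketch, not
any particular implementation.)

import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: the sequencer stamps every message with a global
// sequence number (the extra communication step mentioned above), and each
// receiver delivers messages strictly in that order, buffering any gaps.
class Sequencer {
    private final AtomicLong nextSeq = new AtomicLong(0);

    long assignOrder() {
        return nextSeq.getAndIncrement();   // order assigned before re-broadcasting
    }
}

class TotalOrderReceiver {
    private long nextToDeliver = 0;
    private final Map<Long, byte[]> pending = new TreeMap<Long, byte[]>();

    synchronized void onMessage(long seq, byte[] payload) {
        pending.put(seq, payload);
        while (pending.containsKey(nextToDeliver)) {
            deliver(pending.remove(nextToDeliver));
            nextToDeliver++;
        }
    }

    private void deliver(byte[] payload) {
        // hand the totally ordered message to the replication/validation layer
    }
}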
> I presume the sequencer is a singleton service in the cluster. Would this become a
> bottleneck/single point of failure?
It would not be a single point of failure, as a new sequencer would be
elected upon failure of the former one. But it might become a bottleneck
at high load, yes.
Note that the sequencer algorithm is the simplest one (and the fastest in
terms of latency), but there is actually a large number of alternative
AB algorithms that address this issue. For instance, one could
distribute the sequencer to enhance scalability (a similar scheme was
implemented, for instance, in the Spread Toolkit), or use token-based
algorithms that spread the "sequencing load" across all the nodes
that send/deliver the atomic broadcasts. Alternatively, one could use
more complex replication algorithms that avoid sending one AB per
transaction, reducing the burden on the sequencing service. We have very
recently published one such algorithm in [1], which however addresses
the case of fully replicated systems, so it would need to be adapted to
work with Infinispan in its full glory! ;-)
Very interesting. We should investigate and evaluate different AB implementations here.
Again, if they can be made pluggable, all the better.
From my last conversation with Bela, however, there should be only one
implementation of Atomic Broadcast in JGroups, and that's sequencer
based. So I would start with this, do some performance analysis, and then
work on more scalable AB implementations if needed.
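(For example, with a stack that includes JGroups' SEQUENCER protocol, using the channel
would look roughly like the sketch below; the config file name and cluster name are
placeholders, and the exact API details vary a bit across JGroups versions.)

import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;

// Rough usage sketch (assumes "sequencer.xml" defines a stack containing the
// SEQUENCER protocol, so messages are delivered in the same total order everywhere).
public class TotalOrderChannelDemo {

    public static void main(String[] args) throws Exception {
        JChannel channel = new JChannel("sequencer.xml");
        channel.setReceiver(new ReceiverAdapter() {
            @Override
            public void receive(Message msg) {
                // Delivered in the same order on every node: this is where the
                // commit message carrying the writeset would be validated/applied.
                System.out.println("delivered: " + msg.getObject());
            }
        });
        channel.connect("infinispan-cluster");
        channel.send(new Message(null, "tx-commit"));  // null destination = send to all
        Thread.sleep(1000);
        channel.close();
    }
}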
Great, I'm glad you are working with Bela and JGroups on this.
Cheers
Manik
--
Manik Surtani
manik(a)jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org