[infinispan-dev] Cloud-TM & Infinispan

Paolo Romano romanop at gsd.inesc-id.pt
Wed Oct 27 04:43:38 EDT 2010


  Hi Manik,

So far my group has been studying how to fit Atomic Broadcast (AB) 
based replication mechanisms within the existing 1-phase/2-phase commit 
schemes of Infinispan, without altering them. In principle this seems 
possible, though we will only know for sure once development is further 
along.

If we hit any roadblock, we'll let you know and try to find a more 
generic interface that can encapsulate both the current 2PC 
mechanisms and the alternative replication schemes that we intend to 
develop. In the meantime, sticking with the current interfaces seems 
less intrusive and will let us get acquainted with the current 
code base.

Specifically, our ideas here are:
- For a fully replicated system (no distribution): rather than using 
2PC, we could use 1PC, with the commit message being sent via AB rather 
than a plain broadcast. This message would carry the set of items 
written by the current transaction (xact). Upon delivery of the AB, each 
node would validate the transaction's writeset. In our current systems 
this is done by timestamping each transaction, when it starts, with an 
integer that is incremented whenever a write transaction commits. So 
when a transaction commits, we just check whether any of the items it 
wrote has been updated by a transaction with a timestamp larger than the 
one the current transaction had when it started. We took a quick look at 
Infinispan's MVCC implementation, and we got the impression that there 
is currently no analogous mechanism. Is that correct?
As a side note, the protocols we presented in Lisbon ensure 
serializability, so they need to deal with the issue of disseminating 
transactions' readsets across nodes. As encoding transactions' readsets 
typically means generating very large messages, we have recently 
proposed a replication scheme that significantly reduces the amount of 
information exchanged by encoding the readset in a Bloom filter.
On the other hand, by providing repeatable read and tracking only 
write-write conflicts, Infinispan avoids this kind of issue a priori.
Now, I am not entirely sure whether it would make sense to extend 
Infinispan within the Cloud-TM project to support serializability. But 
if we opt to do so, it would be interesting to integrate this technique 
as well.
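
The timestamp-based writeset validation described in the first item 
above can be sketched as follows. This is only a minimal illustration in 
Python, not Infinispan code: the Replica class and all of its field and 
method names are hypothetical, and the Atomic Broadcast layer is assumed 
to exist and to call deliver_commit() in the same order on every node.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Local state of one node (all names are illustrative assumptions)."""
    clock: int = 0                                  # incremented on each committed write xact
    values: dict = field(default_factory=dict)      # key -> current value
    versions: dict = field(default_factory=dict)    # key -> clock of the last commit writing it

    def begin(self) -> int:
        """Timestamp a starting transaction with the current clock."""
        return self.clock

    def deliver_commit(self, start_ts: int, writeset: dict) -> bool:
        """Invoked in AB delivery order; returns True on commit, False on abort."""
        for key in writeset:
            # Write-write conflict: the item was updated by a transaction
            # that committed after this one started.
            if self.versions.get(key, 0) > start_ts:
                return False
        self.clock += 1
        for key, value in writeset.items():
            self.values[key] = value
            self.versions[key] = self.clock
        return True
```

Since every node delivers the same commit messages in the same order and 
validation depends only on local state, all nodes should reach the same 
commit/abort outcome without a second round of messages.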

- For a partially replicated system: this is where 2PC would be used. 
The simplest scheme one could use here is the following (we have come up 
with a new, more complex protocol, but we prefer to advance in small 
steps and implement a simpler one first). During the first phase the 
coordinator would do an Atomic Multicast (AM) to the transaction's other 
participants. Upon delivery of the AM by a node "n", the data accessed 
by the transaction and stored by "n" would be locally validated. Note 
that all replicas of a data item would deliver the coordinator's 
messages in the same order. Thus validation would give the same outcome 
at all replicas. Also, the mechanism would be deadlock-free. Now there 
are two options, depending on whether we want a decentralized or a 
centralized scheme.
     a) each participant multicasts (plain) the outcome of its local 
validation phase to all other participants. As soon as a node collects 
a negative vote, it can abort right away. Otherwise, as soon as a 
node gathers a positive vote from (at least) one replica of each data 
item accessed by the xact, it can commit.
     b) the participants send the outcome of the local validation phase 
to the coordinator. The coordinator would then behave as in 
classic 2PC.
In case a) the number of exchanged messages would be quadratic in the 
number of transaction participants, but the commit latency would be that 
of an AM plus a multicast. In case b) the number of exchanged messages 
would be linear in the number of transaction participants, but the 
commit latency would be that of an AM plus 2 communication steps (one to 
deliver the vote to the coordinator, one for the coordinator to 
communicate the decision to the participants).
Note that in case a) we would skip the second phase of 2PC entirely 
(unless we are missing something, this should be feasible by handling 
this protocol as a special case in the interceptor chain).
This protocol (variant a) was actually presented in [1], if you want 
more details.
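
To make the decision rule of variant a) concrete, here is a minimal 
sketch in Python. The function name, the "abort"/"commit"/"pending" 
labels and the two dictionaries are assumptions made for illustration; 
they are not taken from [1] or from Infinispan.

```python
def decide(votes, replicas_of):
    """Decide a transaction's fate from the votes a node has received so far.

    votes:       node id -> True/False (outcome of that node's local validation)
    replicas_of: data item accessed by the xact -> set of node ids storing it
    """
    # A single negative vote is enough to abort immediately.
    if any(vote is False for vote in votes.values()):
        return "abort"
    # Commit needs a positive vote from at least one replica of every item.
    for nodes in replicas_of.values():
        if not any(votes.get(node) is True for node in nodes):
            return "pending"
    return "commit"
```

Each participant would run this rule locally as votes arrive, which is 
why no coordinator decision message is needed in this variant.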

Note that both approaches are deadlock-free, as the transaction 
serialization order is imposed by the order determined by the Atomic 
Broadcast. The cost to implement Atomic Broadcast depends on the precise 
guarantees you want to provide (e.g. upon failure of a node, should the 
system block until it recovers? Note that this is what you typically get 
with 2PC), and on the specific protocol that you use. The fastest (in 
terms of latency) Atomic Broadcast protocols are those based on a 
process, called the sequencer, whose role is to sequence messages. In this 
case, an extra communication step (+1 log on the sequencer side) would 
be required in order to obtain the serialization number from the sequencer.
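
As a rough illustration of the sequencer idea, the following Python 
sketch shows how a single process can impose a total order and how 
receivers can deliver in that order even when the network reorders 
messages. The Sequencer and Receiver classes are hypothetical, and the 
sketch deliberately ignores failures, logging and retransmission, which 
a real Atomic Broadcast protocol has to handle.

```python
import itertools


class Sequencer:
    """Hands out consecutive sequence numbers, starting at 1."""

    def __init__(self):
        self._counter = itertools.count(1)

    def assign(self, msg):
        # The extra communication step: a sender asks for a number first.
        return next(self._counter), msg


class Receiver:
    """Buffers out-of-order messages and delivers them in sequence order."""

    def __init__(self):
        self.expected = 1      # next sequence number to deliver
        self.pending = {}      # seq -> message received but not yet deliverable
        self.delivered = []    # messages in delivery (total) order

    def receive(self, seq, msg):
        self.pending[seq] = msg
        # Deliver every message whose predecessors have all been delivered.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
```

Two receivers that get the same numbered messages in different network 
orders end up with identical delivered lists, which is the property the 
validation schemes above rely on.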

Cheers,

     Paolo

-----------------
[1] www.inf.usi.ch/phd/schiper/research/SRDS10.pdf


On 10/25/10 5:26 PM, Manik Surtani wrote:
> Greetings and welcome to this list, Paolo.  :)
>
> As you said your starting point is looking at the replication mechanisms.  We discussed the current 2-phase scheme in detail when I was in Lisbon, and I am very keen on an alternate atomic broadcast style approach.  You presented a few different approaches even within the broader atomic broadcast umbrella, so it makes sense to make this layer pluggable so we can work with different implementations.
>
> Have you had a look at the existing 2-phase scheme to see how an alternate scheme can fit in, and where we'd need to introduce layers of abstraction?
>
> Cheers
> Manik
>
>
> On 3 Oct 2010, at 19:18, Paolo Romano wrote:
>
>> Hi all,
>>
>> I am new here, so let me first introduce myself. I am Paolo Romano, a
>> researcher working at INESC-ID Lisbon, you can find more about me and my
>> research activities at my webpage: http://www.gsd.inesc-id.pt/~romanop.
>>
>> I am posting to this mailing list to introduce the Cloud-TM project
>> (http://www.cloudtm.eu), an EU-funded project started in June which
>> brings together Red Hat, INESC-ID Lisbon (http://www.gsd.inesc-id.pt),
>> Rome University "La Sapienza" (http://www.dis.uniroma1.it/~hpdcs) and
>> Algorithmica (http://www.algorithmica.it).
>>
>> Citing the project's abstract:
>> "Cloud-TM aims at defining a novel programming paradigm to facilitate
>> the development and administration of cloud applications. It will
>> develop a Self-Optimizing Distributed Transactional Memory middleware
>> that will spare programmers from the burden of coding for distribution,
>> persistence and fault-tolerance, letting them focus on delivering
>> differentiating business value. Further, the Cloud-TM platform aims at
>> minimizing the operational costs of cloud applications, pursuing optimal
>> efficiency via autonomic resource provisioning and pervasive self-tuning
>> schemes."
>>
>> Infinispan is expected to play a key role in Cloud-TM, as it has been
>> chosen as the reference platform to integrate the main research results
>> achieved during the project.  Specifically, our plan is to extend
>> Infinispan along the following main directions:
>> 1. Build a library of alternative replication mechanisms optimized for
>> different workload scenarios (e.g. hi/low conflict rate, read/write
>> intensive) and scales of the platform (e.g. few/many nodes,
>> local/geographical distribution)
>> 2. Developing self-scaling mechanisms aimed at elastically allocating
>> nodes from Cloud computing platforms to Infinispan caches depending on
>> the current workload.
>> 3. Developing self-tuning mechanisms that will adaptively alter the data
>> replication and distribution algorithms depending on the current
>> workload characteristics and scale of the platform.
>> 4. Providing programmers with a Distributed Software Transactional
>> Memory interface via a wrapper over Infinispan. This wrapper would be
>> close in spirit to what PojoCache is for TreeCache, though we are
>> currently oriented towards using a Domain Modelling Language and a
>> precompilation phase to generate the code to interact with Infinispan
>> (along the lines of what is done in the Fenix framework,
>> https://fenix-ashes.ist.utl.pt/trac/fenix-framework). Note that we are
>> still at a very early design phase, so we are open to ideas and
>> comments, and especially keen to learn from your experiences with
>> PojoCache.
>>
>> As developers of Infinispan, your feedback is extremely valuable to us.
>> On one hand, nobody is better placed than you to advise us on
>> how to fit any new experimental feature we develop into Infinispan's
>> codebase in the least intrusive fashion. On the other hand,
>> you can help us identify the most critical issues for
>> realistic deployments of Infinispan in Cloud environments, pointing out,
>> for instance, which of the current Infinispan
>> parameters/functionalities would benefit the most from self-tuning
>> approaches.
>>
>> We have already started looking at the internal structure of
>> Infinispan's replication modules, and in the coming days we will be
>> posting more about the kind of replication schemes (see point 1 above)
>> we would like to integrate in Infinispan, and how we are planning to do so.
>> In the meanwhile, as a teaser :-), I am sending a reference to a couple
>> of recent papers of ours if you are curious to know what kind of
>> replication solutions we are currently working on:
>> - http://www.gsd.inesc-id.pt/~romanop/files/papers/prdc09.pdf
>> - http://www.gsd.inesc-id.pt/~romanop/files/papers/middleware10.pdf
>>
>> Cheers,
>>
>>     Paolo
>>
>> -- 
>>
>> Paolo Romano, PhD
>> Researcher at INESC-ID
>> Rua Alves Redol, 9
>> 1000-059, Lisbon Portugal
>> Tel. + 351 21 3100300
>> Fax  + 351 21 3145843
>> Webpage http://www.gsd.inesc-id.pt/~romanop
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> --
> Manik Surtani
> manik at jboss.org
> Lead, Infinispan
> Lead, JBoss Cache
> http://www.infinispan.org
> http://www.jbosscache.org


