<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    Hi Manik,<br>

    <br>

    up to date my group has been studying how to fit Atomic Broadcast

    (AB) based replication mechanisms within the existing

    1-phase/2-phases commit schemes of Infinispan, without altering

    them. In principle this seems possible, though we will find it out

    only when we advance with the development. <br>

    <br>

    If we found any roadblock, we'll let you know and try to find some

    more generic interface that allows to encapsulate both the current

    2PC mechanisms and the alternative replication schemes that we

    intend to develop. In the meanwhile, sticking with the current

    interfaces seems less intrusive and would allow us to get acquainted

    with the current code base.<br>

    <br>

    Specifically, our ideas here are:<br>

    - for fully replicated system, no distribution. Rather than using

    2PC, we could use the 1PC, with the commit message being AB rather

    than simply broadcast. This message would transmit the set of items

    written by the current xact. Upon delivery of the AB, each node

    should validate the transaction writeset. This in our current

    systems is done by timestamping each transaction as it starts with

    an integer that is incremented whenever a write transaction commits.

    So when a transaction commits, we just check if any of the items it

    wrote has been updated by a transaction having a timestamp larger

    than the one the current transaction had when it started. We took a

    quick look to Infinispan's MVCC implementation, and we got the

    impression that currently there isn't an analogous mechanism. Is it

    correct?<br>

    As a side note, the protocols we presented in Lisbon ensure

    serializability, so they need to deal with the issue of

    disseminating transactions' readsets across nodes. As encoding

    transactions readsets typically implies generating very large

    messages, we have recently proposed a replication scheme that allows

    to significantly reduce the amount of information exchanged by

    encoding the readset in a Bloom Filter.<br>

    On the other hand, by providing repeatable read, and tracking only

    write-write conflicts, Infinispan avoids this kind of issue a

    priori.<br>

    Now, I am not entirely sure if it would make sense to extend

    Infinispan within the Cloud-TM project to provide supports for

    serializability. But if we opt to do so, it would be interesting to

    integrate this technique as well.<br>

    <br>

    - for partially replicated system. This is where 2PC would be

    utilized. The simplest scheme that one could use here would be the

    following (we have come up with a new, more complex protocol, but we

    prefer to advance by small steps implementing a simpler one). During

    the first phase the coordinator would do an Atomic Multicast (AM) to

    the other transaction's participants. Upon delivery of the AM by a

    node "n", the data accessed by the transaction and stored by "n"

    would be locally validated. Note that all replicas of a data would

    deliver the coordinator message in the same order. Thus validation

    would give the same output at all replicas. Also the mechanism would

    be deadlock free. Now there are two options depending on whether we

    want to have a decentralized or centralized scheme. <br>

    &nbsp;&nbsp;&nbsp; a) each participant multicasts (plain) to all other participants

    what is the outcome of the local validation phase. As soon as we

    collect a negative vote, we can abort straightforwardly. Otherwise,

    as soon as a node gathers a positive vote from (at least) one

    replica of each data item accessed by the xact, it can commit. <br>

    &nbsp;&nbsp;&nbsp; b) the participants send to the coordinator the outcome of the

    local validation phase. The coordinator then would behave, like in

    classic 2PC.<br>

    In case a) the number of exchanged messages would be quadratic in

    the number of transaction participants, but the commit latency would

    be that of an AM plus a multicast. In case b) the number of

    exchanged messages would be linear in the number of transaction

    participants, but the commit latency would be that of an AM plus 2

    communication steps (one to deliver the vote to the coordinator, one

    for the coordinator to communicate the decision to the

    participants).<br>

    Note that in case a) we would totally skip the second cycle of the

    2PC (unless we are missing something this should be feasible by

    handling this protocol as a special case in the interceptors'

    chain).<br>

    This protocol (variant a) was actually presented in [1], if you want

    to have more details. <br>

    <br>

    Note that both approaches are deadlock-free, as the transaction

    serialization order is imposed by the order determined by the Atomic

    Broadcast. The cost to implement Atomic Broadcast depends on the

    precise guarantees you want to provide (e.g. upon failure of a node,

    should the system block until he recovers? Note that this is what

    you get typically with 2PC), and on the specific protocol that you

    use. The fastest (in terms of latency) Atomic Broadcast protocols

    are those based on a process, called sequencer, whose role is to

    sequence messages. In this case, an extra communication step (+1 log

    on the sequencer side) would be required in order to obtain the

    serialization number from the sequencer.<br>

    <br>

    Cheers,<br>

    <br>

    <span class="Apple-style-span" style="border-collapse: separate;

      color: rgb(0, 0, 0); font-family: arial,sans-serif; font-style:

      normal; font-variant: normal; font-weight: normal; letter-spacing:

      normal; line-height: normal; orphans: 2; text-indent: 0px;

      text-transform: none; white-space: normal; widows: 2;

      word-spacing: 0px; font-size: small;"><span

        class="Apple-style-span" style="color: rgb(14, 119, 74);

        line-height: 15px;"></span></span>&nbsp;&nbsp;&nbsp; Paolo<br>

    <br>

    -----------------<br>

    [1] <a class="moz-txt-link-abbreviated"

      href="http://www.inf.usi.ch/phd/schiper/research/SRDS10.pdf">www.inf.usi.ch/phd/schiper/research/SRDS10.pdf</a><br>

    <br>

    <br>

    On 10/25/10 5:26 PM, Manik Surtani wrote:

    <blockquote

      cite="mid:8602664C-3379-4A08-8B3C-BA549A70E50E@jboss.org"

      type="cite">

      <pre wrap="">Greetings and welcome to this list, Paolo.  :)

As you said your starting point is looking at the replication mechanisms.  We discussed the current 2-phase scheme in detail when I was in Lisbon, and I am very keen on an alternate atomic broadcast style approach.  You presented a few different approaches even within the broader atomic broadcast umbrella, so it makes sense to make this layer pluggable so we can work with different implementations.

Have you had a look at the existing 2-phase scheme to see how an alternate scheme can fit in, and where we'd need to introduce layers of abstraction?

Cheers

Manik

On 3 Oct 2010, at 19:18, Paolo Romano wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">Hi all,

I am new here, so let me first introduce myself. I am Paolo Romano, a 

researcher working at INESC-ID Lisbon, you can find more about me and my 

research activities at my webpage: <a class="moz-txt-link-freetext" href="http://www.gsd.inesc-id.pt/~romanop">http://www.gsd.inesc-id.pt/~romanop</a>.

I am posting to this mailing list to introduce the Cloud-TM project 

(<a class="moz-txt-link-freetext" href="http://www.cloudtm.eu">http://www.cloudtm.eu</a>), a EU funded project started in June which 

brings together Red Hat, INESC-ID Lisbon (<a class="moz-txt-link-freetext" href="http://www.gsd.inesc-id.pt">http://www.gsd.inesc-id.pt</a>), 

Rome University "La Sapienza" (<a class="moz-txt-link-freetext" href="http://www.dis.uniroma1.it/~hpdcs">http://www.dis.uniroma1.it/~hpdcs</a>) and 

Algorithmica (<a class="moz-txt-link-freetext" href="http://www.algorithmica.it">http://www.algorithmica.it</a>).

Citing the project's abstract:

"Cloud-TM aims at defining a novel programming paradigm to facilitate 

the development and administration of cloud applications. It will 

develop a Self-Optimizing Distributed Transactional Memory middleware 

that will spare programmers from the burden of coding for distribution, 

persistence and fault-tolerance, letting them focus on delivering 

differentiating business value. Further, the Cloud-TM platform aims at 

minimizing the operational costs of cloud applications, pursuing optimal 

efficiency via autonomic resource provisioning and pervasive self-tuning 

schemes."

Infinispan is expected to play a key role in Cloud-TM, as it has been 

chosen as the reference platform to integrate the main research results 

achieved during the project.  Specifically, our plan is to extend 

Infinispan along the following main directions:

1. Build a library of alternative replication mechanisms optimized for 

different workload scenarios (e.g. hi/low conflict rate, read/write 

intensive) and scales of the platform (e.g. few/many nodes, 

local/geographical distribution)

2. Developing self-scaling mechanisms aimed at elastically allocating 

nodes from Cloud computing platforms to Infinispan caches depending on 

the current workload.

3. Developing self-tuning mechanisms that will adaptively alter the data 

replication and distribution algorithms depending on the current 

workload characteristics and scale of the platform.

4. Providing programmers with a Distributed Software Transactional 

Memory interface via a wrapper over Infinispan. This wrapper would be 

close  in spirit to what  PojoCache is for TreeCache, though we are 

currently oriented towards using a Domain Modelling Language and a 

precompilation phase to generate the code to interact with Infinispan 

(along the lines of what is done in the Fenix framework, 

<a class="moz-txt-link-freetext" href="https://fenix-ashes.ist.utl.pt/trac/fenix-framework">https://fenix-ashes.ist.utl.pt/trac/fenix-framework</a>). Note that we are 

still at very early design phase, so we are open to ideas, comments and 

especially to learn from your experiences with PojoCache.

As developers of Infinispan, your feedback is extremely valuable to us. 

On one hand, as nobody better than you could provide us indications on 

how to fit within Infinispan's codebase any new experimental feature we 

will be developing in the least intrusive fashion. On the other hand, as 

you can help us to identify what are the most critical issues for 

realistic deployments of Infinispan in Cloud environments, pointing out, 

for instance, which ones, among the current Infinispan 

paramers/functionalities, would benefit the most from self-tuning 

approaches.

We have already started looking at the internal structure of the 

replication's modules of Infinispan, and in the next days we will be 

posting more about the kind of replication schemes (see point 1 above) 

we would like to integrate in Infinispan, and how we are planning to do so.

In the meanwhile, as a teaser :-), I am sending a reference to a couple 

of recent papers of ours if you are curious to know what kind of 

replication solutions we are currently working on:

- <a class="moz-txt-link-freetext" href="http://www.gsd.inesc-id.pt/~romanop/files/papers/prdc09.pdf">http://www.gsd.inesc-id.pt/~romanop/files/papers/prdc09.pdf</a>

- <a class="moz-txt-link-freetext" href="http://www.gsd.inesc-id.pt/~romanop/files/papers/middleware10.pdf">http://www.gsd.inesc-id.pt/~romanop/files/papers/middleware10.pdf</a>

Cheers,

   Paolo

-- 

Paolo Romano, PhD

Researcher at INESC-ID

Rua Alves Redol, 9

1000-059, Lisbon Portugal

Tel. + 351 21 3100300

Fax  + 351 21 3145843

Webpage <a class="moz-txt-link-freetext" href="http://www.gsd.inesc-id.pt/~romanop">http://www.gsd.inesc-id.pt/~romanop</a>

_______________________________________________

infinispan-dev mailing list

<a class="moz-txt-link-abbreviated" href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>

<a class="moz-txt-link-freetext" href="https://lists.jboss.org/mailman/listinfo/infinispan-dev">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a>

</pre>

      </blockquote>

      <pre wrap="">

--

Manik Surtani

<a class="moz-txt-link-abbreviated" href="mailto:manik@jboss.org">manik@jboss.org</a>

Lead, Infinispan

Lead, JBoss Cache

<a class="moz-txt-link-freetext" href="http://www.infinispan.org">http://www.infinispan.org</a>

<a class="moz-txt-link-freetext" href="http://www.jbosscache.org">http://www.jbosscache.org</a>

_______________________________________________

infinispan-dev mailing list

<a class="moz-txt-link-abbreviated" href="mailto:infinispan-dev@lists.jboss.org">infinispan-dev@lists.jboss.org</a>

<a class="moz-txt-link-freetext" href="https://lists.jboss.org/mailman/listinfo/infinispan-dev">https://lists.jboss.org/mailman/listinfo/infinispan-dev</a>

</pre>

    </blockquote>

    <br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Paolo Romano, PhD

Researcher at INESC-ID

Rua Alves Redol, 9

1000-059, Lisbon Portugal

Tel. + 351 21 3100300

Fax  + 351 21 3145843

Webpage <a class="moz-txt-link-freetext" href="http://www.gsd.inesc-id.pt/~romanop">http://www.gsd.inesc-id.pt/~romanop</a></pre>

  </body>

</html>