[infinispan-dev] XSite Performance

Bela Ban bban at redhat.com
Tue May 21 02:20:58 EDT 2013



On 5/18/13 3:25 AM, Erik Salter wrote:
> Hi all,
>
> I’ve spent quite a bit of time with the existing XSite implementation,
> getting my solution to run in multiple data centers.  I’ve been talking
> with Mircea and Bela on and off over the past few weeks, but since this
> affects the community and potentially commercial customers, I wanted to
> share with a wider audience.
>
>   The problems I see are as follows:
>
> 1.A bridge end will see all the traffic – sending and receiving – for
> all nodes within a cluster.
>
> 2.The bridge end of a site will apply each change with the bridge end as
> the transaction originator.


The reasons for this were:
- The WAN link between site masters has high latency and possibly low 
bandwidth; we want to send the data only once across that link
- Minimizing the number of site masters simplifies cross-site cluster 
configuration (not yet implemented)
- Minimizing the number of site masters simplifies firewall management 
(not yet done)


> In my deployment, this can be three sites backing up their data to the other two.

For stage 1, we said we would not support this unless the data was 
disjoint, i.e. SFO owning and backing up keys A-K to LON, and LON owning 
and backing up keys L-Z to SFO. If both sites had the same or an 
overlapping key set, then we'd only support it if the sites were 
modifying that key set at *different times*.

Are you accessing (changing) the same keys in different sites at the 
same time? Does that mean we've already progressed to the second 
stage? :-)


> So for 3 sites of 12 nodes each, a single bridge end will
> see all 36 nodes worth of traffic.


Wouldn't it be 24 (the nodes of the 2 remote sites)? We also have to 
differentiate between incoming and outgoing traffic, and between 
internal (site-local) and external (cross-site) traffic.

A bridge end (= site master) sees internal traffic, but we could exclude 
the site master from most of it by installing a consistent hash function 
which keeps the site master from storing any data, so that handling 
internal traffic doesn't slow it down.
Has this been done yet?
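
Something like the rough sketch below is what I have in mind. It is 
untested and the names are mine, not Infinispan API; the only point is 
that a custom ConsistentHashFactory would filter the site master out of 
the member list before segments are assigned, so it is never picked as 
an owner:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper: given the current members and the site masters,
// return the members that are allowed to own segments. A custom
// ConsistentHashFactory implementation would apply this filter before
// distributing segments.
public final class SiteMasterExclusion {

   public static <A> List<A> eligibleOwners(List<A> members,
                                            Collection<A> siteMasters) {
      List<A> eligible = new ArrayList<A>(members);
      eligible.removeAll(siteMasters);
      // Safety net: if filtering would leave no owners at all, fall back
      // to the full member list.
      return eligible.isEmpty() ? members : eligible;
   }
}

The filtering itself is trivial; the work is in wiring such a factory 
into the cache's hash configuration.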

A site master does get external traffic (from the other 2 sites), but I 
don't think it should be overwhelmed by it because:

#1 Only the result of a successful transaction is sent to the backup 
sites. Intermediate modifications/prepares are *not* seen by the site 
master (if we exclude it from storing data, as mentioned above). I would 
say that the traffic generated by a committed transaction is usually a 
fraction of the traffic leading up to it.

#1a The above refers to ASYNC replication between sites. We don't 
recommend SYNC replication; the reasons are detailed in the I-RAC wiki [1].

#2 Infinispan caches are designed for high-read / low-write access, 
which naturally keeps traffic down. Am I smelling an abuse of this 
design pattern here? :-)

>  This breaks linear scalability.  In
> my QA’s testing, a 3 DC cluster of 6 nodes is about 1/10 the throughput
> of a single cluster.


Is this with a high read ratio, async transactional xsite replication, 
and the site master excluded from storing data?


> I think I-RAC solves some of the problem, like the reliable sending of
> data, but it doesn’t really help with performance in high throughput
> cases.  (Note: FWIW, my apps do about an 8-12:1 read/write ratio).


Hmm, so you are indeed using the cache in the right way... The traffic 
going between the site masters should then be minimal, even if we get 
traffic from 2 sites...


> So I’ve prototyped the following:
>
> 1.Load-balanced applying the changes from a remote SiteMaster across all
> local nodes in a cluster.   The basics are that there is still a single
> SiteMaster (thereby not breaking the existing JGroups model).  This is
> okay, since it’s the same bandwidth pipe, and as long as there is no
> unmarshalling, it’s a simple buffer copy.  The difference is that the
> messages are now forwarded to other nodes in the local cluster and
> delivered to the ISPN layer there for unmarshalling and data
> application.   Note that this does NOT break XSite synchronous
> replication, as I’m still preserving the originating site.

Do you forward traffic to random nodes in the site?
This is what we discussed in our last call, and I'm curious to see what 
the numbers are.

Have you given excluding the site master from storing data a try, too? 
It should be relatively simple to install a consistent hash which 
excludes the SM (along the lines of the sketch above).

Note that I don't think SYNC xsite replication is feasible, for the 
reasons listed in [1]!


> 2.I also needed more intelligent application of the data that is
> replicated.  My local cluster will save data to 8-9 caches that need to
> be replicated.  Instead of replicating data on cache boundaries, I
> consolidated the data to only replicate an aggregate object.  In turn, I
> have a custom  BackupReceiver implementation that takes this object and
> expands it into the requisite data for the 8-9 caches.  Since these
> caches are a mixture of optimistic and pessimistic modes, I made liberal
> use of the Distributed Executor framework to execute on the data owner
> for any pessimistic caches.

Not sure I understand. Are you saying that we currently perform RPCs to 
8-9 caches on the receiver site when applying the data? If that's the 
case, then - yes - sending an aggregate object across makes a lot of sense!
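
Just to make sure we mean the same thing, here's a rough sketch of such 
an aggregate (class and method names are purely illustrative, not 
Infinispan API): one serializable object carries the per-cache updates 
of a single logical operation, and the custom BackupReceiver on the 
receiving site walks it and applies each sub-map to the corresponding 
local cache (dispatching to the data owner for the pessimistic caches, 
as you describe):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregate: all updates of one logical operation, keyed
// by cache name, so only a single object crosses the WAN link instead
// of 8-9 separate cache-level replications.
public class AggregateUpdate implements Serializable {
   private static final long serialVersionUID = 1L;

   // cache name -> (key -> value) for the caches touched by the operation
   private final Map<String, Map<Object, Object>> updates =
         new HashMap<String, Map<Object, Object>>();

   public void add(String cacheName, Object key, Object value) {
      Map<Object, Object> entries = updates.get(cacheName);
      if (entries == null) {
         entries = new HashMap<Object, Object>();
         updates.put(cacheName, entries);
      }
      entries.put(key, value);
   }

   // The receiving site iterates this map and writes each sub-map into
   // its local cache of the same name.
   public Map<String, Map<Object, Object>> updatesPerCache() {
      return updates;
   }
}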


> The early results are very promising, especially WRT async replication.

Numbers?


> (Sync replication just sucks – trying to design that out of the system)

+1


> 1.I’m prototyping an option where the SiteMaster would load-balance
> among the other cluster members, but exclude itself.  Preliminary
> testing shows that this really only helps when the cluster size >
> numOwners + 1.


Is this the same as #1 above, except that the site master excludes itself?

Numbers? I think the condition cluster size > numOwners + 1 is 
frequently true, e.g. with numOwners=2 it already holds for any cluster 
of 4 or more nodes.



> 2.I think the staggered get work will be vital in these deployments.
> Mircea had the idea of suppressing reads on the SiteMaster node.
>
> 3.In the case of numerous modifications, I’ve seen them processed “out
> of order”.  This is where I-RAC, with batched modifications, can
> mitigate this.  (Maybe Total Order?)
>
> 4.In the case of single-key modifications, I was playing around with
> adding an int header to the RELAY2 message to give any application a
> “hint” about the hash code of the key.  When ISPN received the data,
> there would be an initial deserialization penalty as ISPN would need to
> map the hash code to the address of the data owner for that cache (in
> 5.2 onwards, there is no guarantee that the segments for caches in a
> cache manager will map as in 5.1.x).  This would build a “routing table”
> of sorts to RELAY2, so if it sees this key, it’ll know to route it to
> the owner directly.  (And on a view change, this table would be
> cleared)  Really, though, this was a thought for improving synchronous
> replication performance.


Not sure I understand this fully. In any case, RELAY2 cannot and should 
not know about key distribution; this is Infinispan's realm.

However, we could take your idea of #1 above and apply it at the JGroups 
level: when a replication message is received by a site master, it picks 
a random node (possibly excluding itself) in the cluster and forwards 
the *not-yet deserialized* message to that node which then applies the 
changes.
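
In rough (and untested) code, that forwarding step could look like the 
sketch below. This is not RELAY2 code, just the plain JChannel API, to 
illustrate that the payload stays a byte buffer; a real implementation 
inside RELAY2 would also have to carry the relevant headers along:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.View;

// Hypothetical sketch: the site master picks a random local node and
// forwards the still-serialized payload of an incoming cross-site
// message to it, so deserialization and application of the changes
// happen on that node rather than on the site master.
public class RawForwarder {
   private final JChannel localChannel;
   private final Random random = new Random();

   public RawForwarder(JChannel localChannel) {
      this.localChannel = localChannel;
   }

   public void forward(Message incoming, boolean excludeSelf) throws Exception {
      View view = localChannel.getView();
      List<Address> targets = new ArrayList<Address>(view.getMembers());
      if (excludeSelf)
         targets.remove(localChannel.getAddress());
      if (targets.isEmpty())
         return; // nobody to forward to; apply the changes locally instead

      Address target = targets.get(random.nextInt(targets.size()));

      // Copy the raw (not yet deserialized) buffer into a unicast message.
      Message forwarded = new Message(target, null, incoming.getRawBuffer(),
                                      incoming.getOffset(), incoming.getLength());
      localChannel.send(forwarded);
   }
}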

Do you have any numbers on how the changes you made affect performance?


> Any other thoughts?  Feedback?
>
> [1] https://github.com/an1310/infinispan/tree/t_receiver
>
> Regards,
>
> Erik

[1] https://community.jboss.org/wiki/RAC

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

