On 5/18/13 3:25 AM, Erik Salter wrote:
Hi all,
I’ve spent quite a bit of time with the existing XSite implementation,
getting my solution to run in multiple data centers. I’ve been talking
with Mircea and Bela on and off over the past few weeks, but since this
affects the community and potentially commercial customers, I wanted to
share with a wider audience.
The problems I see are as follows:
1. A bridge end will see all the traffic – sending and receiving – for
all nodes within a cluster.
2. The bridge end of a site will apply each change with the bridge end as
the transaction originator.
The reasons for this were:
- WAN link between site masters (high latency, possibly low bandwidth);
send the data only once across that link
- Minimize number of site masters, simplifies cross-site cluster
configuration (not yet implemented)
- Minimize number of site masters, simplifies firewall management (not
yet done)
In my deployment, this can be three sites backing up their data to
the other two.
For stage 1, we said we would not support this unless the data was
disjoint, i.e. SFO owning and backing up keys A-K to LON and LON owning
and backing up keys L-Z to SFO. If both sites had the same or an
overlapping key set, then we'd support it only if the sites were
modifying the key set at *different times*.
Are you accessing (changing) the same keys in different sites at the
same time ? This would mean we've progressed to the second stage already
? :-)
So for 3 sites of 12 nodes each, a single bridge end will
see all 36 nodes' worth of traffic.
Wouldn't it be 24 ? We also have to differentiate between incoming and
outgoing traffic, and between internal (site-local) and external traffic
(between sites).
A bridge end (= site master) sees internal traffic, but we could shield
it from that by installing a consistent hash function which excludes the
site master from storing any data, so handling internal traffic doesn't
slow it down.
Has this been done yet ?
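For reference, a minimal sketch of what such an exclusion could look
like, assuming the 5.2-era ConsistentHashFactory API (the factory class
and the way the site master is identified are illustrative, not existing
Infinispan code):

import java.util.ArrayList;
import java.util.List;

import org.infinispan.commons.hash.Hash; // org.infinispan.util.hash.Hash in older versions
import org.infinispan.distribution.ch.DefaultConsistentHash;
import org.infinispan.distribution.ch.DefaultConsistentHashFactory;
import org.infinispan.remoting.transport.Address;

// Hypothetical sketch: keep the site master out of the owner list so it
// never stores data and sees no internal (site-local) storage traffic.
public class ExcludeSiteMasterCHFactory extends DefaultConsistentHashFactory {

   @Override
   public DefaultConsistentHash create(Hash hashFunction, int numOwners,
                                       int numSegments, List<Address> members) {
      return super.create(hashFunction, numOwners, numSegments,
                          withoutSiteMaster(members));
   }

   @Override
   public DefaultConsistentHash updateMembers(DefaultConsistentHash baseCH,
                                              List<Address> newMembers) {
      return super.updateMembers(baseCH, withoutSiteMaster(newMembers));
   }

   // How the site master is identified (here, a node naming convention)
   // is an assumption; any site-master detection mechanism would do.
   private static List<Address> withoutSiteMaster(List<Address> members) {
      List<Address> result = new ArrayList<Address>(members.size());
      for (Address addr : members)
         if (!addr.toString().startsWith("sitemaster"))
            result.add(addr);
      return result;
   }
}

The factory would then be plugged in per cache via
ConfigurationBuilder.clustering().hash().consistentHashFactory(...).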
A site master does get external traffic (from the other 2 sites), but I
don't think it should be overwhelmed by it because:
#1 Only the result of a successful transaction is sent to the backup
sites. Intermediate modifications/prepares are *not* seen by the site
master (if we exclude it from storing data, as mentioned above). I would
say that the traffic resulting from a transaction is usually a fraction
of the traffic leading up to it.
#1a The above refers to ASYNC replication between sites. We don't
recommend SYNC replication; the reasons are detailed in the I-RAC wiki [1].
#2 Infinispan caches are designed for high read / low write access; this
naturally keeps traffic down. Am I smelling an abuse of this design
pattern here ? :-)
This breaks linear scalability. In
my QA’s testing, a 3 DC cluster of 6 nodes is about 1/10 the throughput
of a single cluster.
Is this with high reads and async transactional xsite replication
excluding the site master ?
I think I-RAC solves some of the problems, like the reliable sending of
data, but it doesn’t really help with performance in high throughput
cases. (Note: FWIW, my apps do about an 8-12:1 read/write ratio).
Hmm, so you are indeed using the cache in the right way... The traffic
going between the site masters should then be minimal, even if we get
traffic from 2 sites...
So I’ve prototyped the following:
1. Load-balancing the application of changes from a remote SiteMaster
across all local nodes in a cluster. The basics are that there is still
a single
SiteMaster (thereby not breaking the existing JGroups model). This is
okay, since it’s the same bandwidth pipe, and as long as there is no
unmarshalling, it’s a simple buffer copy. The difference is that the
messages are now forwarded to other nodes in the local cluster and
delivered to the ISPN layer there for unmarshalling and data
application. Note that this does NOT break XSite synchronous
replication, as I’m still preserving the originating site.
Do you forward traffic to random nodes in the site ?
This is what we discussed in our last call, and I'm curious to see what
the numbers are.
Have you given excluding the site master from storing data a try too ?
It should be relatively simple to install a consistent hash which
excludes the SM.
Note that I don't think SYNC xsite replication is feasible, for the
reasons listed in [1] !
2. I also needed more intelligent application of the data that is
replicated. My local cluster will save data to 8-9 caches that need to
be replicated. Instead of replicating data on cache boundaries, I
consolidated the data to only replicate an aggregate object. In turn, I
have a custom BackupReceiver implementation that takes this object and
expands it into the requisite data for the 8-9 caches. Since these
caches are a mixture of optimistic and pessimistic modes, I made liberal
use of the Distributed Executor framework to execute on the data owner
for any pessimistic caches.
Not sure I understand. Are you saying that we currently perform RPCs to
8-9 caches on the receiver site when applying the data ? If that's the
case, then - yes - sending an aggregate object across makes a lot of sense !
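To illustrate the "execute on the data owner" part with the Distributed
Executor framework: a minimal sketch, assuming the 5.2-era
DistributedCallable API (the task class is hypothetical, not Erik's code):

import java.io.Serializable;
import java.util.Set;

import org.infinispan.Cache;
import org.infinispan.distexec.DefaultExecutorService;
import org.infinispan.distexec.DistributedCallable;
import org.infinispan.distexec.DistributedExecutorService;

// Hypothetical sketch: a write against a pessimistic cache is shipped to
// a node owning the key, so the lock is acquired locally on the owner
// rather than remotely from the receiving (bridge-end) node.
public class PutOnOwnerTask implements DistributedCallable<Object, Object, Void>,
                                       Serializable {
   private final Object key;
   private final Object value;
   private transient Cache<Object, Object> cache;

   public PutOnOwnerTask(Object key, Object value) {
      this.key = key;
      this.value = value;
   }

   @Override
   public void setEnvironment(Cache<Object, Object> cache, Set<Object> inputKeys) {
      this.cache = cache; // injected on the node that runs the task
   }

   @Override
   public Void call() throws Exception {
      cache.put(key, value); // runs on a node owning 'key'
      return null;
   }
}

// Usage on the receiving site, per pessimistic cache:
// DistributedExecutorService des = new DefaultExecutorService(cache);
// des.submit(new PutOnOwnerTask(key, value), key).get();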
The early results are very promising, especially WRT async
replication.
Numbers ?
(Sync replication just sucks – trying to design that out of the
system)
+1
1. I’m prototyping an option where the SiteMaster would load-balance
among the other cluster members, but exclude itself. Preliminary
testing shows that this really only helps when the cluster size >
numOwners + 1.
Is this the same as #1 above except that the site master excludes itself ?
Numbers ? I think the condition cluster size > numOwners + 1 is
frequently true (e.g. with numOwners=2, any cluster of more than 3 nodes
satisfies it).
2. I think the staggered get work will be vital in these deployments.
Mircea had the idea of suppressing reads on the SiteMaster node.
3. In the case of numerous modifications, I’ve seen them processed “out
of order”. I-RAC, with batched modifications, can mitigate this. (Maybe
Total Order?)
4. In the case of single-key modifications, I was playing around with
adding an int header to the RELAY2 message to give any application a
“hint” about the hash code of the key. When ISPN received the data,
there would be an initial deserialization penalty as ISPN would need to
map the hash code to the address of the data owner for that cache (in
5.2 onwards, there is no guarantee that the segments for caches in a
cache manager will map as in 5.1.x). This would build a “routing table”
of sorts to RELAY2, so if it sees this key, it’ll know to route it to
the owner directly. (And on a view change, this table would be
cleared) Really, though, this was a thought for improving synchronous
replication performance.
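For concreteness, a sketch of what such a hint header could look like at
the JGroups level, assuming the JGroups 3.x Header API (the class name
and the way it is attached are made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;

import org.jgroups.Global;
import org.jgroups.Header;

// Hypothetical sketch: a header carrying the key's hash code, so RELAY2
// could consult a routing table and forward the still-serialized message
// to the data owner directly.
public class KeyHashHeader extends Header {
   private int keyHash;

   public KeyHashHeader() { // required for stream-based creation
   }

   public KeyHashHeader(int keyHash) {
      this.keyHash = keyHash;
   }

   public int keyHash() {
      return keyHash;
   }

   @Override
   public int size() {
      return Global.INT_SIZE;
   }

   @Override
   public void writeTo(DataOutput out) throws Exception {
      out.writeInt(keyHash);
   }

   @Override
   public void readFrom(DataInput in) throws Exception {
      keyHash = in.readInt();
   }
}

// The sender would attach it with msg.putHeader(RELAY2_ID,
// new KeyHashHeader(key.hashCode())); the protocol id is an assumption.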
Not sure I understand this fully. In any case, RELAY2 cannot and should
not know about key distribution; this is Infinispan's realm.
However, we could take your idea of #1 above and apply it at the JGroups
level: when a replication message is received by a site master, it picks
a random node (possibly excluding itself) in the cluster and forwards
the *not-yet deserialized* message to that node which then applies the
changes.
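A minimal sketch of that forwarding step, assuming the JGroups 3.x
Message/JChannel API (the class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;

// Hypothetical sketch: the site master relays the still-serialized buffer
// of an incoming cross-site message to a random local member, so
// unmarshalling and data application happen there, not on the site master.
public class SiteMasterForwarder {
   private final JChannel ch; // the local (site-internal) channel
   private final Random rnd = new Random();

   public SiteMasterForwarder(JChannel ch) {
      this.ch = ch;
   }

   public void forward(Message incoming) throws Exception {
      List<Address> targets = new ArrayList<Address>(ch.getView().getMembers());
      targets.remove(ch.getAddress()); // optionally exclude the site master itself
      if (targets.isEmpty())           // single-node site: apply locally
         targets.add(ch.getAddress());

      Address target = targets.get(rnd.nextInt(targets.size()));

      // No unmarshalling: the raw buffer is passed on as-is
      Message fwd = new Message(target, null, incoming.getRawBuffer(),
                                incoming.getOffset(), incoming.getLength());
      ch.send(fwd);
   }
}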
Do you have any numbers on how the changes you made affect performance ?
[1] https://community.jboss.org/wiki/RAC
--
Bela Ban, JGroups lead (http://www.jgroups.org)