Hi all,
I've spent quite a bit of time with the existing XSite implementation,
getting my solution to run in multiple data centers. I've been talking with
Mircea and Bela on and off over the past few weeks, but since this affects
the community and potentially commercial customers, I wanted to share with a
wider audience.
The problems I see are as follows:
1. A bridge end will see all the traffic - sending and receiving - for
all nodes within a cluster.
2. The bridge end of a site will apply each change with the bridge end
as the transaction originator.
In my deployment, this can be three sites backing up their data to the other
two. So for 3 sites of 12 nodes each, a single bridge end will see all 36
nodes worth of traffic. This breaks linear scalability. In my QA's
testing, a 3 DC cluster of 6 nodes is about 1/10 the throughput of a single
cluster.
I think I-RAC solves some of the problem, like the reliable sending of data,
but it doesn't really help with performance in high throughput cases.
(Note: FWIW, my apps do about a 8-12:1 read/write ratio).
So I've prototyped the following:
1. Load-balanced applying the changes from a remote SiteMaster across
all local nodes in a cluster. The basics are that there is still a single
SiteMaster (thereby not breaking the existing JGroups model). This is okay,
since it's the same bandwidth pipe, and as long as there is no
unmarshalling, it's a simple buffer copy. The difference is that the
messages are now forwarded to other nodes in the local cluster and delivered
to the ISPN layer there for unmarshalling and data application. Note that
this does NOT break XSite synchronous replication, as I'm still preserving
the originating site.
2. I also needed more intelligent application of the data that is
replicated. My local cluster will save data to 8-9 caches that need to be
replicated. Instead of replicating data on cache boundaries, I consolidated
the data to only replicate an aggregate object. In turn, I have a custom
BackupReceiver implementation that takes this object and expands it into the
requisite data for the 8-9 caches. Since these caches are a mixture of
optimistic and pessimistic modes, I made liberal use of the Distributed
Executor framework to execute on the data owner for any pessimistic caches.
The early results are very promising, especially WRT async replication.
(Sync replication just sucks - trying to design that out of the system)
There are a few changes made to support a custom BackupReceiver
implementation [1]. There are some other ideas I had floating around in my
head, but it's such a fine line between clever and stupid.
1. I'm prototyping an option where the SiteMaster would load-balance
among the other cluster members, but exclude itself. Preliminary testing
shows that this really only helps when the cluster size > numOwners + 1.
2. I think the staggered get work will be vital in these deployments.
Mircea had the idea of suppressing reads on the SiteMaster node.
3. In the case of numerous modifications, I've seen them processed "out
of order". This is where I-RAC, with batched modifications, can mitigate
this. (Maybe Total Order?)
4. In the case of single-key modifications, I was playing around with
adding a int header to the RELAY2 message to give any application a "hint"
about the hash code of the key. When ISPN received the data, there would be
an initial deserialization penalty as ISPN would need to map the hash code
to the address of the data owner for that cache (in 5.2 onwards, there is no
guarantee that the segments for caches in a cache manager will map as in
5.1.x). This would build a "routing table" of sorts to RELAY2, so if it
sees this key, it'll know to route it to the owner directly. (And on a view
change, this table would be cleared) Really, though, this was a thought for
improving synchronous replication performance.
Any other thoughts? Feedback?
[1] <
https://github.com/an1310/infinispan/tree/t_receiver>
https://github.com/an1310/infinispan/tree/t_receiver
Regards,
Erik