Hi all,
I’ve spent quite a bit of time with the existing XSite implementation, getting my solution to run in multiple data centers. I’ve been talking with Mircea and Bela on and off over the past few weeks, but since this affects the community and potentially commercial customers, I wanted to share with a wider audience.
The problems I see are as follows:
1. A bridge end will see all the traffic – sending and receiving – for all nodes within a cluster.
2. The bridge end of a site will apply each change with the bridge end as the transaction originator.
In my deployment, this can be three sites backing up their data to the other two. So for 3 sites of 12 nodes each, a single bridge end will see all 36 nodes' worth of traffic. This breaks linear scalability. In my QA’s testing, a 3-DC cluster of 6 nodes gets about 1/10 the throughput of a single cluster.
I think I-RAC solves some of the problem, like the reliable sending of data, but it doesn’t really help with performance in high-throughput cases. (Note: FWIW, my apps run at about an 8-12:1 read/write ratio.)
So I’ve prototyped the following:
1. Load-balancing the application of changes from a remote SiteMaster across all local nodes in a cluster. The basics are that there is still a single SiteMaster (thereby not breaking the existing JGroups model). Having one SiteMaster receive everything is okay, since it’s the same bandwidth pipe, and as long as there is no unmarshalling, handling a message is a simple buffer copy. The difference is that the messages are now forwarded to other nodes in the local cluster and delivered to the ISPN layer there for unmarshalling and data application. Note that this does NOT break XSite synchronous replication, as I’m still preserving the originating site.
2. I also needed more intelligent application of the data that is replicated. My local cluster will save data to 8-9 caches that need to be replicated. Instead of replicating data on cache boundaries, I consolidated the data to only replicate an aggregate object. In turn, I have a custom BackupReceiver implementation that takes this object and expands it into the requisite data for the 8-9 caches. Since these caches are a mixture of optimistic and pessimistic modes, I made liberal use of the Distributed Executor framework to execute on the data owner for any pessimistic caches.
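To make item 1 a bit more concrete, here is a rough sketch of the forwarding idea in plain Java. It is not the prototype code: SiteMasterForwarder, LocalSender and the String node names are all made up for illustration, and the round-robin policy is just one assumed way of spreading the load. The point it shows is that the SiteMaster never looks inside the buffer; it only picks a local node and hands the bytes over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: the SiteMaster hands each still-marshalled remote message to the next local node. */
final class SiteMasterForwarder {
    /** Hypothetical transport hook; in a real stack this would be a unicast inside the local cluster. */
    interface LocalSender {
        void send(String node, byte[] payload);
    }

    private final List<String> localNodes;
    private final LocalSender sender;
    private final AtomicInteger next = new AtomicInteger();

    SiteMasterForwarder(List<String> localNodes, LocalSender sender) {
        this.localNodes = new ArrayList<>(localNodes);
        this.sender = sender;
    }

    /** Forwards the raw buffer untouched (no unmarshalling here) and returns the node that was chosen. */
    String forward(byte[] payload) {
        // Round-robin over the local members; floorMod keeps the index valid if the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), localNodes.size());
        String target = localNodes.get(index);
        sender.send(target, payload);
        return target;
    }
}
```

The node that receives the buffer would then do the unmarshalling and apply the data, which is where the CPU cost actually sits.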
The early results are very promising, especially WRT async replication. (Sync replication just sucks – trying to design that out of the system)
There are a few changes made to support a custom BackupReceiver implementation [1]. There are some other ideas I had floating around in my head, but it’s such a fine line between clever and stupid.
1. I’m prototyping an option where the SiteMaster would load-balance among the other cluster members, but exclude itself. Preliminary testing shows that this really only helps when the cluster size > numOwners + 1.
2. I think the staggered get work will be vital in these deployments. Mircea had the idea of suppressing reads on the SiteMaster node.
3. In the case of numerous modifications, I’ve seen them processed “out of order”. I-RAC, with batched modifications, could mitigate this. (Maybe Total Order?)
4. In the case of single-key modifications, I was playing around with adding an int header to the RELAY2 message to give any application a “hint” about the hash code of the key. When ISPN first receives the data for a given hint, there would be an initial deserialization penalty, as ISPN would need to map the hash code to the address of the data owner for that cache (in 5.2 onwards, there is no guarantee that the segments for caches in a cache manager will map as in 5.1.x). This would build a “routing table” of sorts in RELAY2, so if it sees this key again, it’ll know to route it to the owner directly. (And on a view change, this table would be cleared.) Really, though, this was a thought for improving synchronous replication performance.
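For what it’s worth, the routing table in item 4 could look something like the sketch below. Again, this is only an illustration under my own assumptions: HintRouter and its method names are invented, nodes are plain Strings, there is one table per cache, and an unknown hint simply falls back to round-robin. Nothing here is an existing RELAY2 or ISPN API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical routing table for one cache: key hash hint -> owner learned from the cache layer. */
final class HintRouter {
    private final Map<Integer, String> ownerByHint = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();
    private volatile List<String> members;

    HintRouter(List<String> members) {
        this.members = new ArrayList<>(members);
    }

    /** Called once the cache layer has resolved which node owns the key behind this hint. */
    void learn(int hint, String owner) {
        ownerByHint.put(hint, owner);
    }

    /** Known hint: route straight to the owner. Unknown hint: fall back to round-robin. */
    String route(int hint) {
        String owner = ownerByHint.get(hint);
        if (owner != null) {
            return owner;
        }
        List<String> current = members;
        return current.get(Math.floorMod(next.getAndIncrement(), current.size()));
    }

    /** On a view change the learned owners may be stale, so the whole table is dropped. */
    void onViewChange(List<String> newMembers) {
        members = new ArrayList<>(newMembers);
        ownerByHint.clear();
    }
}
```

The first message for a hint pays the lookup cost on whichever node it lands on; once learn() has been called, later messages with the same hint skip the extra hop.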
Any other thoughts? Feedback?
[1] https://github.com/an1310/infinispan/tree/t_receiver
Regards,
Erik