Hi Erik

This makes a lot of sense. In fact, I was really close to implementing it while I was replacing RebalancePolicy with AvailabilityStrategy. Unfortunately I hit some problems and I had to postpone it (mostly because I was also trying to make the flag per-cache).

The only question is what happens after a merge, if one partition has rebalancing enabled, and the other has rebalancing disabled. 

I think I would prefer to keep it disabled if at least one partition had it disabled. E.g. if you start a new node and it doesn't join properly, you wouldn't want it to trigger a rebalance the moment it finally finds the cluster; the rebalance should only happen after you re-enable rebalancing yourself. 
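
Just to make it concrete, I'm thinking of something along these lines when the partitions merge (a rough sketch, the class/method names are made up, not the actual topology manager code):

import java.util.List;

public class MergePolicySketch {
   // Sketch: the merged cluster keeps rebalancing disabled if any of the
   // merging partitions had it disabled.
   static boolean mergedRebalancingEnabled(List<Boolean> partitionFlags) {
      boolean enabled = true;
      for (boolean partitionEnabled : partitionFlags) {
         enabled &= partitionEnabled;
      }
      return enabled;
   }
}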

Cheers
Dan


On Tue, Oct 28, 2014 at 12:00 AM, Erik Salter <an1310@hotmail.com> wrote:
Hi all,

This topic came up in a separate discussion with Mircea, and he suggested
I post something on the mailing list for a wider audience.

I have a business case where I need the value of the rebalancing flag read
by the joining nodes.  Let's say we have a TACH where we want our keys
striped across machines, racks, etc.  Due to how NBST works, if we start a
bunch of nodes on one side of the topology marker, we'll end up with
the case where all keys will dog-pile on the first node that joins before
being disseminated to the other nodes.  In other words, the first joining
node on the other side of the topology acts as a "pivot."  That's bad,
especially if the key is marked as DELTA_WRITE, where the receiving node
must pull the key from the readCH before applying the changelog.
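
(For reference, the machine/rack striping above comes from the transport
topology settings; something roughly like the following, with purely
illustrative ids:)

import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;

public class TopologySetup {
   // Illustrative ids only: the topology-aware CH uses them to spread the
   // owners of each key across machines, racks and sites.
   static GlobalConfiguration topologyAware() {
      return new GlobalConfigurationBuilder()
            .transport()
               .clusterName("demo-cluster")
               .machineId("machine-a1")
               .rackId("rack-a")
               .siteId("site-a")
            .build();
   }
}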

So not only do we have a single choke-point, but it's made worse by the
initial burst of every write requiring numOwners threads for remote reads.

If we disable rebalancing and start up the nodes on the other side of the
topology, we can process this in a single view change.  But there's a
catch -- and this is the reason I added the state of the flag.  We've run
into a case where the current coordinator changed (crash or a MERGE) as
the other nodes were starting up.  And the new coordinator was elected from
the new side of the topology.  So we had two separate but balanced CHs on
both sides of the topology.  And data integrity went out the window.

Hence the flag.  Note also that this deployment requires the
awaitInitialTransfer flag to be false.
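
For reference, the relevant knobs look roughly like this (the JMX
ObjectName and attribute are my best guess and may differ slightly per
version):

import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class RebalancingSetup {
   public static void main(String[] args) throws Exception {
      // Joiners must not block waiting for initial state, since rebalancing
      // is turned off while the other side of the topology starts up.
      Configuration cfg = new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)
               .stateTransfer().awaitInitialTransfer(false)
            .build();

      // Rebalancing itself is toggled through the LocalTopologyManager MBean;
      // the ObjectName/attribute below is an assumption, adjust per version.
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      ObjectName topology = new ObjectName(
            "org.infinispan:type=CacheManager,name=\"DefaultCacheManager\"," +
            "component=LocalTopologyManager");
      server.setAttribute(topology, new Attribute("rebalancingEnabled", false));
   }
}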

In a real production environment, this has saved me more times than I can
count.  Node failover/failback is now reasonably deterministic with a
simple operational procedure for our customer(s) to follow.


The question is whether this feature would be useful for the community.
Even with the new partition handling, I think this implementation is still
viable and may warrant inclusion into 7.0 (or 7.1).  What does the team
think?  I welcome any and all feedback.

Regards,

Erik Salter
Cisco Systems, SPVTG
(404) 317-0693


_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev