[infinispan-dev] Distribution, take 2
Manik Surtani
manik at jboss.org
Mon Jul 20 10:38:36 EDT 2009
On 20 Jul 2009, at 14:51, Bela Ban wrote:
> General comments:
> - It seems that this is a mechanism where you replicate to the
> repl_count instances 'next' to you, ie. similar to Buddy Replication ?
Yes. So if you consider the hash space on a clockwise-incrementing
wheel [1], nodes {A, B, C, D} each have fixed positions on this wheel
(perhaps derived from the hash codes of their Addresses). So this
becomes deterministic. Any key (K1 ~ K4 in the diagram in [1]) can be
located by ascertaining its position on the hash wheel, moving
clockwise, and taking the first "repl_count" nodes encountered.
So yes, this is "similar" to BR in that it uses a notion of fixed
number of "owner" nodes, but different in several other ways, such as:
* Each node is an "equal" owner
* Owner nodes are not determined by the order they appear in a cluster
view. Rather, this is based on their positions in the hash space.
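The clockwise walk described above can be sketched as follows. This is an illustrative sketch only, not the Infinispan implementation: the wheel size, the name-based positioning, and the class/method names are all assumptions, and Addresses are simplified to Strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: locate the first repl_count owners of a key
// by walking clockwise around a fixed-position hash wheel.
public class HashWheelSketch {
    static final int WHEEL_SIZE = 10240; // assumed size of the hash space

    // Map a node or key name to a fixed position on the wheel.
    static int position(String name) {
        return Math.abs(name.hashCode() % WHEEL_SIZE);
    }

    // Walk clockwise from the key's position and collect the first
    // replCount nodes encountered, wrapping around the wheel if needed.
    static List<String> locate(String key, List<String> nodes, int replCount) {
        TreeMap<Integer, String> wheel = new TreeMap<>();
        for (String n : nodes) wheel.put(position(n), n);
        List<String> owners = new ArrayList<>();
        // Nodes clockwise from the key's position...
        for (String n : wheel.tailMap(position(key)).values()) {
            if (owners.size() == replCount) return owners;
            owners.add(n);
        }
        // ...then wrap around to the start of the wheel.
        for (String n : wheel.values()) {
            if (owners.size() == replCount) break;
            if (!owners.contains(n)) owners.add(n);
        }
        return owners;
    }
}
```

Because node positions are fixed rather than view-relative, any node running this walk over the same node list arrives at the same owners, which is what makes location deterministic.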
> - Are you tying your rebalancing mechanism to the consistent hash
> implementation ? This would IMO be bad because it doesn't allow for
> plugging in of a different CH algorithm !
Unfortunately the steps I have outlined below do imply some
knowledge of the hash algorithm, to determine which nodes are affected
by a LEAVE event, to minimise rebalancing. Perhaps this can be
provided by the ConsistentHash interface, so that any new
implementation would need to provide a list of affected nodes when a
given node leaves.
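The hook suggested above might look something like this. This is a hypothetical shape, not the actual interface in [2]: the method names are illustrative and Addresses are simplified to Strings.

```java
import java.util.List;

// Hypothetical sketch of the affected-nodes hook on the ConsistentHash
// interface [2]. Each implementation reports which nodes must take part
// in rebalancing on a leave, keeping the rebalance logic CH-agnostic.
interface ConsistentHashSketch {
    // Locate the replCount owners of a key, as in [2].
    List<String> locate(Object key, int replCount);

    // Assumed addition: the nodes whose ownership changes when
    // 'leaver' departs, i.e. those that must push or receive state.
    List<String> getAffectedNodes(String leaver, int replCount);
}
```

With this in place a plugged-in CH algorithm could use any ownership scheme it likes, as long as it can answer "who is affected by this leave".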
> More comments inline
>
>
> Manik Surtani wrote:
>> DIST, take 2!
>>
>> Previous design failed due to two flaws:
>> 1. Transaction logs maintained on sender, for each recipient.
>> Plenty of scope for races, or heavily synchronized.
>> 2. Consistent hash attempted to be overly fair in evenly
>> dispersing nodes across a hash space. Meant that there was an
>> often large and unnecessary amount of rehashing to do, which
>> exacerbated the problem in 1.
>>
>> So we have a new approach, based on the following premises.
>>
>> 1. Consistent hash (CH) based on fixed positions in the hash space
>> rather than relative ones.
>
> Do you have a description of how the fixed-positions CH works ? From
> the bullets below it seems this is like Buddy Replication, where you
> store the data on the N buddies next to you.
See comments above about this.
>> 1.1. Pros: limited and finite rehashing, particularly when
>> there is a leave.
>> 1.1.1. If the leaver is L, only node (L - 1) and node (L +
>> 1) will have to push state, and only (L + 1) and (L + replCount)
>> will have to receive state.
>> 1.2. Cons: uneven spread of load (mitigated with grid size)
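The leave rule in 1.1.1 can be sketched over a fixed clockwise node ordering. This is an illustrative sketch under the assumptions above (helper names are made up, Addresses are simplified to Strings, and the ring list is assumed to be sorted by wheel position).

```java
import java.util.List;

// Sketch of the 1.1.1 leave rule on a ring sorted by wheel position.
public class LeaveRehashSketch {
    // Nodes that must push state when 'leaver' departs: its
    // predecessor (L - 1) and successor (L + 1) on the wheel.
    static List<String> pushers(List<String> ring, String leaver) {
        int i = ring.indexOf(leaver), n = ring.size();
        return List.of(ring.get((i - 1 + n) % n), ring.get((i + 1) % n));
    }

    // Nodes that must receive state: the successor (L + 1), which
    // gains primary ownership of the leaver's keys, and the node
    // replCount positions clockwise (L + replCount), which becomes
    // the new last replica for keys the leaver co-owned.
    static List<String> receivers(List<String> ring, String leaver, int replCount) {
        int i = ring.indexOf(leaver), n = ring.size();
        return List.of(ring.get((i + 1) % n), ring.get((i + replCount) % n));
    }
}
```

The point of 1.1 is visible here: however large the cluster, a leave touches a fixed, small set of nodes rather than triggering a cluster-wide rehash.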
>
>> 4. Implementation notes:
>> 4.4. InstallConsistentHashCommand - an RPC command that
>> "installs" a consistent hash instance on remote nodes.
>
> What does 'installing a consistent hash instance' mean ?
Literally, it means swapping out the CH instance used on a given node
with an alternate one. A CH instance encapsulates a "node list" and
the logic to locate a key within the node list, using hash functions
to determine which nodes would hold a given key. See this interface
here [2], and a default (if currently imperfect) implementation [3].
CHs are immutable so when a new view is available, a new CH instance
with a new node list should be created.
In addition, aggregate CHs are allowed, to consider the union of two
different views. This is represented by a delegating CH impl, and is
used in step 5.2 when a JoinTask is kicked off on a new joiner.
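A delegating union CH of the kind described above might be sketched like this. The class and method names are illustrative, not the Infinispan API, and Addresses are again simplified to Strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Minimal CH abstraction for the sketch: locate the owners of a key.
interface CH {
    List<String> locate(Object key, int replCount);
}

// Sketch of a delegating "union" CH: it wraps two immutable CH
// instances (one per view) and merges their answers, so that during a
// rehash no owner under either the old or the new view misses a write.
class UnionCH implements CH {
    private final CH oldCH, newCH;

    UnionCH(CH oldCH, CH newCH) {
        this.oldCH = oldCH;
        this.newCH = newCH;
    }

    public List<String> locate(Object key, int replCount) {
        LinkedHashSet<String> union = new LinkedHashSet<>(oldCH.locate(key, replCount));
        union.addAll(newCH.locate(key, replCount));
        return new ArrayList<>(union);
    }
}
```

Keeping each CH immutable makes this cheap: the union instance just holds references to the two view-specific instances and never mutates either.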
>> 4.5. GetConsistentHashCommand - an RPC command that "asks" a
>> node to serialize and transmit across its current CH impl.
>
> Serialize what ? The state (I assume) ? Or the consistent hash ? How
> can you serialize a CH in the latter case ?
This refers to the CH impl. The CH impl that needs to be serialized
would typically be an instance of [3]. Since the CH impl just
contains a List of Addresses, this should be possible. This is
necessary so that the joiner knows the state of the cluster before it
joined, and is able to ask specific members for state.
An alternative is to assume that the previous state is the
same as the new state, minus the new joiner. That saves on RPC calls.
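To illustrate why shipping the CH impl is feasible: since a CH like [3] is essentially a node list plus stateless hashing logic, marshalling it reduces to marshalling the list of Addresses. A sketch using plain Java serialization (Infinispan's actual marshalling differs; Addresses are simplified to Strings):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.List;

// Sketch: a CH impl whose only state is its node list can be shipped
// to a joiner by serializing that list and rebuilding the CH remotely.
public class ChTransferSketch {
    static byte[] marshall(List<String> addresses) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new java.util.ArrayList<>(addresses));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static List<String> unmarshall(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (List<String>) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The joiner can then instantiate its own CH around the received node list, giving it the pre-join view of the cluster so it knows which members to ask for state.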
Cheers
Manik
[1] http://tinyurl.com/n9lycn
[2] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/distribution/ConsistentHash.java?r=144
[3] http://fisheye.jboss.org/browse/Infinispan/trunk/core/src/main/java/org/infinispan/distribution/DefaultConsistentHash.java?r=207
--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org