So another good thing that came out of this week's travelling between JUGs is that I was able to sit down with Adrian Cole in Berlin and, over a record-breaking number of coffees in a single sitting, we were able to rehash the distribution algorithm to remove the weird complexities and race conditions we have in the current setup. I've made a note of this below - please have a read through and provide feedback.

Cheers,
Manik

--
Manik Surtani
manik@jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org


DIST, take 2!

The previous design failed due to two flaws:
 1. Transaction logs were maintained on the sender, one per recipient. Plenty of scope for races, or heavy synchronization to avoid them.
 2. The consistent hash tried to be overly fair in evenly dispersing nodes across the hash space. This meant there was often a large and unnecessary amount of rehashing to do, which exacerbated the problem in 1.

So we have a new approach, based on the following premises.

 1. Consistent hash (CH) based on fixed positions in the hash space rather than relative ones.
   1.1. Pros: limited and finite rehashing, particularly when there is a leave.
      1.1.1. If the leaver is L, only node (L - 1) and node (L + 1) will have to push state, and only (L + 1) and (L + replCount) will have to receive state. (There is a rough sketch of this just below.)
   1.2. Cons: uneven spread of load (mitigated with grid size).
   1.3. Virtual nodes (VN) are a good way to reduce this, but I'm leaving VN out for now since it adds complexity. VN won't change the rest of the algorithm anyway.
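To make 1 and 1.1.1 concrete, here is a minimal sketch of what a fixed-position CH could look like: addresses as plain strings, a node's position derived purely from hashing its own address onto a wheel, and a key's owners being the next replCount nodes clockwise from the key's position. This is just an illustration of the idea, not the actual ConsistentHash interface, and the placement function is an assumption:

   import java.util.ArrayList;
   import java.util.Collection;
   import java.util.List;
   import java.util.TreeMap;

   // Minimal sketch of a fixed-position consistent hash. A node's position is a
   // pure function of its own address, so joins and leaves never move other nodes.
   public class FixedPositionHash {
      private final TreeMap<Integer, String> wheel = new TreeMap<Integer, String>();

      public FixedPositionHash(Collection<String> addresses) {
         for (String a : addresses) wheel.put(position(a), a);
      }

      // Hypothetical placement function; any well-distributed hash would do.
      private static int position(String address) {
         return address.hashCode() & 0x7FFFFFFF;
      }

      // Owners of a key: the first replCount distinct nodes clockwise from the
      // key's position on the wheel, wrapping around if necessary.
      public List<String> locate(Object key, int replCount) {
         int needed = Math.min(replCount, wheel.size());
         List<String> owners = new ArrayList<String>(needed);
         int keyPos = key.hashCode() & 0x7FFFFFFF;
         for (String a : wheel.tailMap(keyPos).values()) {
            if (owners.size() == needed) return owners;
            owners.add(a);
         }
         for (String a : wheel.values()) {            // wrap around the wheel
            if (owners.size() == needed) return owners;
            owners.add(a);
         }
         return owners;
      }
   }

Because every node's position is a pure function of its own address, when L leaves the only keys whose owner lists change are the ones that included L, and they simply pick up the next node on the wheel - which is where the limited, finite rehashing in 1.1.1 comes from.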
 2. State is both pulled (on JOIN) and pushed (on LEAVE).

 3. State transfers are implemented using RPC commands and byte array payloads.
   3.1. Will later evolve into a streaming scheme.

 4. Implementation notes:
   4.1. Each node has a DistributionManager (DM).
   4.2. Each DM has a RehashExecutor (with one low-priority thread).
   4.3. LeaveTask and JoinTask are the two types of task handled by this executor, encapsulating the processing that takes place when there is a leave or join.
   4.4. InstallConsistentHashCommand - an RPC command that "installs" a consistent hash instance on remote nodes.
   4.5. GetConsistentHashCommand - an RPC command that "asks" a node to serialize and send over its current CH implementation.
   4.6. PullStateCommand - an RPC command that "asks" for state. The command carries the Address of the requestor. The return value for this command is the state.
   4.7. PushStateCommand - an RPC command that "pushes" state by providing a payload of state.
   4.8. Current RPC responses are either SuccessfulResponses or UnsuccessfulResponses. Introducing a new type, UnsureResponse, to deal with the eventual-consistency semantics of concurrent operations during a rehash. ClusteredGetResponseFilter should look for a quorum of UnsureResponses.

 5. JoinTask: this is a PULL-based rehash, kicked off on the JOINER.
   5.1. Obtain OLD_CH from the coordinator (using GetConsistentHashCommand).
   5.2. Generate TEMP_CH (which is a union of OLD_CH and NEW_CH).
   5.3. Broadcast TEMP_CH across the cluster (using InstallConsistentHashCommand).
   5.4. Log all incoming writes/txs and respond with a positive ack.
   5.5. Ignore incoming reads, forcing callers to check the next owner of the data.
   5.6. Ping each node in OLD_CH's view and ask for state (PullStateCommand).
   5.7. Apply state received from 5.6.
   5.8. Drain the tx log and apply it; stop logging writes once drained. (There is a sketch of this logging/draining handoff after this list.)
   5.9. Reverse 5.5.
   5.10. Broadcast NEW_CH so this is applied (using InstallConsistentHashCommand).
   5.11. Loop through the data container and unicast invalidations for keys that "could" exist under OLD_CH but not under NEW_CH.
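The logging-then-draining in 5.4/5.8 (and again in 7.2/7.5.1 below) is the part that bit us last time, so to pin down what I mean: a minimal sketch, assuming incoming writes can be captured as self-contained, re-appliable command objects (the WriteCommand interface and method names here are just placeholders). The point is that logging is only switched off under the same lock the writers use, so nothing can slip in between "backlog is empty" and "stop logging".

   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Queue;

   // Minimal sketch of the write/tx log used while a rehash is in flight.
   public class TransactionLog {
      public interface WriteCommand { void apply(); }   // hypothetical stand-in

      private final Object lock = new Object();
      private final Queue<WriteCommand> backlog = new ArrayDeque<WriteCommand>();
      private boolean logging = false;

      public void start() {
         synchronized (lock) { logging = true; }
      }

      // Called on the incoming-write path: returns true if the write was logged
      // (ack positively, apply later), false if it should be applied as normal.
      public boolean logIfNeeded(WriteCommand write) {
         synchronized (lock) {
            if (!logging) return false;
            backlog.add(write);
            return true;
         }
      }

      // Called once the pulled/pushed state has been applied (5.7 / 7.5).
      public void drainAndStop() {
         while (true) {
            List<WriteCommand> batch = new ArrayList<WriteCommand>();
            synchronized (lock) {
               batch.addAll(backlog);
               backlog.clear();
               if (batch.isEmpty()) {
                  logging = false;   // backlog empty: safe to stop logging
                  return;
               }
            }
            for (WriteCommand w : batch) w.apply();   // apply outside the lock
         }
      }
   }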
 6. When a leave is detected (view change with one fewer member):
   6.1. Make a note of the leaver (L).
   6.2. All nodes switch to a NEW_CH, which does not include L.
   6.3. Nodes (L - replCount + 1), (L + 1) and (L + replCount) kick off a LeaveTask.
      6.3.1. Not everyone in the cluster needs to be involved in this rehash, thanks to fixed CH positions.

 7. LeaveTask: this is PUSH-based.
   7.1. If an existing LeaveTask is running, the existing task should be cancelled first.
   7.2. if is_receiver(): log all writes/incoming txs.
   7.3. if is_receiver(): incoming reads:
      7.3.1. If the key was owned by L in OLD_CH and not by self, and in NEW_CH is owned by self, return UnsureResponse. Else, SuccessfulResponse.
   7.4. if is_pusher():
      7.4.1. Iterate over all keys in the data container. If any key maps to L in OLD_CH, that key needs to be rehashed. Call addState(stateMap, k). See 8.2 below for details of this algorithm.
      7.4.2. Call PushStateCommand for each recipient in stateMap.
   7.5. if is_receiver(), on receiving a PushStateCommand:
      7.5.1. Drain the tx log and apply it; stop logging writes once drained.
      7.5.2. Reverse 7.3.

 8. A state map is a data structure representing the state generated by each node to be pushed. (A rough Java rendering of 7.4 and 8.2 follows this section.)
   8.1. StateMap {
           Map<Address, Map<Object, InternalCacheEntry>> state
        }

   8.2. addState(stateMap, k):
           OL = OLD_CH.locate(k)
           if (OL.contains(L) and ((position(L, OL) == last and current_address == L - 1) or (position(L, OL) != last and current_address == L + 1))) {
              stateMap.add(NEW_CH.locate(k) - OL, k)
              return
           }
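For clarity, here is roughly what 7.4 plus addState could look like in Java. Addresses are plain strings, the data container is just a Map, and the CH interface below is a stand-in rather than the real ConsistentHash API - purely illustrative:

   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   // Rough Java-ish rendering of 7.4 and addState (8.2).
   public class LeavePushSketch {
      public interface CH { List<String> locate(Object key, int replCount); }

      // Builds the per-recipient state map this node should push when L leaves.
      // self is this node; leaverPrev/leaverNext are (L - 1) and (L + 1) in the old view.
      public static Map<String, Map<Object, Object>> buildStateMap(
            Map<Object, Object> dataContainer, CH oldCH, CH newCH, int replCount,
            String self, String leaver, String leaverPrev, String leaverNext) {

         Map<String, Map<Object, Object>> stateMap = new HashMap<String, Map<Object, Object>>();
         for (Map.Entry<Object, Object> e : dataContainer.entrySet()) {
            Object k = e.getKey();
            List<String> oldOwners = oldCH.locate(k, replCount);        // OL
            if (!oldOwners.contains(leaver)) continue;                  // key unaffected by the leave

            // 8.2: L's predecessor pushes if L was the last owner, otherwise L's successor does.
            boolean leaverWasLast = oldOwners.indexOf(leaver) == oldOwners.size() - 1;
            String responsible = leaverWasLast ? leaverPrev : leaverNext;
            if (!self.equals(responsible)) continue;

            Set<String> recipients = new HashSet<String>(newCH.locate(k, replCount));
            recipients.removeAll(oldOwners);                            // NEW_CH.locate(k) - OL
            for (String r : recipients) {
               Map<Object, Object> chunk = stateMap.get(r);
               if (chunk == null) {
                  chunk = new HashMap<Object, Object>();
                  stateMap.put(r, chunk);
               }
               chunk.put(k, e.getValue());
            }
         }
         return stateMap;   // 7.4.2: send one PushStateCommand per entry in this map
      }
   }

The returned map is exactly the StateMap in 8.1: one chunk per recipient, so 7.4.2 is just one PushStateCommand per entry.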
 9. Functions to determine who pushes and who receives state during a leave:
   9.1. is_pusher(address): return address == (L + 1) or address == (L - 1)
   9.2. is_receiver(address): return address == (L + 1) or address == (L + replCount)


Concurrency

1. Dealing with concurrent JOINs:
   1.1. Concurrent JOINs in completely disjoint parts of the hash space will have no overlap whatsoever.
   1.2. Concurrent JOINs in the same part of the hash space would work as well, since JOINs are controlled by the joiner.

2. Dealing with concurrent LEAVEs:
   2.1. Concurrent LEAVEs in completely disjoint parts of the hash space will have no overlap whatsoever.
   2.2. Concurrent LEAVEs that do not affect the same nodes involved in providing or receiving state will not be a problem.
   2.3. Concurrent LEAVEs affecting nodes already providing or receiving state will interrupt the LeaveTasks on those nodes. Leave tasks will need to clean up appropriately, which may involve issuing a "cleanup command" to ensure partially transmitted state is reconciled.

3. Dealing with a concurrent LEAVE and JOIN:
   3.1. A concurrent JOIN and LEAVE in completely disjoint parts of the hash space will have no overlap whatsoever.
   3.2. For overlapping areas of the hash space, this needs further thought.