[infinispan-dev] Distribution, take 2
Manik Surtani
manik at jboss.org
Fri Jul 17 14:24:26 EDT 2009
So another good thing that came out of this week's travelling between
JUGs is that I was able to sit down with Adrian Cole in Berlin and,
over a record-breaking number of coffees in a single sitting, we were
able to rehash the distribution algorithm and remove the weird
complexities and race conditions we have in the current setup. I've
written this up below - please have a read through and provide
feedback.
Cheers,
Manik
--
Manik Surtani
manik at jboss.org
Lead, Infinispan
Lead, JBoss Cache
http://www.infinispan.org
http://www.jbosscache.org
DIST, take 2!
The previous design failed due to two flaws:
1. Transaction logs were maintained on the sender, one per recipient.
Plenty of scope for races, or heavy synchronization.
2. The consistent hash tried to be overly fair in evenly dispersing
nodes across the hash space. This meant an often large and
unnecessary amount of rehashing, which exacerbated the problem
in 1.
So we have a new approach, based on the following premises.
1. Consistent hash (CH) based on fixed positions in the hash space
rather than relative ones.
1.1. Pros: limited and finite rehashing, particularly when
there is a leave.
1.1.1. If the leaver is L, only node (L - 1) and node (L + 1) will
have to push state, and only (L + 1) and (L + replCount) will have to
receive state.
1.2. Cons: uneven spread of load (mitigated by a larger grid)
1.3. Virtual nodes are a good way to reduce this, but I'm leaving
VN out for now since it adds complexity. VN won't change the rest of
the algorithm anyway. (A sketch of such a fixed-position CH follows.)
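To make 1 concrete, here's a minimal sketch of a fixed-position CH
(illustrative only; the class and method names are mine, not the
eventual API). Each address hashes to a fixed point on a wheel, so a
join or leave only disturbs its immediate neighbours:

   import java.util.ArrayList;
   import java.util.List;
   import java.util.TreeMap;

   class FixedPositionCH {
      // Addresses occupy fixed positions on a hash wheel; a position
      // depends only on the address itself, never on cluster size.
      private final TreeMap<Integer, String> wheel = new TreeMap<Integer, String>();

      void add(String address)    { wheel.put(hash(address), address); }
      void remove(String address) { wheel.remove(hash(address)); }

      // Walk clockwise from the key's position, wrapping at the end,
      // to collect up to replCount owners.
      List<String> locate(Object key, int replCount) {
         int needed = Math.min(replCount, wheel.size());
         List<String> owners = new ArrayList<String>(needed);
         for (String a : wheel.tailMap(hash(key)).values()) {
            owners.add(a);
            if (owners.size() == needed) return owners;
         }
         for (String a : wheel.headMap(hash(key)).values()) {
            owners.add(a);
            if (owners.size() == needed) return owners;
         }
         return owners;
      }

      private int hash(Object o) { return o.hashCode() & 0x7FFFFFFF; }
   }

Because positions are fixed, when a node leaves only the keys whose
owner lists included it change hands - which is exactly what makes
1.1.1 hold.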
2. State is both pulled (on JOIN) and pushed (on LEAVE)
3. State transfers are implemented using RPC commands and byte
array payloads
3.1. This will later evolve into a streaming scheme.
4. Implementation notes:
4.1. Each node has a DistributionManager (DM)
4.2. Each DM has a RehashExecutor (with one low-priority thread)
4.3. LeaveTask and JoinTask are the two types of tasks handled by this
executor, encapsulating the processing that takes place when there is
a leave or join.
4.4. InstallConsistentHashCommand - an RPC command that "installs" a
consistent hash instance on remote nodes.
4.5. GetConsistentHashCommand - an RPC command that "asks" a node to
serialize its current CH impl and transmit it across.
4.6. PullStateCommand - an RPC command that "asks" for state. The
command should carry the Address of the requestor. The return value
for this command is the state.
4.7. PushStateCommand - an RPC command that "pushes" state by
providing a payload of state.
4.8. Current RPC responses are either SuccessfulResponses or
UnsuccessfulResponses. We're introducing a new type, UnsureResponse,
to deal with the eventual-consistency semantics of concurrent
operation during a rehash. ClusteredGetResponseFilter should look for
a quorum of UnsureResponses (sketched below).
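A rough sketch of what 4.8 could look like (type and method names
here are assumptions, not the final API): UnsureResponse is a third
response type, and the filter treats a value as definitive either
when it comes in a SuccessfulResponse, or once a simple majority of
targets has answered Unsure:

   interface Response { }
   class SuccessfulResponse implements Response { }
   class UnsuccessfulResponse implements Response { }
   // New: returned when the responder may not yet authoritatively
   // own the key, e.g. because a rehash is in flight.
   class UnsureResponse implements Response { }

   // Quorum logic a ClusteredGetResponseFilter-like class might apply:
   class UnsureQuorumFilter {
      private final int numTargets;
      private int unsureSeen = 0;

      UnsureQuorumFilter(int numTargets) { this.numTargets = numTargets; }

      boolean isAcceptable(Response r) {
         if (r instanceof SuccessfulResponse) return true;   // definitive
         if (r instanceof UnsureResponse)                     // need a majority
            return ++unsureSeen > numTargets / 2;
         return false;                                        // unsuccessful
      }
   }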
5. JoinTask: This is a PULL-based rehash. The JoinTask is kicked off
on the JOINER (a full sketch follows the steps below).
5.1. Obtain OLD_CH from coordinator (using
GetConsistentHashCommand)
5.2. Generate TEMP_CH (which is a union of OLD_CH and NEW_CH)
5.3. Broadcast TEMP_CH across the cluster (using
InstallConsistentHashCommand)
5.4. Log all incoming writes/txs and respond with a positive ack.
5.5. Ignore incoming reads, forcing callers to check the next owner
of the data.
5.6. Ping each node in OLD_CH's view and ask for state
(PullStateCommand)
5.7. Apply state received from 5.6.
5.8. Drain tx log and apply, stop logging writes once drained.
5.9. Reverse 5.5.
5.10. Broadcast NEW_CH so this is applied (using
InstallConsistentHashCommand)
5.11. Loop through the data container and unicast invalidations for
keys that "could" exist under OLD_CH but not under NEW_CH.
6. When a leave is detected (a view change with one less member):
6.1. Make a note of the Leaver (L)
6.2. All nodes switch to a NEW_CH, which does not include L
6.3. Nodes (L - replCount + 1), (L + 1) and (L + replCount) kick
off a LeaveTask.
6.3.1. Not everyone in the cluster needs to be involved in this
rehash, thanks to fixed CH positions (see the worked example below).
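To make 6.3 concrete, a hypothetical example: assume replCount = 2
and a hash wheel ordered A, B, C, D, E. If C leaves, then
(L - replCount + 1) = B, (L + 1) = D and (L + replCount) = E kick off
LeaveTasks. Per 9.1, B and D are the pushers ((L - 1) and (L + 1));
per 9.2, D and E are the receivers. A takes no part in the rehash at
all.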
7. LeaveTask: This is PUSH based (sketched after the steps below).
7.1. If an existing LeaveTask is running, the existing task should
be cancelled first.
7.2. if is_receiver(): Log all writes/incoming txs
7.3. if is_receiver(): Incoming reads:
7.3.1. If the key was owned by L in OLD_CH and not by self, and in
NEW_CH is owned by self, return UnsureResponse. Else,
SuccessfulResponse.
7.4. if is_pusher():
7.4.1. Iterate over all keys in data container. If any key maps
to L in OLD_CH, that key needs to be rehashed. Call
addState(stateMap, k). See below for details of this algo.
7.4.2. Call PushStateCommand for each recipient in stateMap.
7.5. if is_receiver(), on receiving a PushStateCommand:
7.5.1. Drain tx log and apply, stop logging writes once drained.
7.5.2. Reverse 7.3.
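Again as a sketch (helper names assumed), a LeaveTask would branch on
the node's role:

   class LeaveTask implements Runnable {
      public void run() {
         cancelAnyRunningLeaveTask();                      // 7.1
         if (isReceiver(self())) {
            startTxLog();                                  // 7.2
            answerUnsureForNewlyOwnedKeys();               // 7.3 / 7.3.1
         }
         if (isPusher(self())) {
            StateMap stateMap = new StateMap();
            for (Object k : dataContainer.keySet())        // 7.4.1
               if (oldCH.locate(k).contains(leaver))
                  addState(stateMap, k);
            pushState(stateMap);    // 7.4.2: one PushStateCommand per recipient
         }
      }

      // On a receiver, when a PushStateCommand arrives:      7.5
      void onPushState(Map<Object, InternalCacheEntry> state) {
         applyState(state);
         drainTxLogAndStopLogging();                       // 7.5.1
         answerReadsNormally();                            // 7.5.2, reverse 7.3
      }
   }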
8. A state map is a data structure representing state generated by
each node to be pushed.
8.1. StateMap {
        Map<Address, Map<Object, InternalCacheEntry>> state
     }
8.2. addState(stateMap, k):
        OL = OLD_CH.locate(k)
        // Only the designated pusher adds k: (L - 1) if L is the last
        // owner in OL, otherwise (L + 1).
        if (OL.contains(L) and ((position(L, OL) == last and current_address == L - 1)
              or (position(L, OL) != last and current_address == L + 1))) {
           // recipients = the new owners of k that did not already
           // hold it ("-" below is set difference)
           stateMap.add(NEW_CH.locate(k) - OL, k)
        }
9. Functions to determine who pushes and who receives state during
a leave (see the Java sketch below):
9.1. is_pusher(address): return address == (L + 1) or address == (L - 1)
9.2. is_receiver(address): return address == (L + 1) or address == (L + replCount)
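In Java, with the neighbour arithmetic expressed via (assumed)
successor/predecessor helpers on the CH, the same functions are
simply:

   boolean isPusher(Address a) {
      return a.equals(successor(L)) || a.equals(predecessor(L));
   }

   boolean isReceiver(Address a) {
      return a.equals(successor(L)) || a.equals(nthSuccessor(L, replCount));
   }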
Concurrency
1. Deal with concurrent JOINs
1.1. Concurrent JOINs in completely disjoint parts of the hash space
will have no overlap whatsoever.
1.2. Concurrent JOINs in the same part of the hash space would work
as well, since each JOIN is controlled by its joiner.
2. Deal with concurrent LEAVEs
2.1. Concurrent LEAVEs in completely disjoint parts of the hash
space will have no overlap whatsoever.
2.2. Concurrent LEAVEs that do not affect the same nodes involved in
providing or receiving state will not be a problem.
2.3. Concurrent LEAVEs affecting the same nodes already providing or
receiving state will interrupt the LeaveTasks on those nodes.
LeaveTasks will need to clean up appropriately, which may involve
issuing a "cleanup command" to ensure partially transmitted state is
reconciled.
3. Deal with concurrent LEAVE and JOIN
3.1. Concurrent JOIN and LEAVEs in completely disjoint parts of the
hash space will have no overlap whatsoever.
3.2. For overlapping areas of the hash space, this needs further thought.