[infinispan-dev] Let me understand DIST
Bela Ban
bban at redhat.com
Sun Mar 18 06:51:05 EDT 2012
On 3/16/12 12:42 PM, Dan Berindei wrote:
>> In my experience, serialization is never the bottleneck, it is
>> *de-serialization*, so this shouldn't be a problem. Also, I don't see
>> why serialization of keys [1..9] shouldn't take roughly the same time as
>> serialization of keys [1..3], [4..6], [7..9].
>>
>
> Bela, we would split the keys into [1..3], [4..6], [7..9] only if numOwners==1.
> With numOwners==2, it would more likely be something like this:
> [1..6], [4..9], [7..9, 1..3]. Because each key has 2 owners, it
> has to be serialized twice (and kept in memory on the originator in 3
> copies).
> With numOwners==3 the serialization cost will be triple, both in CPU
> and in memory usage.
> And so on...
Let's take a look at an example:
- Nodes are N1-N5, keys are 1-10
With the current state transfer, we'd serialize the 10 keys into a
buffer of size (say) 10. That buffer is then sent to (say) all 5 nodes,
so we have 5 messages of size 10 = 50 in total.
Now say the keys are only sent to the nodes which should store them:
- N1: 1, 5, 9
- N2: 1, 2, 5, 6, 9, 10
- N3: 2, 3, 6, 7, 10
- N4: 3, 4, 7, 8
- N5: 4, 8
This is a total of 5 messages of sizes 3, 6, 5, 4 and 2 = 20. The
biggest message has size 6, and the total size is less than half that
of the current approach (20 vs. 50).
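To make the arithmetic concrete, here's a toy Java sketch. The wheel
placement below is hypothetical (not Infinispan's actual ConsistentHash)
and yields a slightly different per-node split than my example above,
but the totals are the point: numOwners * numKeys = 20 targeted vs.
numNodes * numKeys = 50 broadcast.

import java.util.*;

public class StateTransferSizes {
    public static void main(String[] args) {
        int numNodes = 5, numKeys = 10, numOwners = 2;
        Map<Integer, List<Integer>> perNode = new TreeMap<>();

        for (int key = 1; key <= numKeys; key++) {
            // hypothetical placement: key k lands on node (k-1) % numNodes,
            // with the backup copy on the next node along the wheel
            for (int owner = 0; owner < numOwners; owner++) {
                int node = (key - 1 + owner) % numNodes + 1; // N1..N5
                perNode.computeIfAbsent(node, n -> new ArrayList<>()).add(key);
            }
        }

        int targetedTotal = 0;
        for (Map.Entry<Integer, List<Integer>> e : perNode.entrySet()) {
            System.out.println("N" + e.getKey() + ": " + e.getValue());
            targetedTotal += e.getValue().size();
        }
        System.out.println("targeted total:  " + targetedTotal);      // 20
        System.out.println("broadcast total: " + numNodes * numKeys); // 50
    }
}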
Wrt serialization: we could maintain a simple hashmap of the keys
serialized so far, e.g.
- N1: we serialize keys 1, 5 and 9 and add them to the hashmap
- N2: we serialize keys 2, 6 and 10 and add them to the hashmap. Keys 1,
5 and 9 have already been serialized, and the byte buffers of their
values can simply be reused from the hashmap.
If we do this intelligently, we could sort this temp hashmap (we do know
the hashwheel, after all) and remove keys/values that have already been
sent to all their owners, as they're not going to be used again. This
would reduce the max memory spike of the temp hashmap.
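In code, such a cache could be as simple as the following minimal
sketch; marshal() is a placeholder for whatever marshaller Infinispan
actually uses, the point being that each entry is serialized once and
its bytes are reused for every owner.

import java.util.HashMap;
import java.util.Map;

class SerializedEntryCache {
    private final Map<Object, byte[]> cache = new HashMap<>();

    // Serialize the entry once; subsequent owners reuse the cached bytes.
    byte[] getOrSerialize(Object key, Object value) {
        return cache.computeIfAbsent(key, k -> marshal(k, value));
    }

    // Drop an entry once it has been sent to all of its owners; this is
    // what caps the temporary memory spike.
    void release(Object key) {
        cache.remove(key);
    }

    private static byte[] marshal(Object key, Object value) {
        // placeholder: real code would use Infinispan's marshaller
        return (key + "=" + value).getBytes();
    }
}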
> So when the cluster is small and all the owners would receive most of
> the keys anyway, the savings in deserialization are offset by the
> increased costs in serialization.
You don't serialize the same key twice (see above); you just copy the
cached buffer into the resulting message buffer, which is very
efficient. With scatter/gather (NIO2, JGroups 3.3), we wouldn't even
have to copy the buffers, but could pass all of them into the JGroups
message.
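To illustrate with plain NIO (the JGroups 3.3 API may look different;
GatheringByteChannel is just the standard JDK way to express a
gathering write): the per-key buffers from the cache are handed to the
channel as an array, so no aggregate copy is ever made.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;

class GatherSend {
    // Wrap each cached byte[] (no copy) and send them all in one
    // gathering write.
    static void send(GatheringByteChannel channel, byte[][] cachedBuffers)
            throws IOException {
        ByteBuffer[] views = new ByteBuffer[cachedBuffers.length];
        for (int i = 0; i < cachedBuffers.length; i++)
            views[i] = ByteBuffer.wrap(cachedBuffers[i]); // wrap, don't copy
        while (hasRemaining(views))
            channel.write(views); // gathering write over all buffers
    }

    private static boolean hasRemaining(ByteBuffer[] bufs) {
        for (ByteBuffer b : bufs)
            if (b.hasRemaining())
                return true;
        return false;
    }
}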
>> If we have a TX which touches a lot of keys, unless we break the key set
>> down into their associated targets, we would have to potentially
>> transfer a *big* state !
>>
>
> If we serialize the keys separately for each target, the originator
> will have to hold in memory numOwners * numKeys serialized entries
> until the messages are garbage collected.
That's 20, correct. But with the current approach, it is 50 (key set of
10 sent to all 5 nodes).
> With the current approach,
> we only have to hold in memory numKeys serialized entries on the
> originator - at the expense of extra load on the recipients.
You're sending the 10 serialized keys to 5 nodes, so you'll have 5
messages with 10 keys each in memory; that's worse than my suggested
approach.
Hmm, regardless of the size of the messages held in memory, there *are*
messages in memory that should be removed. If we use UNICAST, that's not
a problem: as soon as the sender receives an ack for a message M, M is
removed from the retransmission table and can be GC'ed. With UNICAST2,
however, it's different: stability only happens after a number of bytes
have been received, or after a timeout has kicked in.
There is a way, though, to trigger stability programmatically, and we
should do this after events (e.g. state transfer) in which we expect
large messages to be in the retransmission table. If you guys go with
UNICAST2, let me show you how to do this (contact me offline)...
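Roughly, the shape of it would be something like the sketch below,
assuming the sendStableMessages() operation that UNICAST2 exposes (also
reachable via JMX); the exact recipe is what I'd walk you through.

import org.jgroups.JChannel;
import org.jgroups.protocols.UNICAST2;

class StabilityTrigger {
    // After a state transfer, ask UNICAST2 to run a stability round now,
    // so the large messages in the retransmission table can be purged.
    static void triggerStability(JChannel channel) {
        UNICAST2 ucast = (UNICAST2) channel.getProtocolStack()
                                           .findProtocol(UNICAST2.class);
        if (ucast != null)
            ucast.sendStableMessages();
    }
}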
Cheers,
--
Bela Ban, JGroups lead (http://www.jgroups.org)