[infinispan-dev] Let me understand DIST

Thu Mar 22 10:10:21 EDT 2012

On Sun, Mar 18, 2012 at 12:51 PM, Bela Ban <bban at redhat.com> wrote:
>
> Let's take a look at an example:
>
> - Nodes are N1-N5, keys are 1-10
>
> With the current state transfer, we'd serialize the 10 keys into a
> buffer of (say) 10. That buffer is then sent to (say) all nodes, so we
> have 5 messages of 10 = 50.
>
> Now say the keys are only sent to the nodes which should store them:
> - N1: 1, 5, 9
> - N2: 1,2, 5,6, 9,10
> - N3: 2,3 6,7, 10
> - N4: 3,4 7,8
> - N5: 4,8
>
> This is a total of 5 messages of sizes 3, 6, 5, 4 and 2 = 20. The
> biggest message is 6, and the total size of half of the current approach.
>
> Wrt serialization: we could maintain a simple hashmap of the current
> keys serialized, e.g.
> - N1: we serialize keys 1, 5 and 9 and add them to the hashmap
> - N2: we serialize keys 2, 6 and 10 and add them to the hashmap. Keys 1,
> 5 and 9 have already been serialized and the byte buffer of their values
> can just be reused from the hashmap.
>
>
> If we do this intelligently, we could sort this temp hashmap (we do know
> the hashwheel after all, don't we), and remove keys/values that were
> already sent, as they're not going to be used again. This would reduce
> the max memory spike for the temp hashmap.
>
>
>> So when the cluster is small and all the owners would receive most of
>> the keys anyway, the savings in deserialization are offset by the
>> increased costs in serialization.
>
>
> You don't serialize the same key twice, see above. You just copy the
> buffer into the resulting buffer, this is very efficient. With
> scatter/gather (NIO2, JGroups 3.3), we would not even have to copy the
> buffers, but pass all of them into the JGroups message.
>

Ok, you can avoid the multiple serialization of entries (although I
think that will need some pretty big changes to our command
serialization framework).
But you'll still have numOwners+1 copies of the entries in memory at
the same time on the command originator...

>
>>> If we have a TX which touches a lot of keys, unless we break the key set
>>> down into their associated targets, we would have to potentially
>>> transfer a *big* state !
>>>
>>
>> If we serialize the keys separately for each target, the originator
>> will have to hold in memory numOwners * numKeys serialized entries
>> until the messages are garbage collected.
>
>
> That's 20, correct. But with the current approach, it is 50 (key set of
> 10 sent to all 5 nodes).
>

Actually no: we use the same buffer in all 5 messages, so it's just 10.

>
>>  With the current approach,
>> we only have to hold in memory numKeys serialized entries on the
>> originator - at the expense of extra load on the recipients.
>
>
> You're sending the 10 serialized keys to 5 nodes, so you'll have 5
> messages with 10 each in memory, that's worse then my suggested approach.
>

There is only one copy of the buffer on the originator, although there
are 5 copies on the network and on the recipients.