On Sun, Mar 18, 2012 at 12:51 PM, Bela Ban <bban(a)redhat.com> wrote:
Let's take a look at an example:
- Nodes are N1-N5, keys are 1-10
With the current state transfer, we'd serialize the 10 keys into a
buffer of (say) size 10. That buffer is then sent to all 5 nodes, so we
have 5 messages of size 10, for a total of 50.
Now say the keys are only sent to the nodes which should store them:
- N1: 1, 5, 9
- N2: 1, 2, 5, 6, 9, 10
- N3: 2, 3, 6, 7, 10
- N4: 3, 4, 7, 8
- N5: 4, 8
This is a total of 5 messages of sizes 3, 6, 5, 4 and 2 = 20. The
biggest message has size 6, and the total size is half that of the
current approach.
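The placement in this example can be reproduced with a toy hash wheel. Note that the modulo rule below (primary = ((key - 1) mod 4) + 1, backup = the next node) is purely an assumption chosen to match the lists above, not Infinispan's actual consistent hash:

```java
import java.util.*;

// Toy model of the example: 10 keys, 5 nodes, numOwners = 2.
public class KeyPlacement {
    static final int KEYS = 10;

    // Keys that node n (1..5) should receive during state transfer.
    static SortedSet<Integer> keysFor(int n) {
        SortedSet<Integer> keys = new TreeSet<>();
        for (int k = 1; k <= KEYS; k++) {
            int primary = ((k - 1) % 4) + 1;   // assumed placement rule
            int backup = primary + 1;
            if (n == primary || n == backup)
                keys.add(k);
        }
        return keys;
    }

    public static void main(String[] args) {
        int total = 0;
        for (int n = 1; n <= 5; n++) {
            SortedSet<Integer> keys = keysFor(n);
            total += keys.size();
            System.out.println("N" + n + ": " + keys);
        }
        System.out.println("total keys shipped = " + total); // 3+6+5+4+2 = 20
    }
}
```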
Wrt serialization: we could maintain a simple hashmap of the keys
already serialized, e.g.
- N1: we serialize keys 1, 5 and 9 and add them to the hashmap
- N2: we serialize keys 2, 6 and 10 and add them to the hashmap. Keys 1,
5 and 9 have already been serialized and the byte buffer of their values
can just be reused from the hashmap.
If we do this intelligently, we could sort this temp hashmap (we do know
the hashwheel after all, don't we) and remove keys/values that have
already been sent, as they're not going to be used again. This would
reduce the max memory spike for the temp hashmap.
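A minimal sketch of that temp hashmap; serialize() here is just a placeholder for the real marshaller, and the counter shows that the keys N1 and N2 have in common are serialized only once:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Sketch of "serialize each key once": a temp map caches the serialized
// form of each entry, and per-node messages reuse the cached buffers.
public class SerializationCache {
    static int serializations = 0;                     // counts real marshalling work
    static final Map<Integer, byte[]> cache = new HashMap<>();

    static byte[] serialize(Object o) {                // placeholder marshaller
        return o.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] serializedForm(int key) {
        return cache.computeIfAbsent(key, k -> {
            serializations++;
            return serialize(k);
        });
    }

    // Build the payload for one node from the cached key buffers.
    static List<byte[]> messageFor(List<Integer> keys) {
        List<byte[]> buffers = new ArrayList<>();
        for (int k : keys)
            buffers.add(serializedForm(k));            // reference, no copy
        return buffers;
    }

    public static void main(String[] args) {
        messageFor(List.of(1, 5, 9));                  // N1: serializes 1, 5, 9
        messageFor(List.of(1, 2, 5, 6, 9, 10));        // N2: 1, 5, 9 hit the cache
        System.out.println("serializations = " + serializations); // 6, not 9
    }
}
```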
> So when the cluster is small and all the owners would receive most of
> the keys anyway, the savings in deserialization are offset by the
> increased costs in serialization.
You don't serialize the same key twice, see above. You just copy the
buffer into the resulting buffer; this is very efficient. With
scatter/gather (NIO2, JGroups 3.3), we wouldn't even have to copy the
buffers, but could pass all of them into the JGroups message.
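To illustrate the scatter/gather point: a gathering write takes an array of buffers and concatenates them on the way out, so the cached per-key buffers never need to be copied into one big merged array. A plain FileChannel stands in for the transport in this sketch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// java.nio gathering write: the channel consumes ByteBuffer[] directly,
// with no intermediate merged buffer on our side.
public class GatherDemo {
    static long gatherWrite(ByteBuffer[] parts, Path target) throws IOException {
        try (FileChannel ch = FileChannel.open(target, StandardOpenOption.WRITE)) {
            return ch.write(parts);   // gathering write of all buffers
        }
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer[] parts = {        // three separately serialized entries
            ByteBuffer.wrap("key1".getBytes()),
            ByteBuffer.wrap("key5".getBytes()),
            ByteBuffer.wrap("key9".getBytes())
        };
        Path tmp = Files.createTempFile("gather", ".bin");
        System.out.println("wrote " + gatherWrite(parts, tmp) + " bytes");
        Files.delete(tmp);
    }
}
```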
OK, you can avoid serializing entries multiple times (although I
think that will require some pretty big changes to our command
serialization framework).
But you'll still have numOwners+1 copies of the entries in memory at
the same time on the command originator...
>> If we have a TX which touches a lot of keys, unless we break the key set
>> down into their associated targets, we would have to potentially
>> transfer a *big* state !
>>
>
> If we serialize the keys separately for each target, the originator
> will have to hold in memory numOwners * numKeys serialized entries
> until the messages are garbage collected.
That's 20, correct. But with the current approach, it is 50 (key set of
10 sent to all 5 nodes).
Actually no: we use the same buffer in all 5 messages, so it's just 10.
> With the current approach,
> we only have to hold in memory numKeys serialized entries on the
> originator - at the expense of extra load on the recipients.
You're sending the 10 serialized keys to 5 nodes, so you'll have 5
messages with 10 keys each in memory; that's worse than my suggested
approach.
There is only one copy of the buffer on the originator, although there
are 5 copies on the network and on the recipients.
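To make the accounting in the last few mails concrete, here is a back-of-envelope sketch using the per-node counts from the example at the top of the thread: one shared buffer means the originator holds 10 serialized entries no matter how many messages reference it, while naive per-target buffers (no sharing) hold the per-node sum of 20.

```java
// Originator-side memory, measured in serialized entries (N1..N5 example).
public class MemoryAccounting {
    // Per-target buffers without sharing: the originator holds the sum.
    static int perTargetTotal(int[] perNodeKeyCounts) {
        int sum = 0;
        for (int n : perNodeKeyCounts) sum += n;
        return sum;
    }

    public static void main(String[] args) {
        int numKeys = 10;                 // shared buffer: one copy of all keys
        int[] perNode = {3, 6, 5, 4, 2};  // per-target key counts from the example
        System.out.println("shared buffer, originator holds: " + numKeys);
        System.out.println("per-target buffers, originator holds: "
                + perTargetTotal(perNode)); // 20, unless buffers are shared
    }
}
```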