[infinispan-dev] Let me understand DIST
Bela Ban
bban at redhat.com
Sun Mar 18 06:51:05 EDT 2012
On 3/16/12 12:42 PM, Dan Berindei wrote:
>> In my experience, serialization is never the bottleneck, it is
>> *de-serialization*, so this shouldn't be a problem. Also, I don't see
>> why serialization of keys [1..9] shouldn't take roughly the same time as
>> serialization of keys [1..3], [4..6], [7..9].
>>
>
> Bela, we would split the keys into [1..3], [4..6], [7..9] only if numOwners==1.
> With numOwners==2, it would more likely be something like this:
> [1..6], [4..9], [7..9, 1..3]. Because each key has 2 owners, it
> has to be serialized twice (and kept in memory on the originator in 3
> copies).
> With numOwners==3 the serialization cost will be triple, both in CPU
> and in memory usage.
> And so on...
Let's take a look at an example:
- Nodes are N1-N5, keys are 1-10
With the current state transfer, we'd serialize the 10 keys into a
buffer of size (say) 10. That buffer is then sent to (say) all 5 nodes,
so we have 5 messages of size 10 = 50 in total.
Now say the keys are only sent to the nodes which should store them:
- N1: 1, 5, 9
- N2: 1, 2, 5, 6, 9, 10
- N3: 2, 3, 6, 7, 10
- N4: 3, 4, 7, 8
- N5: 4, 8
This is a total of 5 messages of sizes 3, 6, 5, 4 and 2 = 20. The
biggest message has size 6, and the total size is less than half that
of the current approach (20 vs. 50).
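To make the arithmetic concrete, here's a toy Java sketch. The wheel
placement below is hypothetical (not Infinispan's actual ConsistentHash)
and yields a slightly different per-node split than my example above,
but the totals are the point: numOwners * numKeys = 20 targeted vs.
numNodes * numKeys = 50 broadcast.

import java.util.*;

public class StateTransferSizes {
    public static void main(String[] args) {
        int numNodes = 5, numKeys = 10, numOwners = 2;
        Map<Integer, List<Integer>> perNode = new TreeMap<>();

        for (int key = 1; key <= numKeys; key++) {
            // hypothetical placement: key k lands on node (k-1) % numNodes,
            // with the backup copy on the next node along the wheel
            for (int owner = 0; owner < numOwners; owner++) {
                int node = (key - 1 + owner) % numNodes + 1; // N1..N5
                perNode.computeIfAbsent(node, n -> new ArrayList<>()).add(key);
            }
        }

        int targetedTotal = 0;
        for (Map.Entry<Integer, List<Integer>> e : perNode.entrySet()) {
            System.out.println("N" + e.getKey() + ": " + e.getValue());
            targetedTotal += e.getValue().size();
        }
        System.out.println("targeted total:  " + targetedTotal);      // 20
        System.out.println("broadcast total: " + numNodes * numKeys); // 50
    }
}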
Wrt serialization: we could maintain a simple hashmap of the keys
serialized so far, e.g.
- N1: we serialize keys 1, 5 and 9 and add them to the hashmap
- N2: we serialize keys 2, 6 and 10 and add them to the hashmap. Keys 1,
5 and 9 have already been serialized, and the byte buffers of their
values can simply be reused from the hashmap.
If we do this intelligently, we could sort this temp hashmap (we do know
the hashwheel, after all) and remove keys/values that have already been
sent to all their owners, as they're not going to be used again. This
would reduce the max memory spike of the temp hashmap.
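In code, such a cache could be as simple as the following minimal
sketch; marshal() is a placeholder for whatever marshaller Infinispan
actually uses, the point being that each entry is serialized once and
its bytes are reused for every owner.

import java.util.HashMap;
import java.util.Map;

class SerializedEntryCache {
    private final Map<Object, byte[]> cache = new HashMap<>();

    // Serialize the entry once; subsequent owners reuse the cached bytes.
    byte[] getOrSerialize(Object key, Object value) {
        return cache.computeIfAbsent(key, k -> marshal(k, value));
    }

    // Drop an entry once it has been sent to all of its owners; this is
    // what caps the temporary memory spike.
    void release(Object key) {
        cache.remove(key);
    }

    private static byte[] marshal(Object key, Object value) {
        // placeholder: real code would use Infinispan's marshaller
        return (key + "=" + value).getBytes();
    }
}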
> So when the cluster is small and all the owners would receive most of
> the keys anyway, the savings in deserialization are offset by the
> increased costs in serialization.
You don't serialize the same key twice (see above); you just copy the
cached buffer into the resulting message buffer, which is very
efficient. With scatter/gather (NIO2, JGroups 3.3), we wouldn't even
have to copy the buffers, but could pass all of them into the JGroups
message.
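To illustrate with plain NIO (the JGroups 3.3 API may look different;
GatheringByteChannel is just the standard JDK way to express a
gathering write): the per-key buffers from the cache are handed to the
channel as an array, so no aggregate copy is ever made.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;

class GatherSend {
    // Wrap each cached byte[] (no copy) and send them all in one
    // gathering write.
    static void send(GatheringByteChannel channel, byte[][] cachedBuffers)
            throws IOException {
        ByteBuffer[] views = new ByteBuffer[cachedBuffers.length];
        for (int i = 0; i < cachedBuffers.length; i++)
            views[i] = ByteBuffer.wrap(cachedBuffers[i]); // wrap, don't copy
        while (hasRemaining(views))
            channel.write(views); // gathering write over all buffers
    }

    private static boolean hasRemaining(ByteBuffer[] bufs) {
        for (ByteBuffer b : bufs)
            if (b.hasRemaining())
                return true;
        return false;
    }
}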
>> If we have a TX which touches a lot of keys, unless we break the key set
>> down into their associated targets, we would have to potentially
>> transfer a *big* state !
>>
>
> If we serialize the keys separately for each target, the originator
> will have to hold in memory numOwners * numKeys serialized entries
> until the messages are garbage collected.
That's 20, correct. But with the current approach, it is 50 (key set of
10 sent to all 5 nodes).
> With the current approach,
> we only have to hold in memory numKeys serialized entries on the
> originator - at the expense of extra load on the recipients.
You're sending the 10 serialized keys to 5 nodes, so you'll have 5
messages with 10 keys each in memory; that's worse than my suggested
approach.
Hmm, regardless of the size of the messages held in memory, there *are*
messages in memory that should be removed. If we use UNICAST, that's not
a problem: as soon as the sender receives an ack for a message M, M is
removed from the retransmission table and can be GC'ed. With UNICAST2,
however, it's different: stability only happens after a number of bytes
have been received, or after a timeout has kicked in.
There is a way, though, to trigger stability programmatically, and we
should do this after events (e.g. state transfer) in which we expect
large messages to be in the retransmission table. If you guys go with
UNICAST2, let me show you how to do this (contact me offline)...
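Roughly, the shape of it would be something like the sketch below,
assuming the sendStableMessages() operation that UNICAST2 exposes (also
reachable via JMX); the exact recipe is what I'd walk you through.

import org.jgroups.JChannel;
import org.jgroups.protocols.UNICAST2;

class StabilityTrigger {
    // After a state transfer, ask UNICAST2 to run a stability round now,
    // so the large messages in the retransmission table can be purged.
    static void triggerStability(JChannel channel) {
        UNICAST2 ucast = (UNICAST2) channel.getProtocolStack()
                                           .findProtocol(UNICAST2.class);
        if (ucast != null)
            ucast.sendStableMessages();
    }
}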
Cheers,
--
Bela Ban, JGroups lead (http://www.jgroups.org)