[infinispan-dev] Equivalence vs. MurmurHash3

Mon Nov 23 10:59:48 EST 2015

On 11/23/2015 04:11 PM, Dan Berindei wrote:
> On Mon, Nov 23, 2015 at 2:16 PM, Radim Vansa <rvansa at redhat.com> wrote:
>> On 11/23/2015 01:07 PM, Sanne Grinovero wrote:
>>> +1
>>>
>>> See also https://issues.jboss.org/browse/ISPN-3905 ; although I was
>>> mostly concerned on it allocating on a (very) hot path and didn't look
>>> at it in terms of compatibility modes.
>> Yes, due to compatibility we cannot remove the UTF-8 encoding from
>> MurmurHash3 since compatibility with clients (in other languages)
>> depends on this as well, though, we could theoretically merge encoding
>> and hashing into one function - UTF-8 encoder implementation looks quite
>> simple (60 lines of code?) - could be worth it even if used only for
>> server. However, my proposal was to remove that computation from
>> hot-code path completely.
>>
>>> Rather than xorring with "magic numbers" don't you think we
>>> Equivalence implementation should be able to rule on that?
>> We shouldn't require user to provide a pair of hashCode functions, I
>> don't think that would work well in practice. Though, we could make the
>> second function Java 8-default method (with return hashCode() ^
>> 0xWHATEVER), still allowing it to be overridable.
> The JDK team learned long ago to use a spreader on top of
> user-supplied hashCode() implementations, as user-supplied hash codes
> are usually very clustered. In the case of strings, many times a
> common prefix makes up most of the key, and the hash codes of the keys
> are again clustered. A XOR with a magic value would definitely not
> help with the clustering issue, that's why java.util.HashMap doesn't
> use it.
>
> Note that our consistent hashes map adjacent keys to the same segment:
> we use hash / buckets, whereas HashMap uses hash % buckets. So we
> require a better spread across the hash space than HashMap does, and
> because of that I think we really need MurmurHash3. Still, we could
> change it to work on the result of Equivalence.hashCode(Object),
> instead of dealing with the contents of byte[] and String directly,
> but maintaining compatibility with old clients may not be possible.

Okay, I don't mind using the int32-spreader, that's just a better 
version of XOR with a couple more bitshifts. But do we need two 
spreaders, one for CH and another for DC, or is it sufficient that CH 
will use division and DC modulo?

>
> Regarding client-server divergences, I think we require compatibility
> mode to be enabled in order to access a cache both via HotRod and with
> the embedded API (because the server casts keys and values to byte[]).
> That means the distribution interceptor sees only the unmarshalled
> key, and getting the same hash code from the marshalled byte[] (on the
> client) and the unmarshalled Object (in the distribution interceptor)
> is going to be quite complex - either with a custom Object.hashCode()
> implementation, or with a custom Equivalence.hash(). I think the only
> way around this would be to change compatibility mode to store keys
> and values as byte[].

I was not talking about combined HR + embedded access, but HR-only 
access but without the server, therefore configuring & starting cache 
manager + HR server on their own.
And I've not realized that in remote case the server deals only with 
byte[]s, so in non-compatibility mode we would just use

o instanceof byte[] ? MH3.hash((byte[]) o) : 
MH3.hash(equivalence.hashCode(o))

(there is no point in letting Equivalence compute the hashcode from 
byte[] and then just spread it). In compatibility case, we need to keep 
the existing behaviour, because they expect us to do the unmarshall -> 
encode to UTF-8 -> MH3. But that's already a configuration switch, so no 
heuristics are needed.

Radim

>
> Cheers
> Dan
>
>> Radim
>>
>>> On 23 November 2015 at 10:26, Radim Vansa <rvansa at redhat.com> wrote:
>>>> Hi guys,
>>>>
>>>> I have noticed that even in library mode we use MurmurHash3 to find out
>>>> the segment for particular key. For strings, this involves encoding into
>>>> UTF-8 and computation of hashCode, instead of just reading the cached
>>>> value in string. Common objects just remix the bits of hashCode. When
>>>> user provides custom Equivalence with non-default hashCode, it is not
>>>> used to determine the segment.
>>>>
>>>> I think that in library mode we should rather use Equivalence.hashCode,
>>>> maybe XORed with some magic number so that there are less collisions in
>>>> DataContainer.
>>>>
>>>> If we simply replaced the function in CH, we would break the case when
>>>> user starts HR server on top of library mode, as the clients expect key
>>>> location based on MurmurHash3. ATM user only has to set
>>>> AnyServerEquivalence for keys in DC; we would need to detect
>>>> configuration with server equivalence and set CH function to MH3, and
>>>> probably also log some warning if the equivalence is set to unknown
>>>> class and CH function is not specified.
>>>>
>>>> WDYT?
>>>>
>>>> Radim
>>>>
>>>> --
>>>> Radim Vansa <rvansa at redhat.com>
>>>> JBoss Performance Team
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> --
>> Radim Vansa <rvansa at redhat.com>
>> JBoss Performance Team
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

-- 
Radim Vansa <rvansa at redhat.com>
JBoss Performance Team