Re: [infinispan-dev] Equivalence vs. MurmurHash3

Monday, 23 November 2015

On Mon, Nov 23, 2015 at 4:59 PM, Radim Vansa <rvansa(a)redhat.com&gt; wrote:
...
 On 11/23/2015 04:11 PM, Dan Berindei wrote:
> On Mon, Nov 23, 2015 at 2:16 PM, Radim Vansa <rvansa(a)redhat.com&gt; wrote:
>> On 11/23/2015 01:07 PM, Sanne Grinovero wrote:
>>> +1
>>>
>>> See also https://issues.jboss.org/browse/ISPN-3905 ; although I was
>>> mostly concerned on it allocating on a (very) hot path and didn't look
>>> at it in terms of compatibility modes.
>> Yes, due to compatibility we cannot remove the UTF-8 encoding from
>> MurmurHash3 since compatibility with clients (in other languages)
>> depends on this as well, though, we could theoretically merge encoding
>> and hashing into one function - UTF-8 encoder implementation looks quite
>> simple (60 lines of code?) - could be worth it even if used only for
>> server. However, my proposal was to remove that computation from
>> hot-code path completely.
>>
>>> Rather than xorring with "magic numbers" don't you think we
>>> Equivalence implementation should be able to rule on that?
>> We shouldn't require user to provide a pair of hashCode functions, I
>> don't think that would work well in practice. Though, we could make the
>> second function Java 8-default method (with return hashCode() ^
>> 0xWHATEVER), still allowing it to be overridable.
> The JDK team learned long ago to use a spreader on top of
> user-supplied hashCode() implementations, as user-supplied hash codes
> are usually very clustered. In the case of strings, many times a
> common prefix makes up most of the key, and the hash codes of the keys
> are again clustered. A XOR with a magic value would definitely not
> help with the clustering issue, that's why java.util.HashMap doesn't
> use it.
>
> Note that our consistent hashes map adjacent keys to the same segment:
> we use hash / buckets, whereas HashMap uses hash % buckets. So we
> require a better spread across the hash space than HashMap does, and
> because of that I think we really need MurmurHash3. Still, we could
> change it to work on the result of Equivalence.hashCode(Object),
> instead of dealing with the contents of byte[] and String directly,
> but maintaining compatibility with old clients may not be possible.

 Okay, I don't mind using the int32-spreader, that's just a better
 version of XOR with a couple more bitshifts. But do we need two
 spreaders, one for CH and another for DC, or is it sufficient that CH
 will use division and DC modulo? 
I don't know, I've never tried it :)

...

>
> Regarding client-server divergences, I think we require compatibility
> mode to be enabled in order to access a cache both via HotRod and with
> the embedded API (because the server casts keys and values to byte[]).
> That means the distribution interceptor sees only the unmarshalled
> key, and getting the same hash code from the marshalled byte[] (on the
> client) and the unmarshalled Object (in the distribution interceptor)
> is going to be quite complex - either with a custom Object.hashCode()
> implementation, or with a custom Equivalence.hash(). I think the only
> way around this would be to change compatibility mode to store keys
> and values as byte[].

 I was not talking about combined HR + embedded access, but HR-only
 access but without the server, therefore configuring & starting cache
 manager + HR server on their own. 
Just curious, what's the use case?

...
 And I've not realized that in remote case the server deals only
with
 byte[]s, so in non-compatibility mode we would just use

 o instanceof byte[] ? MH3.hash((byte[]) o) :
 MH3.hash(equivalence.hashCode(o)) 
But if we say we don't support non-byte[] keys in a server without
compatibility mode enabled (and I believe we do), we don't need the
second part at all.

...

 (there is no point in letting Equivalence compute the hashcode from
 byte[] and then just spread it). In compatibility case, we need to keep
 the existing behaviour, because they expect us to do the unmarshall ->
 encode to UTF-8 -> MH3. But that's already a configuration switch, so no
 heuristics are needed. 
Looks like targetting in compatibility mode is already broken [1], so
keeping the existing behaviour may not be necessary after all.

[1] https://issues.jboss.org/browse/ISPN-5981

Dan

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [infinispan-dev] Equivalence vs. MurmurHash3