[infinispan-dev] Equivalence vs. MurmurHash3

Mon Nov 23 13:57:09 EST 2015

On Mon, Nov 23, 2015 at 4:59 PM, Radim Vansa <rvansa at redhat.com> wrote:
> On 11/23/2015 04:11 PM, Dan Berindei wrote:
>> On Mon, Nov 23, 2015 at 2:16 PM, Radim Vansa <rvansa at redhat.com> wrote:
>>> On 11/23/2015 01:07 PM, Sanne Grinovero wrote:
>>>> +1
>>>>
>>>> See also https://issues.jboss.org/browse/ISPN-3905 ; although I was
>>>> mostly concerned on it allocating on a (very) hot path and didn't look
>>>> at it in terms of compatibility modes.
>>> Yes, due to compatibility we cannot remove the UTF-8 encoding from
>>> MurmurHash3 since compatibility with clients (in other languages)
>>> depends on this as well, though, we could theoretically merge encoding
>>> and hashing into one function - UTF-8 encoder implementation looks quite
>>> simple (60 lines of code?) - could be worth it even if used only for
>>> server. However, my proposal was to remove that computation from
>>> hot-code path completely.
>>>
>>>> Rather than xorring with "magic numbers" don't you think we
>>>> Equivalence implementation should be able to rule on that?
>>> We shouldn't require user to provide a pair of hashCode functions, I
>>> don't think that would work well in practice. Though, we could make the
>>> second function Java 8-default method (with return hashCode() ^
>>> 0xWHATEVER), still allowing it to be overridable.
>> The JDK team learned long ago to use a spreader on top of
>> user-supplied hashCode() implementations, as user-supplied hash codes
>> are usually very clustered. In the case of strings, many times a
>> common prefix makes up most of the key, and the hash codes of the keys
>> are again clustered. A XOR with a magic value would definitely not
>> help with the clustering issue, that's why java.util.HashMap doesn't
>> use it.
>>
>> Note that our consistent hashes map adjacent keys to the same segment:
>> we use hash / buckets, whereas HashMap uses hash % buckets. So we
>> require a better spread across the hash space than HashMap does, and
>> because of that I think we really need MurmurHash3. Still, we could
>> change it to work on the result of Equivalence.hashCode(Object),
>> instead of dealing with the contents of byte[] and String directly,
>> but maintaining compatibility with old clients may not be possible.
>
> Okay, I don't mind using the int32-spreader, that's just a better
> version of XOR with a couple more bitshifts. But do we need two
> spreaders, one for CH and another for DC, or is it sufficient that CH
> will use division and DC modulo?

I don't know, I've never tried it :)

>
>>
>> Regarding client-server divergences, I think we require compatibility
>> mode to be enabled in order to access a cache both via HotRod and with
>> the embedded API (because the server casts keys and values to byte[]).
>> That means the distribution interceptor sees only the unmarshalled
>> key, and getting the same hash code from the marshalled byte[] (on the
>> client) and the unmarshalled Object (in the distribution interceptor)
>> is going to be quite complex - either with a custom Object.hashCode()
>> implementation, or with a custom Equivalence.hash(). I think the only
>> way around this would be to change compatibility mode to store keys
>> and values as byte[].
>
> I was not talking about combined HR + embedded access, but HR-only
> access but without the server, therefore configuring & starting cache
> manager + HR server on their own.

Just curious, what's the use case?

> And I've not realized that in remote case the server deals only with
> byte[]s, so in non-compatibility mode we would just use
>
> o instanceof byte[] ? MH3.hash((byte[]) o) :
> MH3.hash(equivalence.hashCode(o))

But if we say we don't support non-byte[] keys in a server without
compatibility mode enabled (and I believe we do), we don't need the
second part at all.

>
> (there is no point in letting Equivalence compute the hashcode from
> byte[] and then just spread it). In compatibility case, we need to keep
> the existing behaviour, because they expect us to do the unmarshall ->
> encode to UTF-8 -> MH3. But that's already a configuration switch, so no
> heuristics are needed.

Looks like targetting in compatibility mode is already broken [1], so
keeping the existing behaviour may not be necessary after all.

[1] https://issues.jboss.org/browse/ISPN-5981

Dan