[infinispan-dev] Equivalence vs. MurmurHash3

Mon Nov 23 10:11:06 EST 2015

On Mon, Nov 23, 2015 at 2:16 PM, Radim Vansa <rvansa at redhat.com> wrote:
> On 11/23/2015 01:07 PM, Sanne Grinovero wrote:
>> +1
>>
>> See also https://issues.jboss.org/browse/ISPN-3905 ; although I was
>> mostly concerned on it allocating on a (very) hot path and didn't look
>> at it in terms of compatibility modes.
> Yes, due to compatibility we cannot remove the UTF-8 encoding from
> MurmurHash3 since compatibility with clients (in other languages)
> depends on this as well, though, we could theoretically merge encoding
> and hashing into one function - UTF-8 encoder implementation looks quite
> simple (60 lines of code?) - could be worth it even if used only for
> server. However, my proposal was to remove that computation from
> hot-code path completely.
>
>>
>> Rather than xorring with "magic numbers" don't you think we
>> Equivalence implementation should be able to rule on that?
>
> We shouldn't require user to provide a pair of hashCode functions, I
> don't think that would work well in practice. Though, we could make the
> second function Java 8-default method (with return hashCode() ^
> 0xWHATEVER), still allowing it to be overridable.

The JDK team learned long ago to use a spreader on top of
user-supplied hashCode() implementations, as user-supplied hash codes
are usually very clustered. In the case of strings, many times a
common prefix makes up most of the key, and the hash codes of the keys
are again clustered. A XOR with a magic value would definitely not
help with the clustering issue, that's why java.util.HashMap doesn't
use it.

Note that our consistent hashes map adjacent keys to the same segment:
we use hash / buckets, whereas HashMap uses hash % buckets. So we
require a better spread across the hash space than HashMap does, and
because of that I think we really need MurmurHash3. Still, we could
change it to work on the result of Equivalence.hashCode(Object),
instead of dealing with the contents of byte[] and String directly,
but maintaining compatibility with old clients may not be possible.

Regarding client-server divergences, I think we require compatibility
mode to be enabled in order to access a cache both via HotRod and with
the embedded API (because the server casts keys and values to byte[]).
That means the distribution interceptor sees only the unmarshalled
key, and getting the same hash code from the marshalled byte[] (on the
client) and the unmarshalled Object (in the distribution interceptor)
is going to be quite complex - either with a custom Object.hashCode()
implementation, or with a custom Equivalence.hash(). I think the only
way around this would be to change compatibility mode to store keys
and values as byte[].

Cheers
Dan

>
> Radim
>
>>
>> On 23 November 2015 at 10:26, Radim Vansa <rvansa at redhat.com> wrote:
>>> Hi guys,
>>>
>>> I have noticed that even in library mode we use MurmurHash3 to find out
>>> the segment for particular key. For strings, this involves encoding into
>>> UTF-8 and computation of hashCode, instead of just reading the cached
>>> value in string. Common objects just remix the bits of hashCode. When
>>> user provides custom Equivalence with non-default hashCode, it is not
>>> used to determine the segment.
>>>
>>> I think that in library mode we should rather use Equivalence.hashCode,
>>> maybe XORed with some magic number so that there are less collisions in
>>> DataContainer.
>>>
>>> If we simply replaced the function in CH, we would break the case when
>>> user starts HR server on top of library mode, as the clients expect key
>>> location based on MurmurHash3. ATM user only has to set
>>> AnyServerEquivalence for keys in DC; we would need to detect
>>> configuration with server equivalence and set CH function to MH3, and
>>> probably also log some warning if the equivalence is set to unknown
>>> class and CH function is not specified.
>>>
>>> WDYT?
>>>
>>> Radim
>>>
>>> --
>>> Radim Vansa <rvansa at redhat.com>
>>> JBoss Performance Team
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
> --
> Radim Vansa <rvansa at redhat.com>
> JBoss Performance Team
>
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev