[infinispan-dev] Equivalence vs. MurmurHash3

Thu Nov 26 11:04:51 EST 2015

--
Galder Zamarreño
Infinispan, Red Hat

> On 23 Nov 2015, at 16:11, Dan Berindei <dan.berindei at gmail.com> wrote:
> 
> On Mon, Nov 23, 2015 at 2:16 PM, Radim Vansa <rvansa at redhat.com> wrote:
>> On 11/23/2015 01:07 PM, Sanne Grinovero wrote:
>>> +1
>>> 
>>> See also https://issues.jboss.org/browse/ISPN-3905 ; although I was
>>> mostly concerned on it allocating on a (very) hot path and didn't look
>>> at it in terms of compatibility modes.
>> Yes, due to compatibility we cannot remove the UTF-8 encoding from
>> MurmurHash3 since compatibility with clients (in other languages)
>> depends on this as well, though, we could theoretically merge encoding
>> and hashing into one function - UTF-8 encoder implementation looks quite
>> simple (60 lines of code?) - could be worth it even if used only for
>> server. However, my proposal was to remove that computation from
>> hot-code path completely.
>> 
>>> 
>>> Rather than xorring with "magic numbers" don't you think we
>>> Equivalence implementation should be able to rule on that?
>> 
>> We shouldn't require user to provide a pair of hashCode functions, I
>> don't think that would work well in practice. Though, we could make the
>> second function Java 8-default method (with return hashCode() ^
>> 0xWHATEVER), still allowing it to be overridable.
> 
> The JDK team learned long ago to use a spreader on top of
> user-supplied hashCode() implementations, as user-supplied hash codes
> are usually very clustered. In the case of strings, many times a
> common prefix makes up most of the key, and the hash codes of the keys
> are again clustered. A XOR with a magic value would definitely not
> help with the clustering issue, that's why java.util.HashMap doesn't
> use it.
> 
> Note that our consistent hashes map adjacent keys to the same segment:
> we use hash / buckets, whereas HashMap uses hash % buckets. So we
> require a better spread across the hash space than HashMap does, and
> because of that I think we really need MurmurHash3. Still, we could
> change it to work on the result of Equivalence.hashCode(Object),
> instead of dealing with the contents of byte[] and String directly,
> but maintaining compatibility with old clients may not be possible.
> 
> Regarding client-server divergences, I think we require compatibility
> mode to be enabled in order to access a cache both via HotRod and with
> the embedded API (because the server casts keys and values to byte[]).
> That means the distribution interceptor sees only the unmarshalled
> key, and getting the same hash code from the marshalled byte[] (on the
> client) and the unmarshalled Object (in the distribution interceptor)
> is going to be quite complex - either with a custom Object.hashCode()
> implementation, or with a custom Equivalence.hash(). I think the only
> way around this would be to change compatibility mode to store keys
> and values as byte[].

The current compatibility mode set up favours embedded mode since it keeps POJOs in memory, but based on feedback I've got it seems like the use case of compatibility where embedded is involve is not as common. Hence, moving to storing byte[] would be beneficial particularly for those using Hot Rod + REST compatibility modes.

Adrian already requested having the possibility to do this [1] and it's currently assigned to him :)

Cheers,

[1] https://issues.jboss.org/browse/ISPN-3663

> 
> Cheers
> Dan
> 
>> 
>> Radim
>> 
>>> 
>>> On 23 November 2015 at 10:26, Radim Vansa <rvansa at redhat.com> wrote:
>>>> Hi guys,
>>>> 
>>>> I have noticed that even in library mode we use MurmurHash3 to find out
>>>> the segment for particular key. For strings, this involves encoding into
>>>> UTF-8 and computation of hashCode, instead of just reading the cached
>>>> value in string. Common objects just remix the bits of hashCode. When
>>>> user provides custom Equivalence with non-default hashCode, it is not
>>>> used to determine the segment.
>>>> 
>>>> I think that in library mode we should rather use Equivalence.hashCode,
>>>> maybe XORed with some magic number so that there are less collisions in
>>>> DataContainer.
>>>> 
>>>> If we simply replaced the function in CH, we would break the case when
>>>> user starts HR server on top of library mode, as the clients expect key
>>>> location based on MurmurHash3. ATM user only has to set
>>>> AnyServerEquivalence for keys in DC; we would need to detect
>>>> configuration with server equivalence and set CH function to MH3, and
>>>> probably also log some warning if the equivalence is set to unknown
>>>> class and CH function is not specified.
>>>> 
>>>> WDYT?
>>>> 
>>>> Radim
>>>> 
>>>> --
>>>> Radim Vansa <rvansa at redhat.com>
>>>> JBoss Performance Team
>>>> 
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> --
>> Radim Vansa <rvansa at redhat.com>
>> JBoss Performance Team
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev