[infinispan-dev] Lock amortization preliminary performance numbers

Bryan Thompson bryan at systap.com
Tue Feb 2 19:49:00 EST 2010


Vladimir,

I now have a workload with a hotspot on our cache that accounts for 8% of the total time.  I am going to use it to test BCHM under a realistic workload.

One thing I like about your design is that by placing the array buffering the operations for a thread within the Segment, you are using the lock guarding the Segment to control updates to that array.  While this has less potential throughput than a truly thread-local design (which is non-blocking until the thread-local array is full), it seems that the array buffering the updates cannot "escape" under your design.  Was this intentional?
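
For concreteness, here is a tiny sketch of the ownership difference I mean (class and field names are hypothetical, not the actual BufferedConcurrentHashMap internals):

    // In the segment-scoped design the pending-update buffer is a field of the
    // Segment, so it lives exactly as long as the map itself and can always be
    // drained under the segment's own lock:
    class Segment<K, V> extends java.util.concurrent.locks.ReentrantLock {
        // owned by the map; drained while this segment's lock is held
        final java.util.Queue<K> pendingAccesses =
                new java.util.concurrent.ConcurrentLinkedQueue<K>();
    }

    // In a truly thread-local design the buffer is owned by each thread, so a
    // thread that buffers a few touches and never returns strands those updates
    // (and the strong references they hold) until the thread itself dies:
    class ThreadLocalDesign<K> {
        final ThreadLocal<java.util.ArrayList<K>> pendingAccesses =
                new ThreadLocal<java.util.ArrayList<K>>() {
                    @Override protected java.util.ArrayList<K> initialValue() {
                        return new java.util.ArrayList<K>();
                    }
                };
    }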

The danger with a true thread-local design is that a thread can come in and do some work, get some updates buffered in its thread-local array, and then never visit again.  In this case those updates would remain buffered on the thread and would not in fact cause the access order to be updated in a timely manner.  Worse, if you are relying on WeakReference semantics, the buffered updates would remain strongly reachable and the corresponding objects would be wired into the cache.

I've worked around this issue in a different context where it made sense to scope the buffers to an owning object (a B+Tree instance).  In that case, when the B+Tree container was closed, all buffered updates were discarded.  This nicely eliminated the problems with "escaping" threads.
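
Roughly, the pattern looked like this (hypothetical names, not our actual B+Tree code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Per-thread buffers scoped to an owning object: touches are batched through
    // the owner's lock when a buffer fills up, and everything still buffered is
    // simply discarded when the owner is closed, so an "escaping" thread cannot
    // keep references alive past the owner's lifetime.
    class BufferedTouchOwner<T> {

        private static final int BATCH_SIZE = 64;   // illustrative batch size

        private final ConcurrentMap<Thread, List<T>> buffers =
                new ConcurrentHashMap<Thread, List<T>>();

        void touch(T ref) {
            List<T> buf = buffers.get(Thread.currentThread());
            if (buf == null) {
                buf = new ArrayList<T>();
                buffers.put(Thread.currentThread(), buf);   // only this thread writes this key
            }
            buf.add(ref);
            if (buf.size() >= BATCH_SIZE) {
                synchronized (this) {
                    applyBatch(buf);                         // update the shared access order
                }
                buf.clear();
            }
        }

        private void applyBatch(List<T> batch) {
            // apply the buffered touches to the shared ring buffer (elided)
        }

        void close() {
            buffers.clear();                                 // discard all still-buffered updates
        }
    }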

However, I like your approach better.

Bryan

PS: I see a comment on LRU#onEntryHit() indicating that it is invoked without holding the Segment lock.  Is that true only when BCHM#get() is invoked?

PPS: Also, can you expand on the role of the LRU#accessQueue vs LRU#lruQueue?  Is there anything which corresponds to the total LRU order (after batching updates through the Segment)?

________________________________
From: infinispan-dev-bounces at lists.jboss.org [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Bryan Thompson
Sent: Wednesday, January 27, 2010 1:00 PM
To: infinispan -Dev List
Subject: Re: [infinispan-dev] Lock amortization preliminary performance numbers

Vladimir,

I recently replaced a synchronized ring buffer behind a ConcurrentHashMap with weak values with a variation of the ring buffer which buffers updates in thread-local arrays and then batches them through the lock.  For the specific application (a cache of frequently resolved terms in a dictionary), contention went to zero.  We use a very similar design with a ring buffer to hold onto recently touched B+Tree nodes and leaves.  This was a hot spot for concurrent readers, as they needed to synchronize when adding a reference to the ring buffer.  I made a similar change, using a variant of the ring buffer with thread-local buffers and batching through the lock, and that contention has also disappeared completely.  I have been using a 64- to 128-element per-thread array with barging into the lock at 50% of the buffer capacity.
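
In rough outline, the per-thread batching looks something like this (hypothetical names, not our production code):

    import java.util.concurrent.locks.ReentrantLock;

    // Each thread buffers touched references in its own small array and batches
    // them through the shared lock: once the buffer is half full we "barge" with
    // tryLock(), and once it is completely full we block on lock().
    class ThreadLocalBatcher<T> {

        private static final int CAPACITY = 128;                  // 64-128 in practice
        private static final int BARGE_THRESHOLD = CAPACITY / 2;  // barge at 50% full

        private final ReentrantLock lock = new ReentrantLock();   // guards the shared ring buffer

        private static final class Buffer {
            final Object[] refs = new Object[CAPACITY];
            int size;
        }

        private final ThreadLocal<Buffer> local = new ThreadLocal<Buffer>() {
            @Override protected Buffer initialValue() { return new Buffer(); }
        };

        void touch(T ref) {
            Buffer buf = local.get();
            buf.refs[buf.size++] = ref;

            if (buf.size == CAPACITY) {
                lock.lock();                                       // full: we must drain now
            } else if (buf.size >= BARGE_THRESHOLD && lock.tryLock()) {
                // half full and the lock happened to be free: drain opportunistically
            } else {
                return;                                            // keep buffering
            }
            try {
                drain(buf);
            } finally {
                lock.unlock();
            }
        }

        // called with the lock held
        private void drain(Buffer buf) {
            // append refs[0..size) to the shared ring buffer (elided), then clear
            // the slots so this thread no longer holds strong references to them
            for (int i = 0; i < buf.size; i++) {
                buf.refs[i] = null;
            }
            buf.size = 0;
        }
    }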

So, overall, this approach is working out very nicely for us in combination with WeakReferences.  One drawback of the approach is that hard references remain in the thread-local arrays until they are batched through the lock.  Depending on how threads are managed, it is possible for hard references to remain in the thread-local buffers long past their last touch.  For this reason we do not use this technique when the cache contains expensive resources for which we need a better guarantee of timely discard once the resource is no longer in use by the application.  It might be possible to address this concern with a protocol to attach and detach the thread-local buffers from threads on their way into / out of a thread pool.

I have not yet tried to use this approach in our main data record cache (caching data records read from the disk in memory).  We have a hot spot there still, but I am quite certain that we will be able to address it using similar techniques.  I will try out the infinispan BufferedConcurrentHashMap in that context soon.

Thanks,

Bryan


________________________________
From: infinispan-dev-bounces at lists.jboss.org [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Vladimir Blagojevic
Sent: Wednesday, January 27, 2010 10:30 AM
To: infinispan -Dev List
Subject: [infinispan-dev] Lock amortization preliminary performance numbers

Hi all,

As you probably recall, I am working on a LIRS eviction algorithm. But before we get there, Manik and I agreed that I should push in the direction of implementing a ConcurrentHashMap (CHM) variant that does lock amortization and eviction per segment of the CHM. So far I have implemented LRU eviction so we can verify the feasibility of this approach rather than the eviction algorithm itself. We are hoping that eviction precision will not suffer if we do eviction per segment, and you are all familiar with the lock-striping benefits of segments in CHM.
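
To make the idea concrete, here is a rough conceptual sketch (names and details are illustrative, not the actual BufferedConcurrentHashMap code): each segment keeps its own LRU order and its own share of the maximum capacity, hits are recorded cheaply, and LRU reordering plus any eviction is done in batches under the segment lock instead of on every single operation.

    import java.util.LinkedHashMap;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.locks.ReentrantLock;

    class LruSegment<K, V> {

        private static final int BATCH = 64;              // illustrative amortization threshold

        private final int maxEntries;                     // this segment's share of max capacity
        private final ReentrantLock lock = new ReentrantLock();
        private final Queue<K> recordedHits = new ConcurrentLinkedQueue<K>();
        private final LinkedHashMap<K, V> entries =       // access-ordered: eldest entry = LRU
                new LinkedHashMap<K, V>(16, 0.75f, true);

        LruSegment(int maxEntries) {
            this.maxEntries = maxEntries;
        }

        // read path: just record the hit, no lock taken here
        void recordHit(K key) {
            recordedHits.add(key);                        // a counter would be cheaper than size() below
        }

        void put(K key, V value) {
            lock.lock();
            try {
                entries.put(key, value);
                // amortize: only reorder/evict when over capacity or a batch has accumulated
                if (entries.size() > maxEntries || recordedHits.size() >= BATCH) {
                    drainHitsAndEvict();
                }
            } finally {
                lock.unlock();
            }
        }

        // must be called with the segment lock held
        private void drainHitsAndEvict() {
            K hit;
            while ((hit = recordedHits.poll()) != null) {
                entries.get(hit);                         // access-ordered map moves it to MRU
            }
            while (entries.size() > maxEntries) {         // evict LRU entries over capacity
                K eldest = entries.keySet().iterator().next();
                entries.remove(eldest);
            }
        }
    }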

So, I've cobbled together a first test to compare the eviction and lock-amortization enabled CHM (BCHM) with a regular CHM and a synchronized HashMap. The test is based on an already existing test [1] used to measure the performance of various DataContainers; I've changed it slightly to measure the Map directly instead of the DataContainer. The test launches a cohort of 48 reader, 4 writer and 1 remover threads. All operations are randomized: readers execute map.get(R.nextInt(NUM_KEYS)), writers map.put(R.nextInt(NUM_KEYS), "value"), and removers map.remove(R.nextInt(NUM_KEYS)). NUM_KEYS was set to 10K, and each thread performs 1K operations in a loop.
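
For reference, the shape of the test loop is roughly the following (a simplified sketch; the real test is [1] and it additionally reports ops/ms per operation type):

    import java.util.Map;
    import java.util.Random;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class MapStressSketch {

        static final int NUM_KEYS = 10000;
        static final int OPS_PER_THREAD = 1000;

        static void run(final Map<Integer, String> map) throws InterruptedException {
            ExecutorService pool = Executors.newCachedThreadPool();
            final CountDownLatch start = new CountDownLatch(1);

            for (int i = 0; i < 48; i++) {                 // 48 reader threads
                pool.submit(worker(start, map, 'G'));
            }
            for (int i = 0; i < 4; i++) {                  // 4 writer threads
                pool.submit(worker(start, map, 'P'));
            }
            pool.submit(worker(start, map, 'R'));          // 1 remover thread

            start.countDown();                             // release all threads at once
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            System.out.println("Size = " + map.size());
        }

        // each worker performs OPS_PER_THREAD randomized operations in a loop
        private static Runnable worker(final CountDownLatch start,
                                       final Map<Integer, String> map, final char op) {
            return new Runnable() {
                public void run() {
                    Random r = new Random();
                    try {
                        start.await();
                    } catch (InterruptedException e) {
                        return;
                    }
                    for (int i = 0; i < OPS_PER_THREAD; i++) {
                        switch (op) {
                            case 'G': map.get(r.nextInt(NUM_KEYS)); break;
                            case 'P': map.put(r.nextInt(NUM_KEYS), "value"); break;
                            case 'R': map.remove(r.nextInt(NUM_KEYS)); break;
                        }
                    }
                }
            };
        }
    }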

The initial capacity for CHM and HashMap was set to 1K, and the max capacity for the eviction and lock-amortization enabled CHM was set to 256; therefore BCHM has to do a lot of evictions, which is evident in the final map sizes listed below.

Size = 9999
Performance for container ConcurrentHashMap
Average get ops/ms 338
Average put ops/ms 87
Average remove ops/ms 171

Size = 193
Performance for container BufferedConcurrentHashMap
Average get ops/ms 322
Average put ops/ms 32
Average remove ops/ms 74

Size = 8340
Performance for container SynchronizedMap
Average get ops/ms 67
Average put ops/ms 45
Average remove ops/ms 63

If I remove lock amortization from BufferedConcurrentHashMap, that is, if we attempt eviction on every get/put/remove, put/remove performance for BCHM basically goes to zero! As far as I can interpret these results, BCHM get performance does not suffer at all in comparison with CHM, as it does for the single-lock HashMap. Predictably, for the single-lock HashMap each type of operation takes on average almost the same amount of time. We pay a hit on BCHM put/remove operations in comparison with CHM, but the numbers are promising, I think. If you have comments, or suggestions on how to test these map variants in a different, possibly more insightful way, speak up!

Cheers,
Vladimir


[1] http://fisheye.jboss.org/browse/~raw,r=1264/Infinispan/trunk/core/src/test/java/org/infinispan/stress/DataContainerStressTest.java
