[infinispan-dev] Cachestores performance

Thu Jun 27 08:37:43 EDT 2013

----- Original Message -----
| From: "Galder Zamarreño" <galder at redhat.com>
| To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
| Sent: Thursday, June 27, 2013 1:52:11 PM
| Subject: Re: [infinispan-dev] Cachestores performance
| 
| > As for Karsten's FCS implementation, I too have issues with the key set and
| > value offsets being solely in memory.  However I think that could be
| > improved by storing only a certain number of keys/offsets in memory, and
| > flushing the rest to disk again into an index file.
| 
| ^ Karsten's implementation makes this relatively easy to achieve because it
| already keeps this mapping in a LinkedHashMap (with a given max entries
| limit [1]) assuming removeEldestEntry() is overridden to flush to disk older
| entries. Some extra logic would be needed to bring back data from the disk
| too… but your suggestion below is also quite interesting...

I certainly wouldn't call this an easy task, because the most problematic part is what we do when the whole entry (both key and value) is gone from memory and we want to read it: that requires keeping some searchable structure on disk. And that's the hard stuff.
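
To make the quoted LinkedHashMap idea concrete, here is a minimal sketch, assuming a hypothetical SpillIndex interface that stands in for the searchable on-disk structure (which is exactly the hard part and is not implemented here). BoundedOffsetMap, SpillIndex and InMemorySpillIndex are made-up names for illustration, not classes from Karsten's store.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the searchable on-disk index. */
interface SpillIndex<K> {
    void store(K key, long offset);
    Long load(K key); // null if the key is unknown
}

/** Toy SpillIndex kept in a HashMap so the sketch runs; a real one would live on disk. */
class InMemorySpillIndex<K> implements SpillIndex<K> {
    private final Map<K, Long> entries = new HashMap<>();
    public void store(K key, long offset) { entries.put(key, offset); }
    public Long load(K key) { return entries.get(key); }
}

/** Keeps at most maxEntries key -> offset mappings in memory and spills the eldest one. */
class BoundedOffsetMap<K> extends LinkedHashMap<K, Long> {
    private final int maxEntries;
    private final SpillIndex<K> spill;

    BoundedOffsetMap(int maxEntries, SpillIndex<K> spill) {
        super(16, 0.75f, true); // access order: the eldest entry is the least recently used
        this.maxEntries = maxEntries;
        this.spill = spill;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, Long> eldest) {
        if (size() <= maxEntries) {
            return false;
        }
        // Flush the mapping before LinkedHashMap drops it from memory.
        spill.store(eldest.getKey(), eldest.getValue());
        return true;
    }

    /** Looks in memory first, then falls back to the spilled index and re-promotes a hit. */
    Long offsetOf(K key) {
        Long offset = get(key);
        if (offset == null) {
            offset = spill.load(key);
            if (offset != null) {
                put(key, offset); // may in turn spill another entry
            }
        }
        return offset;
    }
}
```

The sketch ignores removals and concurrency, and a re-promoted key leaves a stale copy in the spill index until it is spilled again.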

| 
| > I believe LevelDB follows a similar design, but I think Karsten's FCS will
| > perform better than LevelDB since it doesn't attempt to maintain a sorted
| > structure on disk.
| 
| ^ In-memory, the structure can optionally be ordered if it's bound [1],
| otherwise it's just a normal map. How would we store it at the disk level?
| B+ tree with hashes of keys and then linked lists?

Before deciding "I love B#@& trees, let's use B#@& trees!", I'd find out what requirements we have for the structure. I believe the index itself should not be considered persistent, as it can be rebuilt when preloading the data (reading the data sequentially is fast, so we can afford this indexing preload). The only reason for keeping the index on disk is that we don't have enough memory to store all keys, or even key hashes. Therefore it does not have to be updated synchronously with the writes. It should be mostly read-optimized, because reads are where we need synchronous access to this structure.
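
To illustrate the "rebuild during preload" argument, here is a rough sketch of a sequential scan that recreates a hash -> offsets index. The record layout (4-byte key length, 4-byte value length, key bytes, value bytes) and the IndexRebuilder name are assumptions made up for this example; they are not the format of any of the benchmarked stores.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexRebuilder {

    /** Scans the data file front to back and maps key hash -> offsets of records with that hash. */
    static Map<Integer, List<Long>> rebuild(Path dataFile) throws IOException {
        Map<Integer, List<Long>> index = new HashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(dataFile)))) {
            long offset = 0;
            while (true) {
                int keyLen;
                try {
                    keyLen = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // end of file (or a torn header at the tail)
                }
                int valueLen = in.readInt();
                byte[] key = new byte[keyLen];
                in.readFully(key);
                skipFully(in, valueLen); // values are not needed to build the index
                index.computeIfAbsent(Arrays.hashCode(key), h -> new ArrayList<>()).add(offset);
                offset += 8L + keyLen + valueLen;
            }
        }
        return index;
    }

    private static void skipFully(DataInputStream in, int count) throws IOException {
        int remaining = count;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("truncated record");
            }
            remaining -= skipped;
        }
    }
}
```

The result is held in a HashMap only to keep the example short; in the scenario above the scan would feed whatever read-optimized on-disk structure is chosen.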

| 
| > One approach to maintaining keys and offsets in memory could be a
| > WeakReference that points to the key stored in the in-memory
| > DataContainer.  Once evicted from the DC, then the CacheStore impl would
| > need to fetch the key again from the index file before looking up the
| > value in the actual store.
| 
| ^ Hmmm, interesting idea… has the potential to save memory space by not
| having to keep that extra data structure in the cache store.

You mean mixing the DataContainer and its xCacheEntry implementations with the cache store implementation? Is that possible from a design perspective?
Speaking of different kinds of references, we could even compensate for poorly tuned eviction with SoftReferences: even if the entry was evicted from the main DataContainer, we'd keep the value referenced from the cache store (so it would not have to be loaded from disk if it is referenced before garbage collection). But that may be premature optimization. To have eviction managed in relation to GC, we should rather combine this with PhantomReferences, where entries would be written to the cache upon finalization.

| 
| > This way we have hot items always in memory, semi-hot items with offsets in
| > memory and values on disk, and cold items needing to be read off disk
| > entirely (both offset and value).  Also for write-through and
| > write-behind, as long as the item is hot or warm (key and offset in
| > memory), writing will be pretty fast.
| 
| My worry about Karsten's impl is writing actually. If you look at the last
| performance numbers in [2], where we see the performance difference of
| force=true and force=false in Karsten's cache store compared with LevelDB
| JNI, you see that force=false is fastest, then JNI LevelDB, and the
| force=true. Me wonders what kind of write guarantees LevelDB JNI provides
| (and the JAVA version)...

Just for clarification: the fast implementation is without force at all, the slower one is with force(false). Force(true) means also updating metadata (such as access times?), which is not required for a cache store.
But the numbers suggest that random access with syncing is really not a good option, and that we should rather use a temporary append-only log, which would be persisted into the structured DB by a different thread (as LevelDB does, I suppose).

Thinking about all the levels and cache structures optimizing read access, I can see four levels of search structures: key + value (the usual DataContainer), key + offset, hash + offset, and everything on disk. The "hash + offset" level may seem superfluous, but for some use cases with big keys it may be worth it to spare a few disk look-ups.
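
The four levels could be walked roughly like this. It is only a sketch: plain HashMaps stand in for every level (including the on-disk one), and TieredLookup and its field names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class TieredLookup {
    enum Level { KEY_VALUE, KEY_OFFSET, HASH_OFFSET, ON_DISK, MISS }

    final Map<String, byte[]> dataContainer = new HashMap<>(); // level 1: key + value in memory
    final Map<String, Long> keyOffsets = new HashMap<>();      // level 2: key + offset in memory
    final Map<Integer, Long> hashOffsets = new HashMap<>();    // level 3: key hash + offset in memory
    final Map<String, Long> onDiskIndex = new HashMap<>();     // level 4: stand-in for the on-disk index

    /** Reports the cheapest level that can answer a lookup for the given key. */
    Level resolve(String key) {
        if (dataContainer.containsKey(key)) {
            return Level.KEY_VALUE;
        }
        if (keyOffsets.containsKey(key)) {
            return Level.KEY_OFFSET;
        }
        if (hashOffsets.containsKey(key.hashCode())) {
            // Hashes can collide, so a real store must compare the key stored at that offset.
            return Level.HASH_OFFSET;
        }
        if (onDiskIndex.containsKey(key)) {
            return Level.ON_DISK;
        }
        return Level.MISS;
    }
}
```

With big keys, level 3 costs only a hash and an offset per entry, which is why it can hold many more entries than level 2 in the same amount of memory.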

Radim

| > 
| > On 27 Jun 2013, at 10:33, Radim Vansa <rvansa at redhat.com> wrote:
| > 
| >> Oops, by the cache store I mean the previously-superfast
| >> KarstenFileCacheStore implementation.
| >> 
| >> ----- Original Message -----
| >> | From: "Radim Vansa" <rvansa at redhat.com>
| >> | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
| >> | Sent: Thursday, June 27, 2013 11:30:53 AM
| >> | Subject: Re: [infinispan-dev] Cachestores performance
| >> | 
| >> | I have added FileChannel.force(false) flushes after all write operations in
| >> | the cache store, and now the comparison is also updated with these values.
| >> | 
| >> | Radim
| >> | 
| >> | ----- Original Message -----
| >> | | From: "Radim Vansa" <rvansa at redhat.com>
| >> | | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
| >> | | Sent: Thursday, June 27, 2013 8:54:25 AM
| >> | | Subject: Re: [infinispan-dev] Cachestores performance
| >> | | 
| >> | | Yep, write-through. LevelDB JAVA used FileChannelTable implementation
| >> | | (-Dleveldb.mmap), because Mmaping is not implemented very well and causes
| >> | | JVM crashes (I believe it's because of calling non-public API via
| >> | | reflection - I've found post from the Oracle JVM guys discouraging the
| >> | | particular trick it uses). After writing the record to the log, it calls
| >> | | FileChannel.force(true), therefore, it should be really on the disc by
| >> | | that moment.
| >> | | I have not looked into the JNI implementation but I expect the same.
| >> | | 
| >> | | By the way, I have updated [1] with numbers when running on more data
| >> | | (2 GB instead of 100 MB). I won't retype it here, so look there. The
| >> | | performance is much lower.
| >> | | I may try also increase JVM heap size and try with a bit more data yet.
| >> | | 
| >> | | Radim
| >> | | 
| >> | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
| >> | | 
| >> | | ----- Original Message -----
| >> | | | From: "Erik Salter" <an1310 at hotmail.com>
| >> | | | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
| >> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
| >> | | | Subject: Re: [infinispan-dev] Cachestores performance
| >> | | | 
| >> | | | These were write-through cache stores, right?  And with LevelDB, this was
| >> | | | through to the database file itself?
| >> | | | 
| >> | | | Erik
| >> | | | 
| >> | | | -----Original Message-----
| >> | | | From: infinispan-dev-bounces at lists.jboss.org
| >> | | | [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Radim
| >> | | | Vansa
| >> | | | Sent: Wednesday, June 26, 2013 11:24 AM
| >> | | | To: infinispan -Dev List
| >> | | | Subject: [infinispan-dev] Cachestores performance
| >> | | | 
| >> | | | Hi all,
| >> | | | 
| >> | | | according to [1] I've created the comparison of performance in
| >> | | | stress-tests.
| >> | | | 
| >> | | | All setups used local-cache, benchmark was executed via Radargun
| >> | | | (actually version not merged into master yet [2]). I've used 4 nodes
| >> | | | just to get more data - each slave was absolutely independent of the
| >> | | | others.
| >> | | | 
| >> | | | First test was preloading performance - the cache started and tried to
| >> | | | load 1GB of data from harddrive. Without cachestore the startup takes
| >> | | | about 2 - 4 seconds, average numbers for the cachestores are below:
| >> | | | 
| >> | | | FileCacheStore:        9.8 s
| >> | | | KarstenFileCacheStore:  14 s
| >> | | | LevelDB-JAVA impl.:   12.3 s
| >> | | | LevelDB-JNI impl.:    12.9 s
| >> | | | 
| >> | | | IMO nothing special, all times seem affordable. We don't benchmark
| >> | | | exactly storing the data into the cachestore, here FileCacheStore took
| >> | | | about 44 minutes, while Karsten about 38 seconds, LevelDB-JAVA 4 minutes
| >> | | | and LevelDB-JNI 96 seconds. The units are right, it's minutes compared
| >> | | | to seconds. But we all know that FileCacheStore is bloody slow.
| >> | | | 
| >> | | | Second test is stress test (5 minutes, preceded by 2 minute warmup)
| >> | | | where each of 10 threads works on 10k entries with 1kB values (~100 MB
| >> | | | in total). 20 % writes, 80 % reads, as usual. No eviction is configured,
| >> | | | therefore the cache-store works as a persistent storage only for case of
| >> | | | crash.
| >> | | | 
| >> | | | FileCacheStore:         3.1M reads/s   112 writes/s  // on one node the performance was only 2.96M reads/s 75 writes/s
| >> | | | KarstenFileCacheStore:  9.2M reads/s  226k writes/s  // yikes!
| >> | | | LevelDB-JAVA impl.:     3.9M reads/s  5100 writes/s
| >> | | | LevelDB-JNI impl.:      6.6M reads/s   14k writes/s  // on one node the performance was 3.9M/8.3k - about half of the others
| >> | | | Without cache store:   15.5M reads/s  4.4M writes/s
| >> | | | 
| >> | | | Karsten implementation pretty rules here for two reasons. First of all,
| >> | | | it does not flush the data (it calls only RandomAccessFile.write()).
| >> | | | Other cheat is that it stores in-memory the keys and offsets of data
| >> | | | values in the database file. Therefore, it's definitely the best choice
| >> | | | for this scenario, but it does not allow to scale the cache-store,
| >> | | | especially in cases where the keys are big and values small. However,
| >> | | | this performance boost is definitely worth checking - I could think of
| >> | | | caching the disk offsets in memory and querying persistent index only in
| >> | | | case of missing record, with part of the persistent index flushed
| >> | | | asynchronously (the index can be always rebuilt during the preloading
| >> | | | for case of crash).
| >> | | | 
| >> | | | The third test should have tested the scenario with more data to be
| >> | | | stored than memory - therefore, the stressors operated on 100k entries
| >> | | | (~100 MB of data) but eviction was set to 10k entries (9216 entries
| >> | | | ended up in memory after the test has ended).
| >> | | | 
| >> | | | FileCacheStore:            750 reads/s         285 writes/s  // one node had only 524 reads and 213 writes per second
| >> | | | KarstenFileCacheStore:    458k reads/s        137k writes/s
| >> | | | LevelDB-JAVA impl.:        21k reads/s          9k writes/s  // a bit varying performance
| >> | | | LevelDB-JNI impl.:     13k-46k reads/s  6.6k-15.2k writes/s  // the performance varied a lot!
| >> | | | 
| >> | | | 100 MB of data is not much, but it takes so long to push it into
| >> | | | FileCacheStore that I won't use more unless we exclude this loser from
| >> | | | the comparison :)
| >> | | | 
| >> | | | Radim
| >> | | | 
| >> | | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
| >> | | | [2] https://github.com/rvansa/radargun/tree/t_keygen
| >> | | | 
| >> | | | -----------------------------------------------------------
| >> | | | Radim Vansa
| >> | | | Quality Assurance Engineer
| >> | | | JBoss Datagrid
| >> | | | tel. +420532294559 ext. 62559
| >> | | | 
| >> | | | Red Hat Czech, s.r.o.
| >> | | | Brno, Purkyňova 99/71, PSČ 612 45
| >> | | | Czech Republic
| >> | | | 
| >> | | | 
| >> | | | _______________________________________________
| >> | | | infinispan-dev mailing list
| >> | | | infinispan-dev at lists.jboss.org
| >> | | | https://lists.jboss.org/mailman/listinfo/infinispan-dev
| >> | | | 
| >> | | | 
| >> | | 
| >> | 
| >> 
| > 
| > --
| > Manik Surtani
| > manik at jboss.org
| > twitter.com/maniksurtani
| > 
| > Platform Architect, JBoss Data Grid
| > http://red.ht/data-grid
| > 
| > 
| 
| 
| --
| Galder Zamarreño
| galder at redhat.com
| twitter.com/galderz
| 
| Project Lead, Escalante
| http://escalante.io
| 
| Engineer, Infinispan
| http://infinispan.org
| 
| 