[infinispan-dev] Cachestores performance

Tue Jul 2 14:01:13 EDT 2013

Thanks Erik! That's exactly the kind of use-case description I'd like to collect.

If I could summarise the main difference from my use case, it's that
you are implying a non-shared (local) CacheStore, right?

I wouldn't mind having that on my own wishlist too, but as you also
point out it's unclear how to restore the grid state from the various
pieces of stored data. In my case I'm assuming a single shared
CacheStore as that's the only scenario in which consistency in the
CacheStore isn't puzzling me too much.

On to your recovery approach: I don't think collecting a copy of all
SST files will save you: many nodes might have older copies of the
same entry which were not cleaned up after they lost the role of
"owner" of that particular entry.

You could even enter the messy situation of a node which
 - is an owner for K1
 - stores some value K1->V1
 - is no longer an owner
 - K1 is deleted by other activity
 - becomes an owner again (no state transfer happens as the entry is gone)
 - receives a GET and serves the stale V1 from its CacheStore

.. and I could make up more horror stories like this ;-)
Bottom line: I would not advise DIST + local cachestores unless
your app has some peculiarity (like never reusing a key) which avoids
such situations. At least, this is why I can't use it.
This is just an example of the kinds of limitations I would like to
collect on a per-configuration basis.
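
The K1 sequence above can be replayed with a toy model. This is only a sketch under assumptions: LocalStoreNode and StaleReadDemo are made-up names, the two maps stand in for the in-memory container and the on-disk store, and none of it is Infinispan API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one node with an in-memory container and a private (non-shared) store. */
class LocalStoreNode {
    final Map<String, String> memory = new HashMap<>();      // in-memory entries
    final Map<String, String> localStore = new HashMap<>();  // stands in for the on-disk store
    boolean owner;

    void put(String key, String value) {
        if (!owner) return;              // non-owners never see the write
        memory.put(key, value);
        localStore.put(key, value);      // write-through to the local store
    }

    void remove(String key) {
        if (!owner) return;              // non-owners never see the delete either
        memory.remove(key);
        localStore.remove(key);
    }

    /** Ownership is lost: memory is dropped, but the local store is left untouched. */
    void loseOwnership() {
        owner = false;
        memory.clear();
    }

    /** Nothing is transferred here: the key no longer exists anywhere in the grid. */
    void gainOwnership() {
        owner = true;
    }

    String get(String key) {
        String v = memory.get(key);
        return v != null ? v : localStore.get(key);  // fall back to the local store on a miss
    }
}

class StaleReadDemo {
    /** Replays the sequence and returns what the final GET serves. */
    static String replay() {
        LocalStoreNode node = new LocalStoreNode();
        node.gainOwnership();
        node.put("K1", "V1");
        node.loseOwnership();
        node.remove("K1");     // the delete happens elsewhere; this node ignores it
        node.gainOwnership();
        return node.get("K1");
    }
}
```

StaleReadDemo.replay() returns "V1" even though K1 was deleted, which is the stale read described in the list.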

Sanne

On 2 July 2013 18:39, Erik Salter <an1310 at hotmail.com> wrote:
> I concur with part of the below, but with a few changes:
>
> - The cache is the primary storage, similar to Sanne's case. (DIST mode)
> - My customers are not interested in extra components to the system, like databases or Cassandra nodes.  They wonder why they can't simply use the existing file system on the nodes they have.
> - I'm only going to be using the filesystem to recover in the case of upgrades and catastrophic failures.  So during normal operation, flushes to disk cannot impact cluster performance.
> - Most importantly, there needs to be a way, scripted or otherwise, to recover the keys from local storage in a DIST-mode cache.  I cannot guarantee anything regarding node ordering, so anything about persisting segment info/previous CHs is out.  If that means copying all LevelDB SST files to all nodes and restarting them, that's fine.
>
> At the executive levels of my customer, they don't see (or really care about) the differentiation between data grids and MySQL -- only that one has file-based persistence and the other doesn't.
>
> In production, we've already taken a massive outage where an unbelievable series of coincidences occurred to reveal a JBoss AS bug that ended up deadlocking all threads in the cluster and we had to restart all nodes.   And I'm sure it'll happen again.
>
> Hope this offers some user perspective.
>
> Erik
>
> -----Original Message-----
> From: infinispan-dev-bounces at lists.jboss.org [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Sanne Grinovero
> Sent: Tuesday, July 02, 2013 8:47 AM
> To: infinispan -Dev List
> Subject: Re: [infinispan-dev] Cachestores performance
>
> It would be nice to have a deck of "cheat sheets" on the expected use cases and guarantees: to me it looks like everyone is trying to solve a different problem / having a different problem in mind.
>
> My own take on it:
>
> Scenario #1
> I'll primarily use Infinispan with DIST, and I don't care much for other options. Reliability is guaranteed via numOwners>1, NOT by persisting to disk: if a node fails, I kill the VM (the machine, not the Java process) and start new ones to compensate: I'm assuming cloud nodes, so it's likely that when a failed node is gone, the disk is gone as well, with all the carefully stored data.
> I will use Infinispan primarily to absorb write spikes - so a "synch flush" is no good for me - and to boost read performance by as much memory as I can throw at it.
> CacheStore is used for two reasons:
>  - overflow (LIRS+passivation) for when the memory is not enough
>  - clean shutdown: you can think of it as a way to be able to upgrade some component in the system (Infinispan or my own); I would expect some kind of "JMX flush" operation to do a clean shutdown without data loss.
>
> Given such a scenario, I am not interested at all in synchronous storage. Before we commit to a design which is basically assuming the need for synchronous storage guarantees, I'd like to understand what kind of use case it's aiming to solve.
>
> It would be great to document each such use case and put down a table of things which can be expected, which features should not be expected (be very explicit on the limitations), and how basic operations are expected to be performed in the scenario: like how do you do a rolling upgrade in Scenario #1? How do you do a backup? And of course some configurations & code examples.
>
> Only then would we be able to pick a design (or multiple ones); for my use case the proposal from Karsten seems excellent, so I'm wondering why I should be looking for alternatives, and wondering why everyone is still wasting time on different discussions :-D
>
> I'm pretty sure there are people looking forward to a synch-CacheStore too: if you could nail down such a scenario however I'm pretty sure that some other considerations would not be taken into account (like consistency of data when reactivating a dormant node), so I suspect that just implementing such a component would actually not make any new architecture possible, as you would get blocked by other problems which need to be solved too.. better define all expectations asap!
>
> To me this thread smells of needing the off-heap Direct Memory buffers which I suggested [long time ago] to efficiently offload internal buffers, but failing to recognise this we're pushing responsibility to an epic level complex CacheStore.. guys let's not forget that a major bottleneck of CacheStores today is the SPI it has to implement, we identified several limitations in the contract in the past which prevent a superior efficiency: we're working towards a major release now so I'd rather focus on the API changes which will make it possible to get decent performance even without changing any storage engine..
> I'm pretty sure Cassandra (to pick one) doesn't scale too badly.
>
> Cheers,
> Sanne
>
>
>
> On 2 July 2013 10:09, Radim Vansa <rvansa at redhat.com> wrote:
>> Hi,
>>
>> I've written down this proposal for the implementation of a new cache store.
>>
>> https://community.jboss.org/wiki/BrnoCacheStoreDesignProposal
>>
>> WDYT?
>>
>> Radim
>>
>> ----- Original Message -----
>> | From: "Radim Vansa" <rvansa at redhat.com>
>> | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
>> | Sent: Thursday, June 27, 2013 2:37:43 PM
>> | Subject: Re: [infinispan-dev] Cachestores performance
>> |
>> |
>> |
>> | ----- Original Message -----
>> | | From: "Galder Zamarreño" <galder at redhat.com>
>> | | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
>> | | Sent: Thursday, June 27, 2013 1:52:11 PM
>> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | |
>> | | > As for Karsten's FCS implementation, I too have issues with the
>> | | > key set and value offsets being solely in memory.  However I
>> | | > think that could be improved by storing only a certain number of
>> | | > keys/offsets in memory, and flushing the rest to disk again into
>> | | > an index file.
>> | |
>> | | ^ Karsten's implementation makes this relatively easy to achieve
>> | | because it already keeps this mapping in a LinkedHashMap (with a
>> | | given max entries limit [1]) assuming removeEldestEntry() is
>> | | overridden to flush to disk older entries. Some extra logic would
>> | | be needed to bring back data from the disk too… but your suggestion below is also quite interesting...
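
A minimal sketch of the LinkedHashMap idea described above, not Karsten's actual code: BoundedOffsetIndex and its spilled map are hypothetical names, and the spilled map merely stands in for a real on-disk index file.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Bounded key -> file offset index; entries pushed out of memory land in a stand-in "disk" map. */
class BoundedOffsetIndex extends LinkedHashMap<String, Long> {
    private final int maxEntries;
    final Map<String, Long> spilled = new HashMap<>();  // placeholder for an on-disk index file

    BoundedOffsetIndex(int maxEntries) {
        super(16, 0.75f, true);  // access order, so the eldest entry is the least recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        if (size() > maxEntries) {
            spilled.put(eldest.getKey(), eldest.getValue());  // "flush" before dropping from memory
            return true;
        }
        return false;
    }

    /** Looks in memory first, then pulls the offset back in from the spill area. */
    Long lookup(String key) {
        Long offset = get(key);
        if (offset == null) {
            offset = spilled.remove(key);
            if (offset != null) {
                put(key, offset);  // may in turn spill another entry
            }
        }
        return offset;
    }
}
```

The lookup() method is the "extra logic" for bringing data back; with a real index file it would need a searchable on-disk structure, which this sketch sidesteps.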
>> |
>> | I certainly wouldn't call this an easy task, because the most
>> | problematic part is what we will do when the whole entry (both key
>> | and value) is gone from memory and we want to read it - that
>> | requires keeping some searchable structure on-disk. And that's the hard stuff.
>> |
>> | |
>> | | > I believe LevelDB follows a similar design, but I think
>> | | > Karsten's FCS will perform better than LevelDB since it doesn't
>> | | > attempt to maintain a sorted structure on disk.
>> | |
>> | | ^ In-memory, the structure can optionally be ordered if it's bound
>> | | [1], otherwise it's just a normal map. How would we store it at the disk level?
>> | | B+ tree with hashes of keys and then linked lists?
>> |
>> | Before choosing "I love B#@& trees, let's use B#@& trees!", I'd find
>> | out what requirements we have for the structure. I believe that
>> | the index itself should not be considered persistent, as it can be
>> | rebuilt when preloading the data (sequentially reading the data is
>> | fast, therefore we can afford to do this indexing preload), the reason
>> | of the index being on-disk is that we don't have enough memory to
>> | store all keys, or even key hashes. Therefore it does not have to be
>> | updated synchronously with the writes. It should be mostly
>> | read-optimized then, because that's the thing where we need synchronous access to this structure.
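
Rebuilding the index during preload could look roughly like the sketch below. The record layout (key length, value length, key bytes, value bytes) and the IndexRebuild name are assumptions made for illustration; a ByteBuffer stands in for the data file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class IndexRebuild {
    /** Appends one record in the assumed layout: keyLen, valueLen, key bytes, value bytes. */
    static void append(ByteBuffer log, String key, byte[] value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        log.putInt(k.length).putInt(value.length).put(k).put(value);
    }

    /** Scans the data sequentially and maps each key to the offset of its latest record. */
    static Map<String, Integer> rebuild(ByteBuffer log) {
        Map<String, Integer> index = new HashMap<>();
        ByteBuffer in = log.duplicate();  // independent position, shared content
        in.flip();                        // read from 0 up to what has been written
        while (in.remaining() >= 8) {
            int offset = in.position();
            int keyLen = in.getInt();
            int valueLen = in.getInt();
            byte[] k = new byte[keyLen];
            in.get(k);
            in.position(in.position() + valueLen);  // values are not needed for the index
            index.put(new String(k, StandardCharsets.UTF_8), offset);  // later records win
        }
        return index;
    }
}
```

Only keys and lengths are inspected and the values are skipped, so the pass stays sequential.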
>> |
>> | |
>> | | > One approach to maintaining keys and offsets in memory could be
>> | | > a WeakReference that points to the key stored in the in-memory
>> | | > DataContainer.  Once evicted from the DC, then the CacheStore
>> | | > impl would need to fetch the key again from the index file
>> | | > before looking up the value in the actual store.
>> | |
>> | | ^ Hmmm, interesting idea… has the potential to save the memory
>> | | space by not having to keep that extra data structure in the cache store.
>> |
>> | You mean to mix the DataContainer with xCacheEntry implementation
>> | and the cache store implementation? Is that possible from design perspective?
>> | Speaking about different kind of references, we may even optimize
>> | not-well-tuned eviction by SoftReferences, so that even if the entry
>> | was evicted from main DataContainer, we'd keep the value referenced
>> | from the cache-store (and this does not have to be loaded from disk
>> | if referenced before garbage collection). But such thought may be premature optimization.
>> | For having eviction managed in relation with GC we should rather
>> | combine this with PhantomReferences, where entries would be written
>> | to cache upon finalization.
>> |
>> | |
>> | | > This way we have hot items always in memory, semi-hot items with
>> | | > offsets in memory and values on disk, and cold items needing to
>> | | > be read off disk entirely (both offset and value).  Also for
>> | | > write-through and write-behind, as long as the item is hot or
>> | | > warm (key and offset in memory), writing will be pretty fast.
>> | |
>> | | My worry about Karsten's impl is writing actually. If you look at
>> | | the last performance numbers in [2], where we see the performance
>> | | difference of force=true and force=false in Karsten's cache store
>> | | compared with LevelDB JNI, you see that force=false is fastest,
>> | | then JNI LevelDB, and then force=true. Me wonders what kind of
>> | | write guarantees LevelDB JNI provides (and the JAVA version)...
>> |
>> | Just for clarification: the fast implementation is without force at
>> | all, the slower is with force(false). Force(true) means updating
>> | metadata (such as access times?) which is not required for cache-store.
>> | But the numbers suggest that the random access with syncing is
>> | really not a good option, and that we should rather use the
>> | temporary append-only log, which would be persisted into structured
>> | DB by different thread (as LevelDB does, I suppose).
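
For reference, a sketch of an append-only log that syncs with FileChannel.force(false) after each write. AppendOnlyLog is a made-up class, not code from any of the benchmarked stores, and the background thread that would move records into a structured DB is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Append-only log: each record goes to the end of the file, optionally followed by a sync. */
class AppendOnlyLog implements AutoCloseable {
    private final FileChannel channel;
    private final boolean sync;

    AppendOnlyLog(Path file, boolean sync) {
        try {
            this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.sync = sync;
    }

    /** Appends one length-prefixed record and returns the file size afterwards. */
    long append(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length).put(payload);
        buf.flip();
        try {
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            if (sync) {
                channel.force(false);  // false: flush file content only, metadata may lag
            }
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Constructing it with sync=false corresponds to the "without force at all" case, and passing true to force() instead would also flush file metadata.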
>> |
>> | Thinking about all the levels and cache structures optimizing the
>> | read access, I can see four levels of search structures: key + value
>> | (usual DataContainer), key + offset, hash + offset, all on disk. The
>> | "hash + offset" may seem superfluous but for some use-cases with big
>> | keys it may be worth sparing a few disk look-ups.
>> |
>> | Radim
>> |
>> | | >
>> | | > On 27 Jun 2013, at 10:33, Radim Vansa <rvansa at redhat.com> wrote:
>> | | >
>> | | >> Oops, by the cache store I mean the previously-superfast
>> | | >> KarstenFileCacheStore implementation.
>> | | >>
>> | | >> ----- Original Message -----
>> | | >> | From: "Radim Vansa" <rvansa at redhat.com>
>> | | >> | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
>> | | >> | Sent: Thursday, June 27, 2013 11:30:53 AM
>> | | >> | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> |
>> | | >> | I have added FileChannel.force(false) flushes after all write
>> | | >> | operations in the cache store, and now the comparison is also
>> | | >> | updated with these values.
>> | | >> |
>> | | >> | Radim
>> | | >> |
>> | | >> | ----- Original Message -----
>> | | >> | | From: "Radim Vansa" <rvansa at redhat.com>
>> | | >> | | To: "infinispan -Dev List" <infinispan-dev at lists.jboss.org>
>> | | >> | | Sent: Thursday, June 27, 2013 8:54:25 AM
>> | | >> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | |
>> | | >> | | Yep, write-through. LevelDB JAVA used the FileChannelTable
>> | | >> | | implementation (-Dleveldb.mmap), because mmapping is not
>> | | >> | | implemented very well and causes JVM crashes (I believe
>> | | >> | | it's because of calling non-public API via reflection
>> | | >> | | - I've found a post from the Oracle JVM guys discouraging the
>> | | >> | | particular trick it uses). After writing the record to the
>> | | >> | | log, it calls FileChannel.force(true), therefore, it should
>> | | >> | | really be on the disk by that moment.
>> | | >> | | I have not looked into the JNI implementation but I expect the same.
>> | | >> | |
>> | | >> | | By the way, I have updated [1] with numbers when running on
>> | | >> | | more data (2 GB instead of 100 MB). I won't retype it here,
>> | | >> | | so look there. The performance is much lower.
>> | | >> | | I may also try increasing the JVM heap size and try with a
>> | | >> | | bit more data yet.
>> | | >> | |
>> | | >> | | Radim
>> | | >> | |
>> | | >> | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | |
>> | | >> | | ----- Original Message -----
>> | | >> | | | From: "Erik Salter" <an1310 at hotmail.com>
>> | | >> | | | To: "infinispan -Dev List"
>> | | >> | | | <infinispan-dev at lists.jboss.org>
>> | | >> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
>> | | >> | | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | These were write-through cache stores, right?  And with
>> | | >> | | | LevelDB, this was through to the database file itself?
>> | | >> | | |
>> | | >> | | | Erik
>> | | >> | | |
>> | | >> | | | -----Original Message-----
>> | | >> | | | From: infinispan-dev-bounces at lists.jboss.org
>> | | >> | | | [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf
>> | | >> | | | Of Radim Vansa
>> | | >> | | | Sent: Wednesday, June 26, 2013 11:24 AM
>> | | >> | | | To: infinispan -Dev List
>> | | >> | | | Subject: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | Hi all,
>> | | >> | | |
>> | | >> | | | according to [1] I've created the comparison of
>> | | >> | | | performance in stress-tests.
>> | | >> | | |
>> | | >> | | | All setups used local-cache, benchmark was executed via
>> | | >> | | | Radargun (actually version not merged into master yet
>> | | >> | | | [2]). I've used 4 nodes just to get more data - each
>> | | >> | | | slave was absolutely independent of the others.
>> | | >> | | |
>> | | >> | | | First test was preloading performance - the cache started
>> | | >> | | | and tried to load 1GB of data from harddrive. Without
>> | | >> | | | cachestore the startup takes about 2 - 4 seconds, average
>> | | >> | | | numbers for the cachestores are below:
>> | | >> | | |
>> | | >> | | | FileCacheStore:        9.8 s
>> | | >> | | | KarstenFileCacheStore:  14 s
>> | | >> | | | LevelDB-JAVA impl.:   12.3 s
>> | | >> | | | LevelDB-JNI impl.:    12.9 s
>> | | >> | | |
>> | | >> | | | IMO nothing special, all times seem affordable. We don't
>> | | >> | | | exactly benchmark storing the data into the cachestore, but
>> | | >> | | | here FileCacheStore took about 44 minutes, while Karsten
>> | | >> | | | about 38 seconds, LevelDB-JAVA 4 minutes and LevelDB-JNI 96
>> | | >> | | | seconds. The units are right, it's minutes compared to
>> | | >> | | | seconds. But we all know that FileCacheStore is bloody slow.
>> | | >> | | |
>> | | >> | | | Second test is stress test (5 minutes, preceded by 2 minute
>> | | >> | | | warmup) where each of 10 threads works on 10k entries with
>> | | >> | | | 1kB values (~100 MB in total).
>> | | >> | | | 20 % writes, 80 % reads, as usual. No eviction is
>> | | >> | | | configured, therefore the cache-store works as a
>> | | >> | | | persistent storage only for case of crash.
>> | | >> | | |
>> | | >> | | | FileCacheStore:         3.1M reads/s   112 writes/s  // on one node the performance was only 2.96M reads/s 75 writes/s
>> | | >> | | | KarstenFileCacheStore:  9.2M reads/s  226k writes/s  // yikes!
>> | | >> | | | LevelDB-JAVA impl.:     3.9M reads/s  5100 writes/s
>> | | >> | | | LevelDB-JNI impl.:      6.6M reads/s   14k writes/s  // on one node the performance was 3.9M/8.3k - about half of the others
>> | | >> | | | Without cache store:   15.5M reads/s  4.4M writes/s
>> | | >> | | |
>> | | >> | | | Karsten's implementation pretty much rules here, for two
>> | | >> | | | reasons. First of all, it does not flush the data (it calls
>> | | >> | | | only RandomAccessFile.write()). The other cheat is that it
>> | | >> | | | keeps in memory the keys and the offsets of the data values
>> | | >> | | | in the database file. Therefore, it's definitely the best
>> | | >> | | | choice for this scenario, but it does not allow the
>> | | >> | | | cache-store to scale, especially in cases where the keys
>> | | >> | | | are big and values small. However, this performance boost
>> | | >> | | | is definitely worth checking - I could think of caching the
>> | | >> | | | disk offsets in memory and querying a persistent index only
>> | | >> | | | in case of a missing record, with part of the persistent
>> | | >> | | | index flushed asynchronously (the index can always be
>> | | >> | | | rebuilt during preloading in case of a crash).
>> | | >> | | |
>> | | >> | | | The third test should have tested the scenario with more
>> | | >> | | | data to be stored than memory - therefore, the stressors
>> | | >> | | | operated on 100k entries (~100 MB of data) but eviction was
>> | | >> | | | set to 10k entries (9216 entries ended up in memory after
>> | | >> | | | the test has ended).
>> | | >> | | |
>> | | >> | | | FileCacheStore:            750 reads/s         285 writes/s  // one node had only 524 reads and 213 writes per second
>> | | >> | | | KarstenFileCacheStore:    458k reads/s        137k writes/s
>> | | >> | | | LevelDB-JAVA impl.:        21k reads/s          9k writes/s  // a bit varying performance
>> | | >> | | | LevelDB-JNI impl.:     13k-46k reads/s  6.6k-15.2k writes/s  // the performance varied a lot!
>> | | >> | | |
>> | | >> | | | 100 MB of data is not much, but it takes so long to push
>> | | >> | | | it into FileCacheStore that I won't use more unless we
>> | | >> | | | exclude this loser from the comparison :)
>> | | >> | | |
>> | | >> | | | Radim
>> | | >> | | |
>> | | >> | | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | | | [2] https://github.com/rvansa/radargun/tree/t_keygen
>> | | >> | | |
>> | | >> | | | ---------------------------------------------------------
>> | | >> | | | --
>> | | >> | | | Radim Vansa
>> | | >> | | | Quality Assurance Engineer JBoss Datagrid tel.
>> | | >> | | | +420532294559 ext. 62559
>> | | >> | | |
>> | | >> | | | Red Hat Czech, s.r.o.
>> | | >> | | | Brno, Purkyňova 99/71, PSČ 612 45 Czech Republic
>> | | >> | | |
>> | | >> | | |
>> | | >> | | | _______________________________________________
>> | | >> | | | infinispan-dev mailing list
>> | | >> | | | infinispan-dev at lists.jboss.org
>> | | >> | | | https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> | | >> | | |
>> | | >
>> | | > --
>> | | > Manik Surtani
>> | | > manik at jboss.org
>> | | > twitter.com/maniksurtani
>> | | >
>> | | > Platform Architect, JBoss Data Grid
>> | | > http://red.ht/data-grid
>> | | >
>> | |
>> | |
>> | | --
>> | | Galder Zamarreño
>> | | galder at redhat.com
>> | | twitter.com/galderz
>> | |
>> | | Project Lead, Escalante
>> | | http://escalante.io
>> | |
>> | | Engineer, Infinispan
>> | | http://infinispan.org
>> | |