Hi Galder,
you make many interesting points, but I am not interested in discussing
my specific design ideas in detail; I just sketched them quickly as an
example of a requirements description.
My intervention in this thread is all about demanding to know what the use
case is for such a synchronous cachestore.
It is my understanding that lots of people are currently discussing
how best to make a synchronous cachestore reasonably efficient, but I have
yet to see what the requirements are. My use case - although
expecting an asynchronous one - is just an example of how I would like
to see the general architecture described first, before we waste time
on a useless component.
I am not stating that nobody wants a strictly synchronous CacheStore,
but I would like to challenge the arguments that made someone (who?)
ask for this, as I believe there are many other things that would need
to be addressed.
Therefore, my suggestion for describing the main use cases we expect
to support is not off-topic at all: it's the first thing any engineer
would have requested, and I would not spend a minute more of our
engineers' time on coding without a clear description of the expected
architecture, expected reliability, and expected operations to be used.
Sanne
On 3 July 2013 14:59, Galder Zamarreño <galder(a)redhat.com> wrote:
Sanne, let me comment on some of the points you raised that I
didn't comment on in an earlier reply...
On Jul 2, 2013, at 2:47 PM, Sanne Grinovero <sanne(a)infinispan.org> wrote:
> It would be nice to have a deck of "cheat sheets" on the expected use
> cases and guarantees: to me it looks like everyone is trying to solve
> a different problem / having a different problem in mind.
>
> My own take on it:
>
> Scenario #1
> I'll primarily use Infinispan with DIST, and I don't care much for
> other options. Reliability is guaranteed via numOwners>1, NOT by
> persisting to disk: if a node fails, I kill the VM (the machine, not
> the Java process) and start new ones to compensate: I'm assuming cloud
> nodes, so it's likely that when a failed node is gone, the disk is
> gone as well, with all the carefully stored data.
> I will use Infinispan primarily to absorb write spikes - so a "synch
> flush" is no good for me - and to boost read performance by as much
> memory as I can throw at it.
> CacheStore is used for two reasons:
> - overflow (LIRS+passivation) for when the memory is not enough
> - clean shutdown: you can think of it as a way to be able to upgrade
> some component in the system (Infinispan or my own); I would expect
> some kind of "JMX flush" operation to do a clean shutdown without data
> loss.
^ Should this really be implemented at the Infinispan level? In the AS/EAP/Wildfly case,
they take care that all transactions have finished before shutting down, and Infinispan
benefits from that.
> Given such a scenario, I am not interested at all in synchronous
> storage. Before we commit into a design which is basically assuming
> the need for synchronous storage guarantees, I'd like to understand
> what kind of use case it's aiming to solve.
Sanne, any **strict** synchronous storage guarantees (e.g. whether to force or not) will be
configurable and most likely disabled by default, just like in LevelDB JNI or
Karsten's file cache store. A case where someone might want to enable this
is when they just have a local cache and want to persist data for recovery. Of course, the
whole node and the disk could die…, but this is not so far-fetched IMO.
The whole discussion about **strict** synchronous storage guarantees in this thread is to
make sure we're comparing apples with apples. IOW, it doesn't make sense to
compare performance when each has different **strict** synchronous storage guarantee
settings.
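
To make the "configurable guarantee" point concrete, here is a minimal sketch of a writer whose durability level is a setting. It is only an illustration: SyncMode and SyncConfigurableWriter are made-up names, not Infinispan or LevelDB APIs. The only real calls are FileChannel.write() and FileChannel.force(boolean), where force(false) flushes file content and force(true) also flushes file metadata.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// All names here are hypothetical; this is not an Infinispan API.
enum SyncMode { NONE, DATA, DATA_AND_METADATA }

class SyncConfigurableWriter implements AutoCloseable {
    private final FileChannel channel;
    private final SyncMode mode;

    SyncConfigurableWriter(Path file, SyncMode mode) {
        try {
            this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.mode = mode;
    }

    // Appends one record and returns the offset at which it starts.
    long append(byte[] record) {
        try {
            long offset = channel.size();
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            if (mode == SyncMode.DATA) {
                channel.force(false);      // flush content only
            } else if (mode == SyncMode.DATA_AND_METADATA) {
                channel.force(true);       // flush content and file metadata
            }
            // SyncMode.NONE: leave flushing to the operating system.
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two stores are only comparable in a benchmark when they run with the same mode, which is the apples-with-apples point above.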
> It would be great to document each such use case and put down a table
> of things which can be expected, which features should not be expected
> (be very explicit on the limitations), and how basic operations are
> expected to be performed in the scenario: like how do you do a rolling
> upgrade in Scenario #1? How do you do a backup? And of course some
> configurations & code examples.
^ Hmmm, these operations are not really specific to the file cache store per-se. They are
valid points, for sure, but out of the scope of this IMO.
> Only then we would be able to pick a design (or multiple ones); for my
> use case the proposal from Karsten seems excellent, so I'm wondering
> why I should be looking for alternatives, and wondering why everyone
> is still wasting time on different discussions :-D
>
> I'm pretty sure there are people looking forward to a synch-CacheStore
> too: if you could nail down such a scenario however I'm pretty sure
> that some other considerations would not be taken into account (like
> consistency of data when reactivating a dormant node), so I suspect
> that just implementing such a component would actually not make any
> new architecture possible, as you would get blocked by other problems
> which need to be solved too.. better define all expectations asap!
>
> To me this thread smells of needing the off-heap Direct Memory buffers
> which I suggested [long time ago] to efficiently offload internal
> buffers,
^ Hmmm, the point of having a file-based cache store is to provide data survival beyond shutting
down a machine or it crashing (assuming no disk failure). So, I can't see how these
off-heap memory buffers help here? Unless you've got them mapped to a file or something
else?
> but failing to recognise this we're pushing responsibility to
> an epic level complex CacheStore.. guys let's not forget that a major
> bottleneck of CacheStores today is the SPI it has to implement, we
> identified several limitations in the contract in the past which
> prevent a superior efficiency: we're working towards a major release
> now so I'd rather focus on the API changes which will make it possible
> to get decent performance even without changing any storage engine..
If you haven't already done so, the place to suggest/comment on this is for sure [1].
[1] https://community.jboss.org/wiki/CacheLoaderAndCacheStoreSPIRedesign
> I'm pretty sure Cassandra (to pick one) doesn't scale too bad.
^ Requires a separate process and is much more complex to set up. Not really what we're
looking for, which is a simple local cache store that you can use, for example, for passivating EJB3
SFSBs or HTTP sessions.
>
> Cheers,
> Sanne
>
>
>
> On 2 July 2013 10:09, Radim Vansa <rvansa(a)redhat.com> wrote:
>> Hi,
>>
>> I've written down this proposal for the implementation of new cache store.
>>
>>
>> https://community.jboss.org/wiki/BrnoCacheStoreDesignProposal
>>
>> WDYT?
>>
>> Radim
>>
>> ----- Original Message -----
>> | From: "Radim Vansa" <rvansa(a)redhat.com>
>> | To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
>> | Sent: Thursday, June 27, 2013 2:37:43 PM
>> | Subject: Re: [infinispan-dev] Cachestores performance
>> |
>> |
>> |
>> | ----- Original Message -----
>> | | From: "Galder Zamarreño" <galder(a)redhat.com>
>> | | To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
>> | | Sent: Thursday, June 27, 2013 1:52:11 PM
>> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | |
>> | | > As for Karsten's FCS implementation, I too have issues with the key set
>> | | > and value offsets being solely in memory. However I think that could be
>> | | > improved by storing only a certain number of keys/offsets in memory, and
>> | | > flushing the rest to disk again into an index file.
>> | |
>> | | ^ Karsten's implementation makes this relatively easy to achieve because it
>> | | already keeps this mapping in a LinkedHashMap (with a given max entries
>> | | limit [1]) assuming removeEldestEntry() is overridden to flush older entries
>> | | to disk. Some extra logic would be needed to bring back data from the disk
>> | | too… but your suggestion below is also quite interesting...
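
As a rough illustration of the removeEldestEntry() idea above, here is a sketch only, not Karsten's code. The class name SpillingOffsetIndex, the lookup() helper, and the in-memory map standing in for the on-disk index file are all made up for the example; LinkedHashMap and removeEldestEntry() are the only real API used.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded key -> file-offset index that spills its
// eldest entry instead of silently dropping it.
class SpillingOffsetIndex<K> extends LinkedHashMap<K, Long> {
    private final int maxEntries;
    // Stand-in for an on-disk index file; a real store would write to disk here.
    private final Map<K, Long> spilled = new HashMap<>();

    SpillingOffsetIndex(int maxEntries) {
        // accessOrder = true: the eldest entry is the least recently used one.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, Long> eldest) {
        if (size() > maxEntries) {
            spilled.put(eldest.getKey(), eldest.getValue()); // "flush to disk"
            return true; // let LinkedHashMap drop it from memory
        }
        return false;
    }

    // The "extra logic" for bringing data back: fall back to the spilled index
    // and promote the entry into memory again (which may spill another one).
    Long lookup(K key) {
        Long offset = get(key);
        if (offset == null) {
            offset = spilled.remove(key);
            if (offset != null) {
                put(key, offset);
            }
        }
        return offset;
    }

    int spilledCount() {
        return spilled.size();
    }
}
```

The hard part raised in the reply below, a searchable structure that really lives on disk, is exactly what the HashMap stand-in glosses over.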
>> |
>> | I certainly wouldn't call this an easy task, because the most problematic part
>> | is what we will do when the whole entry (both key and value) is gone from
>> | memory and we want to read it - that requires keeping some searchable
>> | structure on-disk. And that's the hard stuff.
>> |
>> | |
>> | | > I believe LevelDB follows a similar design, but I think Karsten's FCS
>> | | > will perform better than LevelDB since it doesn't attempt to maintain a
>> | | > sorted structure on disk.
>> | |
>> | | ^ In-memory, the structure can optionally be ordered if it's bounded [1],
>> | | otherwise it's just a normal map. How would we store it at the disk level?
>> | | B+ tree with hashes of keys and then linked lists?
>> |
>> | Before choosing "I love B#@& trees, let's use B#@& trees!", I'd find out what
>> | requirements we have for the structure. I believe that the index itself
>> | should not be considered persistent, as it can be rebuilt when preloading
>> | the data (sequentially reading the data is fast, therefore we can afford to do
>> | this indexing preload); the reason for the index being on-disk is that we
>> | don't have enough memory to store all keys, or even key hashes. Therefore it
>> | does not have to be updated synchronously with the writes. It should be
>> | mostly read-optimized then, because that's the thing where we need
>> | synchronous access to this structure.
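
A small sketch of why the index need not be persistent: if every record carries its own lengths, one sequential pass over the data is enough to rebuild a key -> offset map. The record layout ([keyLen:int][valueLen:int][key][value]) and the IndexRebuilder name are assumptions made for this example, not the format of any of the stores discussed here. A ByteBuffer stands in for the data file; it could equally be a memory-mapped file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the record layout is an assumption for illustration.
class IndexRebuilder {

    // Appends one record: [keyLen:int][valueLen:int][key bytes][value bytes].
    static void writeRecord(ByteBuffer log, String key, byte[] value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        log.putInt(k.length).putInt(value.length).put(k).put(value);
    }

    // Scans the log front to back and maps each key to the offset of its
    // latest record. Expects a buffer ready for reading (already flipped).
    static Map<String, Integer> rebuild(ByteBuffer log) {
        Map<String, Integer> index = new HashMap<>();
        ByteBuffer in = log.duplicate(); // leave the caller's position untouched
        while (in.remaining() >= 8) {
            int offset = in.position();
            int keyLen = in.getInt();
            int valueLen = in.getInt();
            byte[] k = new byte[keyLen];
            in.get(k);
            in.position(in.position() + valueLen); // skip the value, no copy
            // A later record for the same key overwrites the earlier offset.
            index.put(new String(k, StandardCharsets.UTF_8), offset);
        }
        return index;
    }
}
```

Values are skipped rather than read, so the cost of the rebuild is dominated by the sequential scan itself.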
>> |
>> | |
>> | | > One approach to maintaining keys and offsets in memory could be a
>> | | > WeakReference that points to the key stored in the in-memory
>> | | > DataContainer. Once evicted from the DC, then the CacheStore impl would
>> | | > need to fetch the key again from the index file before looking up the
>> | | > value in the actual store.
>> | |
>> | | ^ Hmmm, interesting idea… has the potential to save memory space by not
>> | | having to keep that extra data structure in the cache store.
>> |
>> | You mean to mix the DataContainer with xCacheEntry implementation and the
>> | cache store implementation? Is that possible from a design perspective?
>> | Speaking about different kinds of references, we may even optimize
>> | not-well-tuned eviction by SoftReferences, so that even if the entry was
>> | evicted from main DataContainer, we'd keep the value referenced from the
>> | cache-store (and this does not have to be loaded from disk if referenced
>> | before garbage collection). But such a thought may be premature optimization.
>> | For having eviction managed in relation with GC we should rather combine
>> | this with PhantomReferences, where entries would be written to cache upon
>> | finalization.
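
The SoftReference part of this idea could look roughly like the sketch below. SoftValueCache, onEvicted() and the diskLoader function are hypothetical names for illustration; nothing here is Infinispan API, and the PhantomReference variant is not covered.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch: the cache store keeps evicted values softly reachable,
// so a read shortly after eviction can skip the disk.
class SoftValueCache<K, V> {
    private final ConcurrentMap<K, SoftReference<V>> refs = new ConcurrentHashMap<>();
    private final Function<K, V> diskLoader; // stand-in for the real disk read

    SoftValueCache(Function<K, V> diskLoader) {
        this.diskLoader = diskLoader;
    }

    // Called when the main container evicts an entry.
    void onEvicted(K key, V value) {
        refs.put(key, new SoftReference<>(value));
    }

    // Returns the softly held value if the GC has not cleared it yet,
    // otherwise falls back to loading from disk.
    V load(K key) {
        SoftReference<V> ref = refs.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            if (ref != null) {
                refs.remove(key, ref); // drop the cleared reference
            }
            value = diskLoader.apply(key);
        }
        return value;
    }
}
```

Whether this pays off depends on GC behaviour under memory pressure, which fits the "premature optimization" caveat above.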
>> |
>> | |
>> | | > This way we have hot items always in memory, semi-hot items with offsets
>> | | > in memory and values on disk, and cold items needing to be read off disk
>> | | > entirely (both offset and value). Also for write-through and
>> | | > write-behind, as long as the item is hot or warm (key and offset in
>> | | > memory), writing will be pretty fast.
>> | |
>> | | My worry about Karsten's impl is writing actually. If you look at the last
>> | | performance numbers in [2], where we see the performance difference of
>> | | force=true and force=false in Karsten's cache store compared with LevelDB
>> | | JNI, you see that force=false is fastest, then JNI LevelDB, and then
>> | | force=true. Me wonders what kind of write guarantees LevelDB JNI provides
>> | | (and the JAVA version)...
>> |
>> | Just for clarification: the fast implementation is without force at all, the
>> | slower is with force(false). Force(true) means updating metadata (such as
>> | access times?) which is not required for cache-store.
>> | But the numbers suggest that the random access with syncing is really not a
>> | good option, and that we should rather use the temporary append-only log,
>> | which would be persisted into structured DB by different thread (as LevelDB
>> | does, I suppose).
>> |
>> | Thinking about all the levels and cache structures optimizing the read
>> | access, I can see four levels of search structures: key + value (usual
>> | DataContainer), key + offset, hash + offset, all on disk. The "hash +
>> | offset" may seem superfluous but for some use-cases with big keys it may be
>> | worth sparing a few disk look-ups.
>> |
>> | Radim
>> |
>> | | >
>> | | > On 27 Jun 2013, at 10:33, Radim Vansa <rvansa(a)redhat.com> wrote:
>> | | >
>> | | >> Oops, by the cache store I mean the previously-superfast
>> | | >> KarstenFileCacheStore implementation.
>> | | >>
>> | | >> ----- Original Message -----
>> | | >> | From: "Radim Vansa" <rvansa(a)redhat.com>
>> | | >> | To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
>> | | >> | Sent: Thursday, June 27, 2013 11:30:53 AM
>> | | >> | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> |
>> | | >> | I have added FileChannel.force(false) flushes after all write
>> | | >> | operations in the cache store, and now the comparison is also
>> | | >> | updated with these values.
>> | | >> |
>> | | >> | Radim
>> | | >> |
>> | | >> | ----- Original Message -----
>> | | >> | | From: "Radim Vansa" <rvansa(a)redhat.com>
>> | | >> | | To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
>> | | >> | | Sent: Thursday, June 27, 2013 8:54:25 AM
>> | | >> | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | |
>> | | >> | | Yep, write-through. LevelDB JAVA used the FileChannelTable
>> | | >> | | implementation (-Dleveldb.mmap), because Mmaping is not implemented
>> | | >> | | very well and causes JVM crashes (I believe it's because of calling
>> | | >> | | non-public API via reflection - I've found a post from the Oracle JVM
>> | | >> | | guys discouraging the particular trick it uses). After writing the
>> | | >> | | record to the log, it calls FileChannel.force(true), therefore, it
>> | | >> | | should really be on the disc by that moment.
>> | | >> | | I have not looked into the JNI implementation but I expect the same.
>> | | >> | |
>> | | >> | | By the way, I have updated [1] with numbers when running on more
>> | | >> | | data (2 GB instead of 100 MB). I won't retype it here, so look
>> | | >> | | there. The performance is much lower.
>> | | >> | | I may also try to increase the JVM heap size and try with a bit
>> | | >> | | more data yet.
>> | | >> | |
>> | | >> | | Radim
>> | | >> | |
>> | | >> | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | |
>> | | >> | | ----- Original Message -----
>> | | >> | | | From: "Erik Salter" <an1310(a)hotmail.com>
>> | | >> | | | To: "infinispan -Dev List" <infinispan-dev(a)lists.jboss.org>
>> | | >> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
>> | | >> | | | Subject: Re: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | These were write-through cache stores, right? And with LevelDB,
>> | | >> | | | this was through to the database file itself?
>> | | >> | | |
>> | | >> | | | Erik
>> | | >> | | |
>> | | >> | | | -----Original Message-----
>> | | >> | | | From: infinispan-dev-bounces(a)lists.jboss.org
>> | | >> | | | [mailto:infinispan-dev-bounces@lists.jboss.org] On Behalf Of Radim Vansa
>> | | >> | | | Sent: Wednesday, June 26, 2013 11:24 AM
>> | | >> | | | To: infinispan -Dev List
>> | | >> | | | Subject: [infinispan-dev] Cachestores performance
>> | | >> | | |
>> | | >> | | | Hi all,
>> | | >> | | |
>> | | >> | | | according to [1] I've created the comparison of performance in
>> | | >> | | | stress-tests.
>> | | >> | | |
>> | | >> | | | All setups used local-cache, benchmark was executed via Radargun
>> | | >> | | | (actually version not merged into master yet [2]). I've used 4 nodes
>> | | >> | | | just to get more data - each slave was absolutely independent of the
>> | | >> | | | others.
>> | | >> | | |
>> | | >> | | | First test was preloading performance - the cache started and tried
>> | | >> | | | to load 1GB of data from harddrive. Without cachestore the startup
>> | | >> | | | takes about 2 - 4 seconds, average numbers for the cachestores are
>> | | >> | | | below:
>> | | >> | | |
>> | | >> | | | FileCacheStore:         9.8 s
>> | | >> | | | KarstenFileCacheStore:  14 s
>> | | >> | | | LevelDB-JAVA impl.:     12.3 s
>> | | >> | | | LevelDB-JNI impl.:      12.9 s
>> | | >> | | |
>> | | >> | | | IMO nothing special, all times seem affordable. We don't benchmark
>> | | >> | | | exactly storing the data into the cachestore, here FileCacheStore
>> | | >> | | | took about 44 minutes, while Karsten about 38 seconds, LevelDB-JAVA
>> | | >> | | | 4 minutes and LevelDB-JNI 96 seconds. The units are right, it's
>> | | >> | | | minutes compared to seconds. But we all know that FileCacheStore is
>> | | >> | | | bloody slow.
>> | | >> | | |
>> | | >> | | | Second test is stress test (5 minutes, preceded by 2 minute warmup)
>> | | >> | | | where each of 10 threads works on 10k entries with 1kB values
>> | | >> | | | (~100 MB in total). 20 % writes, 80 % reads, as usual. No eviction
>> | | >> | | | is configured, therefore the cache-store works as a persistent
>> | | >> | | | storage only for case of crash.
>> | | >> | | |
>> | | >> | | | FileCacheStore:         3.1M reads/s    112 writes/s   // on one node the performance was only 2.96M reads/s 75 writes/s
>> | | >> | | | KarstenFileCacheStore:  9.2M reads/s    226k writes/s  // yikes!
>> | | >> | | | LevelDB-JAVA impl.:     3.9M reads/s    5100 writes/s
>> | | >> | | | LevelDB-JNI impl.:      6.6M reads/s    14k writes/s   // on one node the performance was 3.9M/8.3k - about half of the others
>> | | >> | | | Without cache store:    15.5M reads/s   4.4M writes/s
>> | | >> | | |
>> | | >> | | | Karsten implementation pretty rules here for two reasons. First of
>> | | >> | | | all, it does not flush the data (it calls only
>> | | >> | | | RandomAccessFile.write()). Other cheat is that it stores in-memory
>> | | >> | | | the keys and offsets of data values in the database file.
>> | | >> | | | Therefore, it's definitely the best choice for this scenario, but it
>> | | >> | | | does not allow to scale the cache-store, especially in cases where
>> | | >> | | | the keys are big and values small. However, this performance boost
>> | | >> | | | is definitely worth checking - I could think of caching the disk
>> | | >> | | | offsets in memory and querying persistent index only in case of
>> | | >> | | | missing record, with part of the persistent index flushed
>> | | >> | | | asynchronously (the index can be always rebuilt during the
>> | | >> | | | preloading for case of crash).
>> | | >> | | |
>> | | >> | | | The third test should have tested the scenario with more data to be
>> | | >> | | | stored than memory - therefore, the stressors operated on 100k
>> | | >> | | | entries (~100 MB of data) but eviction was set to 10k entries (9216
>> | | >> | | | entries ended up in memory after the test has ended).
>> | | >> | | |
>> | | >> | | | FileCacheStore:         750 reads/s       285 writes/s         // one node had only 524 reads and 213 writes per second
>> | | >> | | | KarstenFileCacheStore:  458k reads/s      137k writes/s
>> | | >> | | | LevelDB-JAVA impl.:     21k reads/s       9k writes/s          // a bit varying performance
>> | | >> | | | LevelDB-JNI impl.:      13k-46k reads/s   6.6k-15.2k writes/s  // the performance varied a lot!
>> | | >> | | |
>> | | >> | | | 100 MB of data is not much, but it takes so long to push it into
>> | | >> | | | FileCacheStore that I won't use more unless we exclude this loser
>> | | >> | | | from the comparison :)
>> | | >> | | |
>> | | >> | | | Radim
>> | | >> | | |
>> | | >> | | | [1] https://community.jboss.org/wiki/FileCacheStoreRedesign
>> | | >> | | | [2] https://github.com/rvansa/radargun/tree/t_keygen
>> | | >> | | |
>> | | >> | | | -----------------------------------------------------------
>> | | >> | | | Radim Vansa
>> | | >> | | | Quality Assurance Engineer
>> | | >> | | | JBoss Datagrid
>> | | >> | | | tel. +420532294559 ext. 62559
>> | | >> | | |
>> | | >> | | | Red Hat Czech, s.r.o.
>> | | >> | | | Brno, Purkyňova 99/71, PSČ 612 45
>> | | >> | | | Czech Republic
>> | | >> | | |
>> | | >> | | |
>> | | >> | | | _______________________________________________
>> | | >> | | | infinispan-dev mailing list
>> | | >> | | | infinispan-dev(a)lists.jboss.org
>> | | >> | | | https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> | | >
>> | | > --
>> | | > Manik Surtani
>> | | > manik(a)jboss.org
>> | | > twitter.com/maniksurtani
>> | | >
>> | | > Platform Architect, JBoss Data Grid
>> | | > http://red.ht/data-grid
>> | | >
>> | | >
>> | |
>> | |
>> | | --
>> | | Galder Zamarreño
>> | | galder(a)redhat.com
>> | | twitter.com/galderz
>> | |
>> | | Project Lead, Escalante
>> | | http://escalante.io
>> | |
>> | | Engineer, Infinispan
>> | | http://infinispan.org
>> | |
>> | |
--
Galder Zamarreño
galder(a)redhat.com
twitter.com/galderz
Project Lead, Escalante
http://escalante.io
Engineer, Infinispan
http://infinispan.org