On Jun 27, 2013, at 2:26 PM, Galder Zamarreño <galder(a)redhat.com> wrote:
On Jun 27, 2013, at 1:58 PM, Galder Zamarreño <galder(a)redhat.com> wrote:
>
> On Jun 27, 2013, at 1:52 PM, Galder Zamarreño <galder(a)redhat.com> wrote:
>
>>
>> On Jun 27, 2013, at 1:25 PM, Manik Surtani <msurtani(a)redhat.com> wrote:
>>
>>> Good work, Radim.
>>>
>>> I presume you're collaborating with Galder on this?
>>
>> Yeah, we're collaborating. We came up with the test plan and cache stores to
test together :).
>>
>>> As for Karsten's FCS implementation, I too have issues with the key set
and value offsets being solely in memory. However I think that could be improved by
storing only a certain number of keys/offsets in memory, and flushing the rest to disk
again into an index file.
>>
>> ^ Karsten's implementation makes this relatively easy to achieve because it
already keeps this mapping in a LinkedHashMap (with a given max entries limit [1]),
assuming removeEldestEntry() is overridden to flush older entries to disk. Some extra logic
would be needed to bring data back from the disk too… but your suggestion below is also
quite interesting...
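
A minimal sketch of the removeEldestEntry() idea, assuming a bounded key -> offset map that hands the eldest mapping to a callback before dropping it. The class name SpillingOffsetIndex and the spill callback are hypothetical, not taken from Karsten's code; a real store would append the spilled mapping to an index file and read it back on a miss.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Bounded key -> file offset index (hypothetical name, illustration only). */
class SpillingOffsetIndex<K> extends LinkedHashMap<K, Long> {
    private final int maxEntries;
    // Hypothetical hook: a real store would append the mapping to an index file here.
    private final BiConsumer<K, Long> spill;

    SpillingOffsetIndex(int maxEntries, BiConsumer<K, Long> spill) {
        super(16, 0.75f, true); // access order, so the eldest entry is the least recently used
        this.maxEntries = maxEntries;
        this.spill = spill;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, Long> eldest) {
        if (size() > maxEntries) {
            spill.accept(eldest.getKey(), eldest.getValue()); // flush before eviction
            return true; // LinkedHashMap then removes the eldest entry
        }
        return false;
    }
}
```

Bringing data back is the missing half: on a get() miss the store would have to consult whatever the spill callback wrote, which this sketch leaves out.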
>>
>>> I believe LevelDB follows a similar design, but I think Karsten's FCS
will perform better than LevelDB since it doesn't attempt to maintain a sorted
structure on disk.
>>
>> ^ In-memory, the structure can optionally be ordered if it's bounded [1],
otherwise it's just a normal map. How would we store it at the disk level? B+ tree
with hashes of keys and then linked lists?
>>
>>> One approach to maintaining keys and offsets in memory could be a
WeakReference that points to the key stored in the in-memory DataContainer. Once evicted
from the DC, then the CacheStore impl would need to fetch the key again from the index
file before looking up the value in the actual store.
>>
>> ^ Hmmm, interesting idea… it has the potential to save memory by not
having to keep that extra data structure in the cache store.
>>
>>> This way we have hot items always in memory, semi-hot items with offsets in
memory and values on disk, and cold items needing to be read off disk entirely (both
offset and value). Also for write-through and write-behind, as long as the item is hot or
warm (key and offset in memory), writing will be pretty fast.
>>
>> My worry about Karsten's impl is actually writing. If you look at the last
performance numbers in [2], which compare force=true and
force=false in Karsten's cache store with LevelDB JNI, you see that
force=false is fastest, then LevelDB JNI, and then force=true. I wonder what kind of
write guarantees LevelDB JNI provides (and the Java version)…
>
> ^ Oh, Radim mentioned this topic already in a previous post. LevelDB JAVA library
seems to provide force=true equivalent logic.
Actually Radim, maybe I misunderstood your comments earlier, but what is this
mmap/FileChannelTable stuff? Here's what I'm seeing:
LevelDBCacheStore implementation calls DB.put without any options [1]. Underneath,
iq80's DbImpl calls the other put with a newly constructed WriteOptions instance,
whose sync parameter is set to false. The default in Fuse's JNI implementation seems
to be sync=false too [2].
So, it might be interesting to re-run the test after changing the line in [1] to pass a new
org.iq80.leveldb.WriteOptions instance with sync=true… WDYT?
Btw, at least in the LevelDB Java implementation, FileChannelLogWriter is the one that
uses this sync parameter, not FileChannelTable. I'm not sure about the relationship
between these two yet, but from what I see in the code, DbImpl uses FileChannelTable
as a tableCache (what the heck is that? no docs) and that doesn't seem to be touched
when an entry is put.
What seems to be touched when putting an entry is the log and the memTable, which is an
in-memory cache of keys to slices; the log is where you write the slice.
^ If my suspicions are correct, then none of the LevelDB implementations (Java or
JNI) is forcing writes, in which case Karsten's cache store could be a winner even in
the write area (really??). Just looked at Google's WriteOptions struct, and the default
value of sync is false there too [1].
So, we definitely need to construct an instance of WriteOptions, pass true as the sync
parameter, and do the put calls with that. Then compare the results…
Cheers,
[1]
https://code.google.com/p/leveldb/source/browse/include/leveldb/options.h...
Cheers,
[1]
http://goo.gl/w6hMw
[2]
http://goo.gl/h2OxG
>
>>
>>> WDYT?
>>
>> [1]
http://goo.gl/rPYp2
>>
>>>
>>> - M
>>>
>>> On 27 Jun 2013, at 10:33, Radim Vansa <rvansa(a)redhat.com> wrote:
>>>
>>>> Oops, by the cache store I mean the previously-superfast
KarstenFileCacheStore implementation.
>>>>
>>>> ----- Original Message -----
>>>> | From: "Radim Vansa" <rvansa(a)redhat.com>
>>>> | To: "infinispan -Dev List"
<infinispan-dev(a)lists.jboss.org>
>>>> | Sent: Thursday, June 27, 2013 11:30:53 AM
>>>> | Subject: Re: [infinispan-dev] Cachestores performance
>>>> |
>>>> | I have added FileChannel.force(false) flushes after all write operations in
>>>> | the cache store, and now the comparison is also updated with these values.
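
For reference, a minimal sketch of what "flush after every write" looks like with plain java.nio. This is not the actual cache store code; the class name ForcedWriter and its constructor flag are made up for illustration. FileChannel.force(false) asks for the file content to be written to the storage device, while force(true) also includes file metadata.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Appends records and forces each one to disk (hypothetical helper, illustration only). */
class ForcedWriter implements AutoCloseable {
    private final FileChannel channel;
    private final boolean forceMetadata; // false = content only, true = content and metadata

    ForcedWriter(Path file, boolean forceMetadata) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.forceMetadata = forceMetadata;
    }

    /** Writes the whole record, forces it, and returns the file size afterwards. */
    long append(byte[] record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(record);
        while (buf.hasRemaining()) {
            channel.write(buf); // a single write() may be partial, so loop
        }
        channel.force(forceMetadata); // blocks until the OS reports the data as written
        return channel.size();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Skipping the force() call entirely corresponds to the unflushed numbers reported earlier in the thread, which is why the two variants are not directly comparable.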
>>>> |
>>>> | Radim
>>>> |
>>>> | ----- Original Message -----
>>>> | | From: "Radim Vansa" <rvansa(a)redhat.com>
>>>> | | To: "infinispan -Dev List"
<infinispan-dev(a)lists.jboss.org>
>>>> | | Sent: Thursday, June 27, 2013 8:54:25 AM
>>>> | | Subject: Re: [infinispan-dev] Cachestores performance
>>>> | |
>>>> | | Yep, write-through. LevelDB JAVA used the FileChannelTable implementation
>>>> | | (-Dleveldb.mmap), because mmapping is not implemented very well and causes
>>>> | | JVM crashes (I believe it's because of calling non-public API via
>>>> | | reflection - I've found a post from the Oracle JVM guys discouraging the
>>>> | | particular trick it uses). After writing the record to the log, it calls
>>>> | | FileChannel.force(true), therefore it should really be on the disk by that
>>>> | | moment.
>>>> | | I have not looked into the JNI implementation but I expect the same.
>>>> | |
>>>> | | By the way, I have updated [1] with numbers from running on more data (2 GB
>>>> | | instead of 100 MB). I won't retype them here, so look there. The performance
>>>> | | is much lower.
>>>> | | I may also try increasing the JVM heap size and testing with a bit more data.
>>>> | |
>>>> | | Radim
>>>> | |
>>>> | | [1]
https://community.jboss.org/wiki/FileCacheStoreRedesign
>>>> | |
>>>> | | ----- Original Message -----
>>>> | | | From: "Erik Salter" <an1310(a)hotmail.com>
>>>> | | | To: "infinispan -Dev List"
<infinispan-dev(a)lists.jboss.org>
>>>> | | | Sent: Wednesday, June 26, 2013 7:40:19 PM
>>>> | | | Subject: Re: [infinispan-dev] Cachestores performance
>>>> | | |
>>>> | | | These were write-through cache stores, right? And with LevelDB, this was
>>>> | | | through to the database file itself?
>>>> | | |
>>>> | | | Erik
>>>> | | |
>>>> | | | -----Original Message-----
>>>> | | | From: infinispan-dev-bounces(a)lists.jboss.org
>>>> | | | [mailto:infinispan-dev-bounces@lists.jboss.org] On Behalf Of Radim
Vansa
>>>> | | | Sent: Wednesday, June 26, 2013 11:24 AM
>>>> | | | To: infinispan -Dev List
>>>> | | | Subject: [infinispan-dev] Cachestores performance
>>>> | | |
>>>> | | | Hi all,
>>>> | | |
>>>> | | | following [1], I've created a performance comparison based on
>>>> | | | stress tests.
>>>> | | |
>>>> | | | All setups used a local cache; the benchmark was executed via Radargun
>>>> | | | (actually a version not merged into master yet [2]). I used 4 nodes just to
>>>> | | | get more data - each slave was completely independent of the others.
>>>> | | |
>>>> | | | The first test measured preloading performance - the cache started and tried to
>>>> | | | load 1 GB of data from the hard drive. Without a cache store the startup takes
>>>> | | | about 2 - 4 seconds; average numbers for the cache stores are below:
>>>> | | |
>>>> | | | FileCacheStore:        9.8 s
>>>> | | | KarstenFileCacheStore: 14 s
>>>> | | | LevelDB-JAVA impl.:    12.3 s
>>>> | | | LevelDB-JNI impl.:     12.9 s
>>>> | | |
>>>> | | | IMO nothing special, all times seem affordable. We didn't benchmark
>>>> | | | storing the data into the cache store precisely, but FileCacheStore took about 44
>>>> | | | minutes, Karsten's about 38 seconds, LevelDB-JAVA 4 minutes and
>>>> | | | LevelDB-JNI 96 seconds. The units are right: it's minutes compared to
>>>> | | | seconds. But we all know that FileCacheStore is bloody slow.
>>>> | | |
>>>> | | | The second test is a stress test (5 minutes, preceded by a 2 minute warmup) where
>>>> | | | each of 10 threads works on 10k entries with 1 kB values (~100 MB in total).
>>>> | | | 20 % writes, 80 % reads, as usual. No eviction is configured, therefore the
>>>> | | | cache store works as persistent storage only in case of a crash.
>>>> | | |
>>>> | | | FileCacheStore:        3.1M reads/s   112 writes/s   // on one node the performance was only 2.96M reads/s, 75 writes/s
>>>> | | | KarstenFileCacheStore: 9.2M reads/s   226k writes/s  // yikes!
>>>> | | | LevelDB-JAVA impl.:    3.9M reads/s   5100 writes/s
>>>> | | | LevelDB-JNI impl.:     6.6M reads/s   14k writes/s   // on one node the performance was 3.9M/8.3k - about half of the others
>>>> | | | Without cache store:   15.5M reads/s  4.4M writes/s
>>>> | | |
>>>> | | | Karsten's implementation pretty much rules here, for two reasons. First of all,
>>>> | | | it does not flush the data (it calls only RandomAccessFile.write()). The other
>>>> | | | cheat is that it keeps in memory the keys and the offsets of data values in the
>>>> | | | database file. Therefore, it's definitely the best choice for this
>>>> | | | scenario, but it does not allow the cache store to scale, especially in cases where
>>>> | | | the keys are big and values small. However, this performance boost is
>>>> | | | definitely worth checking - I could think of caching the disk offsets in
>>>> | | | memory and querying a persistent index only in case of a missing record, with
>>>> | | | part of the persistent index flushed asynchronously (the index can always be
>>>> | | | rebuilt during preloading in case of a crash).
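
To make the "keys and offsets in memory, values on disk" technique concrete, here is a minimal single-threaded sketch. It is not Karsten's implementation: the class name OffsetIndexedStore and the record layout (key via writeUTF, then an int length, then the value bytes) are assumptions chosen for brevity. Like the store described above, it only calls RandomAccessFile write methods and never forces data to disk; the index is rebuilt by scanning the data file on open.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/** Append-only data file with an in-memory key -> offset index (illustration only). */
class OffsetIndexedStore implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsets = new HashMap<>();

    OffsetIndexedStore(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        rebuildIndex(); // "preloading": scan the file to recover the index
    }

    void put(String key, byte[] value) throws IOException {
        long offset = file.length();
        file.seek(offset);          // always append; older records become garbage
        file.writeUTF(key);
        file.writeInt(value.length);
        file.write(value);          // no force(): data may still sit in the OS cache
        offsets.put(key, offset);
    }

    byte[] get(String key) throws IOException {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;            // not in the index, so not in the store
        }
        file.seek(offset);
        file.readUTF();             // skip the stored key
        byte[] value = new byte[file.readInt()];
        file.readFully(value);
        return value;
    }

    private void rebuildIndex() throws IOException {
        long end = file.length();
        long pos = 0;
        file.seek(0);
        while (pos < end) {
            String key = file.readUTF();
            int length = file.readInt();
            file.seek(file.getFilePointer() + length); // skip the value bytes
            offsets.put(key, pos);  // a later record for the same key wins
            pos = file.getFilePointer();
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Every key stays in the heap here, which is exactly the scaling concern raised above when keys are large and values small.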
>>>> | | |
>>>> | | | The third test was meant to cover the scenario with more data to be stored
>>>> | | | than fits in memory - therefore, the stressors operated on 100k entries (~100 MB
>>>> | | | of data) but eviction was set to 10k entries (9216 entries ended up in memory
>>>> | | | after the test had ended).
>>>> | | |
>>>> | | | FileCacheStore:        750 reads/s       285 writes/s          // one node had only 524 reads and 213 writes per second
>>>> | | | KarstenFileCacheStore: 458k reads/s      137k writes/s
>>>> | | | LevelDB-JAVA impl.:    21k reads/s       9k writes/s           // a bit varying performance
>>>> | | | LevelDB-JNI impl.:     13k-46k reads/s   6.6k-15.2k writes/s   // the performance varied a lot!
>>>> | | |
>>>> | | | 100 MB of data is not much, but it takes so long to push it into
>>>> | | | FileCacheStore that I won't use more unless we exclude this loser from the
>>>> | | | comparison :)
>>>> | | |
>>>> | | | Radim
>>>> | | |
>>>> | | | [1]
https://community.jboss.org/wiki/FileCacheStoreRedesign
>>>> | | | [2]
https://github.com/rvansa/radargun/tree/t_keygen
>>>> | | |
>>>> | | | -----------------------------------------------------------
>>>> | | | Radim Vansa
>>>> | | | Quality Assurance Engineer
>>>> | | | JBoss Datagrid
>>>> | | | tel. +420532294559 ext. 62559
>>>> | | |
>>>> | | | Red Hat Czech, s.r.o.
>>>> | | | Brno, Purkyňova 99/71, PSČ 612 45
>>>> | | | Czech Republic
>>>> | | |
>>>> | | |
>>>> | | | _______________________________________________
>>>> | | | infinispan-dev mailing list
>>>> | | | infinispan-dev(a)lists.jboss.org
>>>> | | |
https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>> | | |
>>>> | |
>>>> |
>>>>
>>>
>>> --
>>> Manik Surtani
>>> manik(a)jboss.org
>>>
twitter.com/maniksurtani
>>>
>>> Platform Architect, JBoss Data Grid
>>>
http://red.ht/data-grid
>>>
>>>
>>
>>
>> --
>> Galder Zamarreño
>> galder(a)redhat.com
>>
twitter.com/galderz
>>
>> Project Lead, Escalante
>>
http://escalante.io
>>
>> Engineer, Infinispan
>>
http://infinispan.org
>>
>>
>
>
--
Galder Zamarreño
galder(a)redhat.com
twitter.com/galderz
Project Lead, Escalante
http://escalante.io
Engineer, Infinispan
http://infinispan.org