[infinispan-dev] Findings from the cache store corruption issue - ISPN-575

Fri Aug 6 16:58:17 EDT 2010

Hi Galder,
thanks, some answers are inline:

2010/8/6 Galder Zamarreño <galder at redhat.com>:
> Hi Sanne,
>
> I've looked at the CacheStoreStressTest in https://jira.jboss.org/browse/ISPN-575 and there's something I haven't understood:
>
> In the test, there appears to be a single thread which is a writer and the rest are reading.

Not totally relevant, but there are two threads writing to the index
directly, and they share a third thread - which is started by Lucene -
doing some background writing too (optimizes the structure).

> Now, when looking to SingleChunkIndexInput constructor, it appears that you're skipping locking for reading the chunk via:
>
>      buffer = (byte[]) cache.withFlags(Flag.SKIP_LOCKING).get(key);
>
> If you skip locking, how do you guarantee that you won't be reading data that's in the process of being updated? Is there some other locking strategy used to guarantee correct reads? However, if this was the cause of the issue, I'd imagine it'd fail when no cache store is present as well, wouldn't it?

As you say, if the design were wrong I'd expect the failures to happen
even without any store configured, but I've stress-tested that
scenario a lot already and no issues were found. I initially
introduced the SKIP_LOCKING flag because, for some reason I have yet
to understand, I was getting lock timeout exceptions on the get()s -
Manik mentioned the L1 in this regard.
The reason I don't need the locks is that by Lucene's design all
segments are immutable: until a writer is done with a segment, no one
else is able to read it. When any thread is reading a segment, you're
guaranteed that no more writes are going to happen on that segment.
The only tricky part is that segments might be deleted while others
are attempting to read them, but we handle that.
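
To illustrate the idea, here is a minimal sketch, not the actual Lucene
directory code: the class and method names are hypothetical, and a
ConcurrentHashMap stands in for the Infinispan cache. A segment is
published once, never modified afterwards, and readers only have to cope
with it having been deleted.

```java
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration of write-once segments read without locks.
// The ConcurrentHashMap is an assumption standing in for the cache.
public class ImmutableSegmentStore {

    private final ConcurrentMap<String, byte[]> chunks = new ConcurrentHashMap<>();

    // Called by the writer only when the segment is complete.
    // A segment name is never reused, so overwriting is rejected.
    public void publish(String name, byte[] data) {
        byte[] copy = Arrays.copyOf(data, data.length);
        if (chunks.putIfAbsent(name, copy) != null) {
            throw new IllegalStateException("segment already published: " + name);
        }
    }

    // Readers take no lock: a published segment never changes,
    // so the only special case is a segment deleted in the meantime.
    public byte[] read(String name) throws FileNotFoundException {
        byte[] buffer = chunks.get(name);
        if (buffer == null) {
            throw new FileNotFoundException("segment deleted or unknown: " + name);
        }
        return Arrays.copyOf(buffer, buffer.length);
    }

    // Deletion may race with readers; they see either the full data or nothing.
    public void delete(String name) {
        chunks.remove(name);
    }

    public static void main(String[] args) throws Exception {
        ImmutableSegmentStore store = new ImmutableSegmentStore();
        store.publish("_0.cfs", new byte[] {1, 2, 3});
        System.out.println(Arrays.toString(store.read("_0.cfs")));
        store.delete("_0.cfs");
        try {
            store.read("_0.cfs");
        } catch (FileNotFoundException e) {
            System.out.println("reader handled deletion: " + e.getMessage());
        }
    }
}
```

The point is only that a reader never observes a half-written segment,
which is why skipping the lock on get() looked safe to me.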

I'm going to try and see what happens if I remove the Flag usage, just
to see why I'm the only one reporting corrupt data in the stores. I
guess that locking might have some effect beyond my knowledge of
Infinispan.

>
> I'm away on vacation till next Friday, so don't have more time to look into it right now. I'm attaching a patch with some extra logging I've added. The key is finding out why the byte[] stored and retrieved are different. Maybe Mircea can help further from Monday onwards when he gets back from holidays?

I'm full of hope that someone will have some more time; thanks for all
the help and have a nice vacation!

Cheers,
Sanne

>
> One last thing, setting TRACE on org.infinispan is a bit exaggerated: it generates huge files and does not fail. I've been playing with setting TRACE on org.infinispan.loaders and org.infinispan.lucene
>
> Cheers,
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>
>