[infinispan-dev] L1 consistency for transactional caches.

Sanne Grinovero sanne at infinispan.org
Sun Jul 7 17:19:00 EDT 2013


On 3 July 2013 10:26, Dan Berindei <dan.berindei at gmail.com> wrote:
>
>
>
> On Tue, Jul 2, 2013 at 8:41 PM, Sanne Grinovero <sanne at infinispan.org>
> wrote:
>>
>> On 2 July 2013 17:24, Dan Berindei <dan.berindei at gmail.com> wrote:
>> > It's not wrong, sending the invalidation only from the primary owner is
>> > wrong :)
>>
>> Agreed, sending a GET operation to multiple nodes might not be wrong
>> per se, but it is the root cause of such race conditions, and of
>> other subtle complexities we might not even be aware of yet.
>>
>> I don't know why it was slower, but since the result doesn't make
>> sense we should look at it a second time rather than throwing the code
>> away.
>>
>
> It does make sense: statistically, the backup owner will sometimes reply
> faster than the primary owner.
>
> http://markmail.org/message/qmpn7yueym4tbnve
>
> http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/

Of course, I remember the discussion, but you can put many question
marks on this decision. First off, this is doubling the load on the
network, which is supposedly our most precious resource, so I highly
question how we're measuring the "benefit" of the second request.
If you read the articles you linked to, you'll see Google applies such
a strategy to improve tail latency, but only sends the second request
when the first one is not getting a fast answer, and in their tests
this seems to pay off at the cost of a mere 5% increase in network
usage. I would say that's a significantly different level of trade-off.
Also, as opposed to Google's BigTable and Apache Cassandra, which use
these techniques, Infinispan does not support an eventually consistent
model, which makes it far more dangerous to read a slightly out-of-date
value from a non-owner... sure, we can resolve those things, but it
gets hairy.
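
To make the difference concrete, the strategy from those articles
looks roughly like the sketch below: the request to a backup owner is
only fired when the primary owner is slow to answer. (Just an
illustration; the RpcClient interface and the helper names are made
up, this is not the actual Infinispan code.)

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical transport abstraction, for illustration only.
    interface RpcClient {
        CompletableFuture<Object> get(String node, Object key);
    }

    final class StaggeredGet {
        private static final ScheduledExecutorService TIMER =
                Executors.newSingleThreadScheduledExecutor();

        // Ask the primary owner first; involve a backup owner only if
        // no reply has arrived within staggerDelayMs.
        static CompletableFuture<Object> get(RpcClient rpc, Object key,
                                             String primary, String backup,
                                             long staggerDelayMs) {
            CompletableFuture<Object> result = new CompletableFuture<>();
            rpc.get(primary, key).thenAccept(result::complete);
            TIMER.schedule(() -> {
                if (!result.isDone()) {
                    rpc.get(backup, key).thenAccept(result::complete);
                }
            }, staggerDelayMs, TimeUnit.MILLISECONDS);
            return result;
        }
    }

That keeps the duplicated traffic down to the slow tail, instead of
doubling every single read.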

In this case specifically, reading only from the primary owner seems a
much "cleaner" solution, as IMHO it's a move towards simplification
rather than adding yet another special case in the codebase.

>
>>
>> Sending invalidations from a non-primary owner is an interesting
>> approach, but then we're having each owner maintain an independent
>> list of the nodes that have read the value.
>> For each write, the primary node would send an invalidation to each
>> registered node, plus the copy to the backup owners, which in turn
>> send more L1 invalidation messages to each of their registered
>> nodes... what's the likelihood of duplication of invalidation
>> messages here? Sounds like a big network traffic amplifier, with
>> lots of network traffic triggered for each write.
>>
>
> The likelihood of duplication is very near to 100%, indeed, and in non-tx
> caches it would add another RPC to the critical path.
>
> As always, it's a compromise: if we do something to speed up writes, it will
> slow down reads. Perhaps we could send the request to the primary owners
> only when L1 is enabled, as the number of remote gets should be smaller, and
> send the request to all the owners when L1 is disabled, and the number of
> remote gets is higher.
>
> Pedro's suggestion to send the request to all the owners, but only write the
> value to L1 if the first reply was from the primary owner, sounds like it
> should work just as well. It would make L1 slightly less efficient, but it
> wouldn't have latency spikes caused by a delay on the primary owner.

That's far from an ideal solution; we don't have a clue how to measure
what "slightly less efficient" means: it might turn out to be
"unbearably worse" for some usage patterns.

While we have no clue how much worse it can be, it will definitely
always provide a worse cache hit/miss ratio, so it's easy to show that
it's going to be suboptimal in all cases.

If you really go for something like that, at least take the value from
the primary owner when it arrives (second, third, ... whatever, but at
some point you should get it) and then store it in L1: that will cost
you a second unmarshalling operation, but it's far better than causing
(several?) cache misses.
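
Concretely, something like the sketch below is what I mean: whichever
owner replies first unblocks the caller, but only the primary owner's
reply is ever written to L1, even if it arrives later (that's the
second unmarshalling). Again, L1Cache and the helper names are made up
for illustration, not the real Infinispan internals.

    import java.util.concurrent.CompletableFuture;

    // Hypothetical L1 store abstraction, for illustration only.
    interface L1Cache {
        void store(Object key, Object value);
    }

    final class RemoteGetWithL1 {
        static CompletableFuture<Object> get(
                Object key,
                CompletableFuture<Object> fromPrimary,
                CompletableFuture<Object> fromBackup,
                L1Cache l1) {
            // The first reply, from whichever owner, satisfies the
            // caller.
            CompletableFuture<Object> first =
                    CompletableFuture.anyOf(fromPrimary, fromBackup);
            // Only the primary owner's value is stored in L1, so L1
            // never holds a value read from a backup owner.
            fromPrimary.thenAccept(value -> l1.store(key, value));
            return first;
        }
    }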

>
>>
>> It also implies that we can't rely on the list of registered nodes,
>> as each owner will be maintaining a different set.
>> In this case we should also have each node invalidate its L1-stored
>> entries when the node from which it got those entries has left the
>> cluster.
>>
>
> Right now we invalidate from L1 all the keys for which the list of owners
> changed, whether they're still alive or not, because we don't keep track of
> the node we got each entry from.
>
> If we only sent remote get commands to the primary owner, we'd have to
> invalidate from L1 all the keys for which the primary owner changed.
>
> One thing that we don't do at the moment, but we should do whether we send
> the invalidations from the primary owner or from all the owners, is to clean
> up the requestor lists for the keys that a node no longer owns.
>
>>
>> Having it all dealt with by the primary owner makes for a much
>> simpler design, and also makes it more likely that a single L1
>> invalidation message is sent via multicast, or at least with less
>> duplication.
>>
>
> The simplest design would be to never keep track of requestors and always
> send a multicast from the originator. In fact, the default configuration is
> to always send multicasts (but we still keep track of requestors and we send
> the invalidation from the primary owner).
>
> Intuitively, unicasts would be preferable for keys that have a low
> read:write ratio, as in a write-intensive scenario, but I wonder if
> disabling L1 wouldn't be even better for that scenario.

Well put
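
For reference, this is roughly how that trade-off surfaces in the
configuration API today, if I remember it correctly (a sketch only,
with an arbitrary threshold value): L1 can be disabled altogether, and
the invalidation threshold decides between unicast and multicast
invalidation messages.

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;

    public class L1ConfigExample {
        public static Configuration distWithL1() {
            return new ConfigurationBuilder()
                  .clustering()
                     .cacheMode(CacheMode.DIST_SYNC)
                     .l1()
                        .enable()
                        // controls when a single multicast is used for
                        // the invalidation instead of per-node unicasts
                        .invalidationThreshold(10)
                  .build();
        }
    }

So the default already favours multicast, and for write-intensive keys
disabling L1 entirely remains the other knob.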

Cheers,
Sanne

>
> Cheers
> Dan
>
>
>>
>> Cheers,
>> Sanne
>>
>>
>>
>>
>> >
>> >
>> >
>> > On Tue, Jul 2, 2013 at 7:14 PM, Sanne Grinovero <sanne at infinispan.org>
>> > wrote:
>> >>
>> >> I see, so we keep the wrong implementation because it's faster?
>> >>
>> >> :D
>> >>
>> >> On 2 July 2013 16:38, Dan Berindei <dan.berindei at gmail.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Jul 2, 2013 at 6:36 PM, Pedro Ruivo <pedro at infinispan.org>
>> >> > wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 07/02/2013 04:21 PM, Sanne Grinovero wrote:
>> >> >> > +1 for considering it a BUG
>> >> >> >
>> >> >> > Didn't we decide a year ago that GET operations should be
>> >> >> > sent to a single node only (the primary)?
>> >> >>
>> >> >> +1 :)
>> >> >>
>> >> >
>> >> > Manik had a patch for staggering remote GET calls, but it was
>> >> > slowing down reads by 25%:
>> >> > http://markmail.org/message/vsx46qbfzzxkkl4w
>> >> >
>> >> >>
>> >> >> >
>> >> >> > On 2 July 2013 15:59, Pedro Ruivo <pedro at infinispan.org> wrote:
>> >> >> >> Hi all,
>> >> >> >>
>> >> >> >> simple question: what are the consistency guarantees that
>> >> >> >> are supposed to be ensured?
>> >> >> >>
>> >> >> >> I have the following scenario (happened in a test case):
>> >> >> >>
>> >> >> >> NonOwner: sends a remote get for the key
>> >> >> >> BackupOwner: receives the remote get and replies (with the
>> >> >> >> correct value)
>> >> >> >> NonOwner: puts the value in L1
>> >> >> >> PrimaryOwner: [at the same time] is committing a transaction
>> >> >> >> that will update the key.
>> >> >> >> PrimaryOwner: receives the remote get after sending the
>> >> >> >> commit. The L1 invalidation is not sent to NonOwner.
>> >> >> >>
>> >> >> >> The test finishes and I perform a check of the key's value
>> >> >> >> in all the caches. The NonOwner returns the L1-cached value
>> >> >> >> (== test failure).
>> >> >> >>
>> >> >> >> IMO, this is a bug (or not) depending on what guarantees we
>> >> >> >> provide.
>> >> >> >>
>> >> >> >> wdyt?
>> >> >> >>
>> >> >> >> Pedro