[infinispan-dev] L1 consistency for transactional caches.

Dan Berindei dan.berindei at gmail.com
Tue Jul 9 12:35:05 EDT 2013


On Mon, Jul 8, 2013 at 12:19 AM, Sanne Grinovero <sanne at infinispan.org> wrote:

> On 3 July 2013 10:26, Dan Berindei <dan.berindei at gmail.com> wrote:
> >
> >
> >
> > On Tue, Jul 2, 2013 at 8:41 PM, Sanne Grinovero <sanne at infinispan.org>
> > wrote:
> >>
> >> On 2 July 2013 17:24, Dan Berindei <dan.berindei at gmail.com> wrote:
> > It's not wrong; sending the invalidation only from the primary owner is
> > wrong :)
> >>
> >> Agreed, sending a GET operation to multiple nodes might not be wrong
> >> per se, but it is the root cause of such race conditions, and of other
> >> subtle complexities we might not even be aware of yet.
> >>
> >> I don't know why it was slower, but since the result doesn't make
> >> sense we should look at it a second time rather than throwing the code
> >> away.
> >>
> >
> > It does make sense: statistically, the backup owner will sometimes reply
> > faster than the primary owner.
> >
> > http://markmail.org/message/qmpn7yueym4tbnve
> >
> >
> > http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/
>
> Of course, I remember the discussion, but you can put many question
> marks on this decision. First off, this is doubling the load on the
> network, which is supposedly our most precious resource, so I highly
> question how we're measuring the "benefit" of the second request.
> If you read the articles you linked to, you'll see Google applies such
> a strategy to improve tail latency, but only sends the second
> request when the first one is not getting a fast answer, and in their
> tests this pays off at the cost of a mere 5% increase in network usage.
> I would say that's a significantly different level of trade-off.
> Also, as opposed to Google's BigTable and Apache Cassandra, which use
> these techniques, Infinispan does not support an eventually
> consistent model, which makes it far more dangerous to read a
> slightly out-of-date value from the non-owner... sure, we can resolve
> those things, but it gets hairy.
>
>
> In this case specifically, reading from the primary owner only seems a
> much "cleaner" solution, as IMHO it's a move towards simplification
> rather than adding yet another special case in the codebase.
>

It's cleaner, but it slows down reads, as we've seen in our own tests.

Also, the consistency problems with staggered remote gets are the same as
the consistency problems with simultaneous remote gets. So it would be much
harder to go back and experiment with staggered gets once we simplify the
code on the assumption that we should only ever read from the primary owner.
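
For anyone who hasn't looked at Manik's patch, a staggered remote get is
roughly the sketch below. This is just an illustration, not the actual
patch: Address, remoteGet() and the 50ms delay are all made up.

import java.util.List;
import java.util.concurrent.*;

public class StaggeredGetSketch {
   interface Address {}

   private final ScheduledExecutorService timer =
         Executors.newSingleThreadScheduledExecutor();

   // Stands in for the real remote call to a single owner.
   CompletableFuture<Object> remoteGet(Address node, Object key) {
      return new CompletableFuture<>();
   }

   CompletableFuture<Object> staggeredGet(Object key, List<Address> owners) {
      CompletableFuture<Object> result = new CompletableFuture<>();
      CompletableFuture<Object> fromPrimary = remoteGet(owners.get(0), key);
      fromPrimary.thenAccept(result::complete);

      // Only contact the first backup if the primary hasn't answered within
      // the stagger delay; whichever reply arrives first completes the get.
      timer.schedule(() -> {
         if (!fromPrimary.isDone()) {
            remoteGet(owners.get(1), key).thenAccept(result::complete);
         }
      }, 50, TimeUnit.MILLISECONDS);
      return result;
   }
}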


> >
> >>
> >> Sending invalidations from a non-primary owner is an interesting
> >> approach, but then we'd have each owner maintain an independent
> >> list of nodes that have read the value.
> >> For each write, the primary node would send an invalidation to each
> >> registered node, plus the copy to the secondary nodes, which in turn
> >> send more L1 invalidation messages to each of their registered nodes...
> >> what's the likelihood of duplication of invalidation messages here?
> >> Sounds like a big network traffic amplifier, with lots of network
> >> traffic triggered for each write.
> >>
> >
> > The likelihood of duplication is very near to 100%, indeed, and in non-tx
> > caches it would add another RPC to the critical path.
> >
> > As always, it's a compromise: if we do something to speed up writes, it
> > will slow down reads. Perhaps we could send the request to the primary
> > owner only when L1 is enabled, as the number of remote gets should be
> > smaller, and send the request to all the owners when L1 is disabled and
> > the number of remote gets is higher.
> >
> > Pedro's suggestion to send the request to all the owners, but only write
> > the value to L1 if the first reply was from the primary owner, sounds
> > like it should work just as well. It would make L1 slightly less
> > efficient, but it wouldn't have latency spikes caused by a delay on the
> > primary owner.
>
> That's far from an ideal solution; we don't have a clue how to
> measure what "slightly less efficient" means: it might turn out to be
> "unbearably worse" for some usage pattern.
>

True, in a worst-case scenario the primary owner could consistently reply
slower than the others, and the entry might never be stored in L1. And it
wouldn't make sense to try this with numOwners = 10, as the chances of the
primary owner replying first would be slim.

I think my "slightly less efficient" description would be more accurate if
we had staggered gets, if we ever get that patch in...
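
Just so we're all picturing the same thing, Pedro's rule would be something
like the sketch below (FirstReplyL1Policy, Address and onFirstReply are
invented names, not the actual code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Pedro's suggestion: the first reply always answers the caller,
// but the value only goes into L1 when that first reply happens to come
// from the primary owner.
class FirstReplyL1Policy {
   interface Address {}

   private final Map<Object, Object> l1Cache = new ConcurrentHashMap<>();

   Object onFirstReply(Object key, Object value, Address sender, Address primary) {
      if (sender.equals(primary)) {
         // The primary answered first, so the value cannot be stale: cache it.
         l1Cache.put(key, value);
      }
      // Otherwise skip L1 for this read and accept a possible miss later.
      return value; // the caller gets the value either way
   }
}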


>
> While we have no clue how much worse it can be, it will definitely always
> provide a worse cache hit/miss ratio, so it's easily proven that it's
> going to be suboptimal in all cases.
>

If numOwners = 1, there is only one way to read the value, so it's clearly
optimal in one case :)


> If you really go for something like that, at least take the value from
> the primary owner when it arrives (second, third, ... whatever, but at
> some point you should get it) and then store it in L1: it will cost you a
> second unmarshalling operation but that's far better than causing
> (several?) cache misses.
>

That sounds nice; I wonder how easily Will could make it work with the L1
synchronization stuff he has implemented.
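
Something along these lines, I suppose (again just a sketch with invented
names, ignoring how it would hook into Will's L1 synchronization changes):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Sanne's refinement: answer the caller with whichever owner
// replies first, but only populate L1 with the value that eventually
// arrives from the primary owner. It costs a second unmarshalling, but no
// L1 entries are lost just because a backup happened to win the race.
class PrimaryValueL1Policy {
   interface Address {}

   private final Map<Object, Object> l1Cache = new ConcurrentHashMap<>();

   // Stands in for the real remote call to a single owner.
   CompletableFuture<Object> remoteGet(Address node, Object key) {
      return new CompletableFuture<>();
   }

   CompletableFuture<Object> get(Object key, Address primary, Address backup) {
      CompletableFuture<Object> fromPrimary = remoteGet(primary, key);
      CompletableFuture<Object> fromBackup = remoteGet(backup, key);

      // Whenever the primary's reply shows up, cache that value in L1.
      fromPrimary.thenAccept(value -> l1Cache.put(key, value));

      // The caller doesn't wait for the primary if the backup is faster.
      return fromPrimary.applyToEither(fromBackup, value -> value);
   }
}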

 >
> >>
> >> It also implies that the list of registered nodes isn't reliable,
> >> as each owner will be maintaining a different set.
> >> In this case we should also have each node invalidate its L1-stored
> >> entries when the node from which it got those entries has left the
> >> cluster.
> >>
> >
> > Right now we invalidate from L1 all the keys for which the list of
> > owners changed, whether they're still alive or not, because we don't
> > keep track of the node we got each entry from.
> >
> > If we only sent remote get commands to the primary owner, we'd have to
> > invalidate from L1 all the keys for which the primary owner changed.
> >
> > One thing that we don't do at the moment, but should do whether we send
> > the invalidations from the primary owner or from all the owners, is to
> > clean up the requestor lists for the keys that a node no longer owns.
> >
> >>
> >> Having it all dealt with by the primary owner makes for a much simpler
> >> design and also makes it more likely that a single L1 invalidate
> >> message is sent via multicast, or at least with less duplication.
> >>
> >
> > The simplest design would be to never keep track of requestors and
> > always send a multicast from the originator. In fact, the default
> > configuration is to always send multicasts (but we still keep track of
> > requestors and we send the invalidation from the primary owner).
> >
> > Intuitively, unicasts would be preferable for keys that have a low
> > read:write ratio, as in a write-intensive scenario, but I wonder if
> > disabling L1 wouldn't be even better for that scenario.
>
> Well put
>
> Cheers,
> Sanne
>
> >
> > Cheers
> > Dan
> >
> >
> >>
> >> Cheers,
> >> Sanne
> >>
> >>
> >>
> >>
> >> >
> >> >
> >> >
> >> > On Tue, Jul 2, 2013 at 7:14 PM, Sanne Grinovero
> >> > <sanne at infinispan.org> wrote:
> >> >>
> >> >> I see, so we keep the wrong implementation because it's faster?
> >> >>
> >> >> :D
> >> >>
> >> >> On 2 July 2013 16:38, Dan Berindei <dan.berindei at gmail.com> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Tue, Jul 2, 2013 at 6:36 PM, Pedro Ruivo <pedro at infinispan.org>
> >> >> > wrote:
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 07/02/2013 04:21 PM, Sanne Grinovero wrote:
> >> >> >> > +1 for considering it a BUG
> >> >> >> >
> >> >> >> > Didn't we decide a year ago that GET operations should be sent
> >> >> >> > to a single node only (the primary)?
> >> >> >>
> >> >> >> +1 :)
> >> >> >>
> >> >> >
> >> >> > Manik had a patch for staggering remote GET calls, but it was
> >> >> > slowing down reads by 25%:
> >> >> > http://markmail.org/message/vsx46qbfzzxkkl4w
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> > On 2 July 2013 15:59, Pedro Ruivo <pedro at infinispan.org> wrote:
> >> >> >> >> Hi all,
> >> >> >> >>
> >> >> >> >> simple question: What are the consistency guarantees that are
> >> >> >> >> supposed to be ensured?
> >> >> >> >>
> >> >> >> >> I have the following scenario (happened in a test case):
> >> >> >> >>
> >> >> >> >> NonOwner: remote get key
> >> >> >> >> BackupOwner: receives the remote get and replies (with the
> >> >> >> >> correct value)
> >> >> >> >> NonOwner: puts the value in L1
> >> >> >> >> PrimaryOwner: [at the same time] is committing a transaction
> >> >> >> >> that will update the key.
> >> >> >> >> PrimaryOwner: receives the remote get after sending the commit.
> >> >> >> >> The invalidation for L1 is not sent to NonOwner.
> >> >> >> >>
> >> >> >> >> The test finishes and I perform a check for the key value in
> >> >> >> >> all the caches. The NonOwner returns the L1 cached value
> >> >> >> >> (== test fail).
> >> >> >> >>
> >> >> >> >> IMO, this is a bug (or not) depending on what guarantees we
> >> >> >> >> provide.
> >> >> >> >>
> >> >> >> >> wdyt?
> >> >> >> >>
> >> >> >> >> Pedro