[infinispan-dev] L1 consistency for transactional caches.

William Burns mudokonman at gmail.com
Wed Jul 10 10:11:38 EDT 2013


On Tue, Jul 9, 2013 at 9:35 AM, Dan Berindei <dan.berindei at gmail.com> wrote:
>
>
>
> On Mon, Jul 8, 2013 at 12:19 AM, Sanne Grinovero <sanne at infinispan.org>
> wrote:
>>
>> On 3 July 2013 10:26, Dan Berindei <dan.berindei at gmail.com> wrote:
>> >
>> >
>> >
>> > On Tue, Jul 2, 2013 at 8:41 PM, Sanne Grinovero <sanne at infinispan.org>
>> > wrote:
>> >>
>> >> On 2 July 2013 17:24, Dan Berindei <dan.berindei at gmail.com> wrote:
>> >> > It's not wrong; sending the invalidation only from the primary
>> >> > owner is wrong :)
>> >>
>> >> Agreed, sending a GET operation to multiple nodes might not be wrong
>> >> per se, but it is the root cause of such race conditions, and of
>> >> other subtle complexities we might not even be aware of yet.
>> >>
>> >> I don't know why it was slower, but since the result doesn't make
>> >> sense we should look at it a second time rather than throwing the code
>> >> away.
>> >>
>> >
>> > It does make sense: statistically, the backup owner will sometimes reply
>> > faster than the primary owner.
>> >
>> > http://markmail.org/message/qmpn7yueym4tbnve
>> >
>> >
>> > http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/
>>
>> Of course, I remember the discussion, but you can put many question
>> marks on this decision. First off, it doubles the load on the
>> network, which is supposedly our most precious resource, so I highly
>> question how we're measuring the "benefit" of the second request.
>> If you read the articles you linked to, you'll see Google applies
>> such a strategy to improve tail latency, but only sends the second
>> request when the first one is not getting a fast answer, and in
>> their tests this pays off at a mere 5% increase in network usage.
>> I would say that's a significantly different level of trade-off.
>> Also, as opposed to Google's BigTable and Apache Cassandra, which
>> use these techniques, Infinispan does not support an eventually
>> consistent model, which makes it far more dangerous to read a
>> slightly out-of-date value from the non-owner... sure, we can
>> resolve those things, but it gets hairy.
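
For the record, a Google-style staggered get would look roughly like this
in our terms (a sketch only; the owner addresses and invokeRemoteGet()
are placeholders, not our actual RPC API):

    import java.util.concurrent.*;

    abstract class StaggeredGet {
        static final long STAGGER_DELAY_MS = 5;
        final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

        // Placeholder for the real transport call.
        abstract CompletableFuture<Object> invokeRemoteGet(Object owner,
                                                           Object key);

        CompletableFuture<Object> get(Object key, Object primary,
                                      Object backup) {
            CompletableFuture<Object> first = invokeRemoteGet(primary, key);
            CompletableFuture<Object> second = new CompletableFuture<>();
            // Fan out to the backup only if the primary is slow, so the
            // extra traffic is only paid on the tail of the latency curve.
            timer.schedule(() -> {
                if (!first.isDone())
                    invokeRemoteGet(backup, key).whenComplete((v, t) -> {
                        if (t == null) second.complete(v);
                        else second.completeExceptionally(t);
                    });
            }, STAGGER_DELAY_MS, TimeUnit.MILLISECONDS);
            // Whichever owner answers first wins.
            return CompletableFuture.anyOf(first, second).thenApply(v -> v);
        }
    }
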
>>
>>
>> In this case specifically, reading from the primary owner only seems a
>> much "cleaner" solution, as IMHO it's a move towards simplification
>> rather than adding yet another special case in the codebase.
>
>
> It's cleaner, but it slows down reads, as we've seen in our own tests.
>
> Also, the consistency problems with staggered remote gets are the same as
> the consistency problems with simultaneous remote gets. So it would be much
> harder to go back and experiment with staggered gets once we simplify the
> code on the assumption that we should only ever read from the primary owner.
>
>>
>> >
>> >>
>> >> Sending invalidations from a non-primary owner is an interesting
>> >> approach, but then we'd have each owner maintain an independent
>> >> list of the nodes that have read the value.
>> >> For each write, the primary node would send an invalidation to each
>> >> registered node, plus the copy to the secondary nodes, which in turn
>> >> send more L1 invalidation messages to each of their registered
>> >> nodes... what's the likelihood of duplicate invalidation messages
>> >> here? Sounds like a big network traffic amplifier, with lots of
>> >> traffic triggered by each write.
>> >>
>> >
>> > The likelihood of duplication is very nearly 100%, indeed, and in
>> > non-tx caches it would add another RPC to the critical path.
>> >
>> > As always, it's a compromise: if we do something to speed up writes,
>> > it will slow down reads. Perhaps we could send the request to the
>> > primary owner only when L1 is enabled, as the number of remote gets
>> > should be smaller, and send the request to all the owners when L1 is
>> > disabled and the number of remote gets is higher.
>> >
>> > Pedro's suggestion to send the request to all the owners, but only
>> > write the value to L1 if the first reply was from the primary owner,
>> > sounds like it should work just as well. It would make L1 slightly
>> > less efficient, but it wouldn't have latency spikes caused by a delay
>> > on the primary owner.
>>
>> That's far from an ideal solution; we don't have a clue how to
>> measure what "slightly less efficient" means: it might turn out to be
>> "unbearably worse" for some usage patterns.
>
>
> True, in a worst-case scenario the primary owner could consistently
> reply slower than the others, and the entry might never be stored in
> L1. And it wouldn't make sense to try this with numOwners = 10, as the
> chances of the primary owner replying first would be slim.
>
> I think my "slightly less efficient" description would be more accurate
> if we had staggered gets... if we ever get that patch in.
>
>>
>>
>> While we have no clue how much worse it can be, it will definitely
>> always yield a worse cache hit/miss ratio, so it's easily shown to be
>> suboptimal in all cases.
>
>
> If numOwners = 1 there is only one way to read the value, so it's clearly
> optimal in one case :)
>
>>
>> If you really go for something like that, at least take the value from
>> the primary owner when it arrives (second, third, whatever; at some
>> point you will get it) and then store it in L1: it will cost you a
>> second unmarshalling operation, but that's far better than causing
>> (several?) cache misses.
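
Concretely, I'd picture that as something like this (rough sketch; the
owner lookups, the transport call and the l1Cache map are stand-ins, not
the real classes):

    import java.util.Map;
    import java.util.concurrent.*;

    abstract class L1CachingGet {
        final Map<Object, Object> l1Cache = new ConcurrentHashMap<>();

        abstract Object primaryOwner(Object key);
        abstract Object backupOwner(Object key);
        abstract CompletableFuture<Object> invokeRemoteGet(Object owner,
                                                           Object key);

        CompletableFuture<Object> get(Object key) {
            CompletableFuture<Object> primary =
                invokeRemoteGet(primaryOwner(key), key);
            CompletableFuture<Object> backup =
                invokeRemoteGet(backupOwner(key), key);
            // Answer the caller with whichever reply lands first...
            CompletableFuture<Object> result =
                CompletableFuture.anyOf(primary, backup).thenApply(v -> v);
            // ...but populate L1 only from the primary's reply, whenever
            // it arrives. That costs a second unmarshalling when the
            // backup won the race, but never caches a value the primary
            // didn't vouch for.
            primary.thenAccept(value -> l1Cache.put(key, value));
            return result;
        }
    }
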
>
>
> That sounds nice; I wonder how easily Will could make it work with the
> L1 synchronization stuff he has implemented.

Yeah, it sounds like that would be a perfect fit here, since you would
want any write/invalidation that occurs for that key to cancel the L1
update from a primary-owner get that hasn't completed yet.
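
Roughly along these lines (a sketch of the idea, not the actual
L1Manager code; real code would also need per-key ordering between the
put and a concurrent invalidation):

    import java.util.Map;
    import java.util.concurrent.*;

    class PendingL1Writes {
        final Map<Object, CompletableFuture<Object>> pending =
            new ConcurrentHashMap<>();
        final Map<Object, Object> l1Cache = new ConcurrentHashMap<>();

        // Called when the remote get to the primary owner is issued.
        void registerPendingUpdate(Object key,
                                   CompletableFuture<Object> reply) {
            pending.put(key, reply);
            reply.thenAccept(value -> {
                // Store in L1 only if nothing invalidated the key in
                // the meantime.
                if (pending.remove(key, reply))
                    l1Cache.put(key, value);
            });
        }

        // Called when a write or an L1 invalidation for the key arrives.
        void invalidate(Object key) {
            CompletableFuture<Object> inFlight = pending.remove(key);
            if (inFlight != null)
                inFlight.cancel(false); // drop the stale update
            l1Cache.remove(key);
        }
    }
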

>
>> >
>> >>
>> >> It also implies that the list of registered nodes is not reliable,
>> >> as each owner will be maintaining a different set. In that case we
>> >> should also have each node invalidate its L1-stored entries when the
>> >> node from which it got them leaves the cluster.
>> >>
>> >
>> > Right now we invalidate from L1 all the keys for which the list of
>> > owners changed, whether the owners are still alive or not, because we
>> > don't keep track of the node we got each entry from.
>> >
>> > If we only sent remote get commands to the primary owner, we'd have to
>> > invalidate from L1 all the keys for which the primary owner changed.
>> >
>> > One thing that we don't do at the moment, but should do whether we
>> > send the invalidations from the primary owner or from all the owners,
>> > is to clean up the requestor lists for the keys that a node no longer
>> > owns.
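
For what it's worth, both cleanups could hang off the same topology
hook, something like this (a sketch; ConsistentHash here is just a
placeholder for the real owner-lookup interface):

    import java.util.*;
    import java.util.concurrent.*;

    abstract class L1TopologyHandler {
        interface ConsistentHash { } // placeholder

        final Map<Object, Object> l1Cache = new ConcurrentHashMap<>();
        final Map<Object, Set<Object>> requestors =
            new ConcurrentHashMap<>();

        abstract Object primaryOwner(Object key, ConsistentHash ch);
        abstract boolean isOwner(Object key, ConsistentHash ch);

        void onTopologyChange(ConsistentHash oldCH, ConsistentHash newCH) {
            // If reads go only to the primary, an L1 entry stays valid
            // exactly as long as its primary owner does not change.
            l1Cache.keySet().removeIf(key ->
                !primaryOwner(key, oldCH).equals(primaryOwner(key, newCH)));
            // Drop requestor lists for keys this node no longer owns,
            // so they don't leak across rebalances.
            requestors.keySet().removeIf(key -> !isOwner(key, newCH));
        }
    }
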
>> >
>> >>
>> >> Having it all dealt with by the primary owner makes for a much
>> >> simpler design, and also makes it more likely that a single L1
>> >> invalidation message is sent via multicast, or at least with less
>> >> duplication.
>> >>
>> >
>> > The simplest design would be to never keep track of requestors and
>> > always send a multicast from the originator. In fact, the default
>> > configuration is to always send multicasts (but we still keep track
>> > of requestors, and we send the invalidation from the primary owner).
>> >
>> > Intuitively, unicasts would be preferable for keys that have a low
>> > read:write ratio, as in a write-intensive scenario, but I wonder if
>> > disabling L1 wouldn't be even better for that scenario.
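
That choice is basically what the L1 invalidation threshold is meant to
capture; the decision boils down to something like this (sketch only,
with placeholder send methods):

    import java.util.Set;

    abstract class L1Invalidator {
        abstract void unicastInvalidate(Object node, Object key);
        abstract void multicastInvalidate(Object key);

        // Below the threshold, targeted unicasts keep the noise off the
        // rest of the cluster; above it, one multicast is cheaper than
        // many unicasts.
        void sendInvalidation(Object key, Set<Object> keyRequestors,
                              int threshold) {
            if (threshold > 0 && keyRequestors.size() <= threshold)
                for (Object node : keyRequestors)
                    unicastInvalidate(node, key);
            else
                multicastInvalidate(key);
        }
    }
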
>>
>> Well put
>>
>> Cheers,
>> Sanne
>>
>> >
>> > Cheers
>> > Dan
>> >
>> >
>> >>
>> >> Cheers,
>> >> Sanne
>> >>
>> >> >
>> >> > On Tue, Jul 2, 2013 at 7:14 PM, Sanne Grinovero
>> >> > <sanne at infinispan.org>
>> >> > wrote:
>> >> >>
>> >> >> I see, so we keep the wrong implementation because it's faster?
>> >> >>
>> >> >> :D
>> >> >>
>> >> >> On 2 July 2013 16:38, Dan Berindei <dan.berindei at gmail.com> wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jul 2, 2013 at 6:36 PM, Pedro Ruivo <pedro at infinispan.org>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On 07/02/2013 04:21 PM, Sanne Grinovero wrote:
>> >> >> >> > +1 for considering it a BUG
>> >> >> >> >
>> >> >> >> > Didn't we decide a year ago that GET operations should be
>> >> >> >> > sent to a single node only (the primary)?
>> >> >> >>
>> >> >> >> +1 :)
>> >> >> >>
>> >> >> >
>> >> >> > Manik had a patch for staggering remote GET calls, but it was
>> >> >> > slowing down reads by 25%:
>> >> >> > http://markmail.org/message/vsx46qbfzzxkkl4w
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> > On 2 July 2013 15:59, Pedro Ruivo <pedro at infinispan.org> wrote:
>> >> >> >> >> Hi all,
>> >> >> >> >>
>> >> >> >> >> simple question: what consistency guarantees are we
>> >> >> >> >> supposed to ensure?
>> >> >> >> >>
>> >> >> >> >> I have the following scenario (happened in a test case):
>> >> >> >> >>
>> >> >> >> >> NonOwner: remote get for the key
>> >> >> >> >> BackupOwner: receives the remote get and replies (with the
>> >> >> >> >> correct value)
>> >> >> >> >> NonOwner: puts the value in L1
>> >> >> >> >> PrimaryOwner: [at the same time] is committing a transaction
>> >> >> >> >> that will update the key
>> >> >> >> >> PrimaryOwner: receives the remote get after sending the
>> >> >> >> >> commit, so the L1 invalidation is not sent to NonOwner
>> >> >> >> >>
>> >> >> >> >> The test finishes and I check the key's value in all the
>> >> >> >> >> caches. The NonOwner returns the L1-cached value (== test
>> >> >> >> >> failure).
>> >> >> >> >>
>> >> >> >> >> IMO, this is a bug (or not) depending on what guarantees we
>> >> >> >> >> provide.
>> >> >> >> >>
>> >> >> >> >> wdyt?
>> >> >> >> >>
>> >> >> >> >> Pedro