[infinispan-dev] DIST-SYNC, put(), a problem and a solution

Dan Berindei dan.berindei at gmail.com
Tue Aug 12 10:05:54 EDT 2014


Sorry for the long delay, Sanne!


On Wed, Aug 6, 2014 at 9:50 PM, Sanne Grinovero <sanne at infinispan.org>
wrote:

> On 6 August 2014 18:49, Dan Berindei <dan.berindei at gmail.com> wrote:
> >
> >
> >
> > On Wed, Aug 6, 2014 at 6:19 PM, Bela Ban <bban at redhat.com> wrote:
> >>
> >> Hey Dan,
> >>
> >> On 06/08/14 16:13, Dan Berindei wrote:
> >> > I could create the issue in JIRA, but I wouldn't make it high priority
> >> > because I think it will have lots of corner cases with NBST and cause
> >> > headaches for the maintainers of state transfer ;)
> >>
> >> I do believe the put-while-holding-the-lock issue *is* a critical issue;
> >> anyone banging a cluster of Infinispan nodes with more than 1 thread
> >> will run into lock timeouts, with or without transactions. The only
> >> workaround for now is to use total order, but at the cost of reduced
> >> performance. However, once a system starts hitting the lock timeout
> >> issues, performance drops to a crawl, way slower than TO, and work
> >> starts to pile up, which compounds the problem.
> >
> >
> > I wouldn't call it critical because you can always increase the number of
> > threads. It won't be pretty, but it will work around the thread
> > exhaustion issue.
>
> If Infinispan doesn't do it automatically, I wouldn't count that as a
> solution.
> Consider that the project goal is to make it easy to scale up/down
> dynamically... if it requires experts to be on alert all the time for
> such manual interventions, it's a failure.
> Besides, I can count on one hand the people able to figure out such
> trouble... so I agree this is critical.
>
>
I agree it's not a solution, just a workaround.

I also agree that when something (unrelated) goes wrong, the thread pools
can fill up quite quickly, and then it becomes hard (or even impossible) to
recover. So the pool sizes have to be big enough to handle a backlog of 10
seconds or more (until FD/FD_ALL suspects a crashed node), not just regular
operation.
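
To put rough numbers on that, here is a back-of-the-envelope sketch (the
figures are assumptions for illustration, not Infinispan defaults):

    // Back-of-the-envelope pool sizing; all numbers are assumed.
    public class PoolSizingSketch {
        public static void main(String[] args) {
            int requestsPerSecond = 1000;    // steady-state sync RPCs per node
            double shareToDeadNode = 0.25;   // fraction owned by the crashed node
            int suspectTimeoutSeconds = 10;  // window until FD/FD_ALL suspects it

            // Each sync RPC to the dead node parks a thread for the whole
            // window, so the backlog the pool has to absorb is roughly:
            int parkedThreads =
                (int) (requestsPerSecond * shareToDeadNode * suspectTimeoutSeconds);
            System.out.println(parkedThreads + " threads parked before suspicion");
            // => 2500 threads, far beyond a pool sized for regular operation
        }
    }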


>
> >> I believe doing a sync RPC while holding the lock on a key is asking for
> >> trouble and is (IMO) an anti-pattern.
> >
> >
> > We also hold a lock on a key between the LockControlCommand and the
> > TxCompletionNotificationCommand in pessimistic-locking caches, and
> > there's at least one sync PrepareCommand RPC between them...
> >
> > So I don't see it as an anti-pattern; the only problem is that we should
> > be able to do that without blocking internal threads in addition to the
> > user thread (which is how tx caches do it).
> >
> >>
> >> Sorry if this has a negative impact on NBST, but should we not fix this
> >> because we don't want to risk a change to NBST ?
> >
> >
> > I'm not saying it will have a negative impact on NBST, I'm just saying I
> > don't want to start implementing an incomplete proposal for the basic
> > flow and leave the state transfer/topology change issues for "later".
> > What happens when a node leaves, when a backup owner is added, or when
> > the primary owner changes should be part of the initial discussion, not
> > an afterthought.
>
> Absolutely!
> No change should be done leaving questions open, and I don't presume I
> suggested a solution; I was just trying to start a conversation on
> using "such a pattern".
> But I also believe we already had such conversations in past meetings,
> so my words were terse and short because I just wanted to remind
> everyone about those.
>
>
I don't recall discussing the non-tx locking scheme before, only the state
machine approach...


>
> > E.g. with your proposal, any updates in the replication queue on the
> > primary owner will be lost when that primary owner dies, even though we
> > told the user that we successfully updated the key. To quote from my
> > first email on this thread: "OTOH, if the primary owner dies, we have to
> > ask a backup, and we can lose the modifications not yet replicated by
> > the primary."
> >
> > With Sanne's proposal, we wouldn't report to the user that we stored the
> > value until all the backups confirmed the update, so we wouldn't have
> > that problem. But I don't see how we could keep the sequence of versions
> > monotonic when the primary owner of the key changes without some extra
> > sync RPCs (also done while holding the key lock). IIRC TOA also needs
> > some sync RPCs to generate its sequence numbers.
>
> I don't know how NBST v.21 is working today, but I trust it doesn't
> lose writes, and I think we should break the problem down into smaller
> problems; in this case I hope to build on the solid foundations of
> NBST.
>

Except that NBST preserves written data but doesn't care about writes in
progress (at least in non-tx caches) ;)

The replication algorithm takes care of that: when a backup owner sees a
newer topology, it throws an OutdatedTopologyException; the originator
receives the exception, waits to receive the new topology, and retries the
operation on the new (or the same) primary owner (rough sketch after the
links below). Still, there are times when this isn't enough:

https://issues.jboss.org/browse/ISPN-3830
https://issues.jboss.org/browse/ISPN-4286
https://issues.jboss.org/browse/ISPN-3918
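
The originator-side retry loop would look roughly like this (a simplified
sketch; OutdatedTopologyException is the exception mentioned above, but the
Rpc helper and its methods are made up for illustration):

    // Simplified sketch of the retry-on-topology-change described above.
    class OutdatedTopologyException extends RuntimeException {}

    interface Rpc {
        int currentTopologyId();
        Object invokeOnPrimary(Object writeCommand, int topologyId);
        void awaitTopology(int topologyId) throws InterruptedException;
    }

    class RetryingInvoker {
        private final Rpc rpc;

        RetryingInvoker(Rpc rpc) { this.rpc = rpc; }

        Object invokeWithRetry(Object writeCommand) throws InterruptedException {
            while (true) {
                int topologyId = rpc.currentTopologyId();
                try {
                    // Sync RPC to the primary owner in the current topology.
                    return rpc.invokeOnPrimary(writeCommand, topologyId);
                } catch (OutdatedTopologyException e) {
                    // Some owner saw a newer topology: wait until it arrives
                    // locally too, then retry on the new (or same) primary.
                    rpc.awaitTopology(topologyId + 1);
                }
            }
        }
    }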

We were considering using random client-generated version numbers to fix
some of the issues: when retrying, the client would use the same version
number, making it easy to detect whether the update has already been
applied on a particular node. There would still be a chance of ending up
with inconsistent data if the client is the same node as the primary owner
(or they both die at approximately the same time) - I don't think we can
ever fix that in non-tx mode.
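
As a sketch of what duplicate detection on an owner could look like with
such client-generated versions (all names here are hypothetical):

    // A retried command carries the same random version as the original
    // attempt, so an owner can tell it already applied that exact update.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    class VersionedStore {
        private final Map<String, Object> data = new HashMap<>();
        private final Map<String, UUID> lastApplied = new HashMap<>();

        /** @return false if this exact (key, version) update was already applied */
        synchronized boolean put(String key, Object value, UUID clientVersion) {
            if (clientVersion.equals(lastApplied.get(key))) {
                return false; // duplicate delivery of a retry: skip it
            }
            data.put(key, value);
            lastApplied.put(key, clientVersion);
            return true;
        }
    }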



> When the key is re-possessed by a new node (and this starts to
> generate "reference" write commands), you could restart the sequences:
> you don't need a universal monotonic number; all the backup owners
> need is an ordering rule, and to understand that the commands coming
> from the new owner are more recent than those from the old owner.
> AFAIK you already have the notion of a view generation id?
>

"universal monotonic number" and a global "ordering rule" sound exactly the
same to me :)
We do have the notion of a topology id (we didn't use "view" to avoid
confusion with JGroups views).



> Essentially we'd need to store together with the entry not only its
> sequence but also the viewId. It's a very simplified (compact) vector
> clock, because in practice from this viewId we can extrapolate
> addresses and owners... but it is simpler than the full pattern, as
> you only need the last one, since the longer tail of events is
> handled by NBST, I think?
>

You're right, adding the topology id to a monotonic per-node version number
would give you a monotonic sequence (with holes in it).
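
The comparison rule would be something like this (an illustrative sketch,
not actual Infinispan code):

    // Order by topology id first, so a write from a newer topology wins
    // even though the per-node sequence restarts when ownership moves
    // (hence the holes in the sequence).
    class EntryVersion implements Comparable<EntryVersion> {
        final int topologyId; // topology in which the primary issued the write
        final long seqno;     // primary-assigned, monotonic within a topology

        EntryVersion(int topologyId, long seqno) {
            this.topologyId = topologyId;
            this.seqno = seqno;
        }

        @Override
        public int compareTo(EntryVersion other) {
            int byTopology = Integer.compare(topologyId, other.topologyId);
            return byTopology != 0 ? byTopology : Long.compare(seqno, other.seqno);
        }
    }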

But when we retry a command, the new primary owner would generate a
different version number, so we would get a different ordering for the same
write operation. We could return the version number to the client and allow
it to retry with the same version number, but that would still fail if the
primary owner died.



>
> One catch: I think you need tombstones, but those are already needed
> for so many things that we can't avoid them :)
>

We're not talking about removes yet, so we can postpone the discussion
about tombstones for now :)



>
> Cheers,
> Sanne
>
>
> >
> >>
> >> > Besides, I'm still not sure I understood your proposals properly, e.g.
> >> > whether they are meant only for non-tx caches or you want to change
> >> > something for tx caches as well...
> >>
> >> I think this can be used for both cases; however, I think either
> >> Sanne's solution of using seqnos *per key* and updating in the order
> >> of seqnos, or using Pedro's total order impl, is probably a better
> >> solution.
> >>
> >> I'm not claiming these solutions are final (e.g. Sanne's solution
> >> needs more thought when multiple keys are involved), but we should at
> >> least acknowledge the issue exists, create a JIRA to prioritize it and
> >> then start discussing solutions.
> >>
> >
> > We've been discussing solutions without a JIRA just fine :)
> >
> > My feeling so far is that the thread exhaustion problem would be better
> > served by porting TO to non-tx caches and/or changing non-tx locking to
> > not require a thread. I have created an issue for TO [1], but IMO the
> > locking rework [2] should be higher priority, as it can help both tx and
> > non-tx caches.
> >
> > [1] https://issues.jboss.org/browse/ISPN-4610
> > [2] https://issues.jboss.org/browse/ISPN-2849
> >
> >>
> >> >
> >> >
> >> > On Wed, Aug 6, 2014 at 1:02 PM, Bela Ban <bban at redhat.com> wrote:
> >> >
> >> >     Seems like this discussion has died with the general agreement
> >> >     that this is broken and with a few proposals on how to fix it,
> >> >     but without any follow-up action items.
> >> >
> >> >     I think we (= someone from the ISPN team) need to create a JIRA,
> >> >     preferably blocking.
> >> >
> >> >     WDYT ?
> >> >
> >> >     If not, here's what our options are:
> >> >
> >> >     #1 I'll create a JIRA
> >> >
> >> >     #2 We'll hold the team meeting in Krasnojarsk, Russia
> >> >
> >> >     #3 There will be only vodka, no beers in #2
> >> >
> >> >     #4 Bela will join the ISPN team
> >> >
> >> >     Thoughts ?
> >>
> >>
> >> --
> >> Bela Ban, JGroups lead (http://www.jgroups.org)