[infinispan-dev] query repl timeout

Mon Apr 8 08:04:18 EDT 2013

>> Ales is this error happening after a node failure?

No node failure that I'm aware of.

We did get some unexpected NPE in DataNucleus framework,
but, imo, that shouldn't completely kill the app.

We'll re-try.

And then also re-try with no locking.

> Or make something clever based on JGroups views
> 
> default.locking_strategy = fully.qualified.custom.Implementation

@Bela, Sanne: how would this look like?

As this looks like the best workaround for now -- if this is really the issue.

-Ales

On Apr 8, 2013, at 1:19 PM, Sanne Grinovero <sanne at infinispan.org> wrote:

> There should be no locking contention at all, that is the whole point
> of using such a backend and forwarding changes to a single node: that
> only a single node ever attempts to acquire this lock. Hence the error
> is a simptom of some previous error, I primarily suspect cluster view
> stability.
> 
> I indeed have committed an experimental alternative backend in
> Infinispan Query (included in master) and another one in Hibernate
> Search (not master but a branch I'm working on);
> the one in Hibernate Search is meant to superseed the others but it's
> not working yet as I need CAS to be working in Infinispan, and this is
> still failing my tests.
> 
> The result of failing CAS is the master election: multiple nodes elect
> themselves, which results in the locking error.
> 
> Ales is this error happening after a node failure? AFAIK the missing
> feature of the JGroups based backend is that it doesn't cleanup stale
> index locks when a master fails; each master node releases the lock as
> soon as possible (as you have set exclusive_index_use=false) but if
> the node is disconnected exactly during the write operation the lock
> will need to be cleaned up forcefully. I would normally expect this to
> be very unlikely but it could be triggered if you have view stability
> problems.
> We could try integrating some kind of force-lock clean operation but
> it's quite tricky to make sure this happens safely.. there is of
> course a purpose for this lock.
> 
> You could try turning off the seatbelt by setting
> 
> default.locking_strategy = none
> 
> Or make something clever based on JGroups views
> 
> default.locking_strategy = fully.qualified.custom.Implementation
> 
> I do have a cleaner solution with proper lock cleanup routines, but
> these are based on the CAS operation too.. they are failing stress
> tests so I won't commit them yet.
> 
> Sanne
> 
> On 8 April 2013 11:38, Manik Surtani <msurtani at redhat.com> wrote:
>> 
>> On 8 Apr 2013, at 11:28, Ales Justin <ales.justin at gmail.com> wrote:
>> 
>> This "jgroups" backend was there "long" ago.
>> And it was actually us - CD - that fixed it and made use of it.
>> It's no different from static JGroups backed, the only diff that this one
>> elects master automatically.
>> 
>> I can change to Sanne's new Ispn based prototype if it will help.
>> 
>> But - with my limited cluster knowledge - the issue doesn't look to be
>> there.
>> I mean, the msgs get properly routed to indexing master, which just cannot
>> handle locking contention.
>> 
>> 
>> Any thoughts on this, Sanne?
>> 
>> 
>> -Ales
>> 
>> I believe this new backend is WIP in Hibernate Search.  Sanne, didn't you
>> have a prototype in Infinispan's codebase though?
>> 
>> On 5 Apr 2013, at 15:28, Ales Justin <ales.justin at gmail.com> wrote:
>> 
>> are you not using the JGroups backend anymore?
>> 
>> 
>> I'm using that "jgroups" backend, with auto-master election.
>> 
>> these Lock acquisitions are on the index lock, and make me suspect your
>> configuration is no longer applying the pattern we discussed a while back,
>> when you contributed the fixed to the JGroups indexing backend.
>> 
>> Or is it the "Replication timeout for mstruk/capedwarf" which is causing
>> those locking errors?
>> 
>> 
>> No idea.
>> 
>> btw: didn't you say you had some new backend mechanism?
>> Off Infinispan's channel.
>> 
>> -Ales
>> 
>> On 5 April 2013 14:56, Ales Justin <ales.justin at gmail.com> wrote:
>>> 
>>> We're running a GAE HelloOrm2 example app on 3 nodes (3 laptops).
>>> 
>>> Very soon after deploy, we get a never-ending stack of timeouts,
>>> which completely kills the app:
>>> * https://gist.github.com/alesj/5319414
>>> 
>>> I then need to kill the AS in order to get it shutdown.
>>> 
>>> How can this be tuned / fixed?
>>> 
>>> -Ales
>>> 
>>> 
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> --
>> Manik Surtani
>> manik at jboss.org
>> twitter.com/maniksurtani
>> 
>> Platform Architect, JBoss Data Grid
>> http://red.ht/data-grid
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
>> 
>> --
>> Manik Surtani
>> manik at jboss.org
>> twitter.com/maniksurtani
>> 
>> Platform Architect, JBoss Data Grid
>> http://red.ht/data-grid
>> 
>> 
>> _______________________________________________
>> infinispan-dev mailing list
>> infinispan-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev