PFER lock aquisitions sometimes lasting longer than 0ms

Friday, 19 February 2010

As a FYI. We've detected that sometimes PFER lock acquisition succeed  
after > 0ms, which is odd. We're looking into this.

------- Forwarded message -------
From: "Galder Zamarreno" <galder(a)redhat.com&gt;
To: "Galder Zamarreno" <galder(a)redhat.com&gt;
Cc:
Subject: Re: Fwd: Re: Random ConcurrentWriteTest.testManyUsers() failure
Date: Fri, 19 Feb 2010 09:59:46 +0100

As a fyi as well, OwnableReentrantStripedLockContainer is in use here
hence the lock impl is OwnableReentrantLock.

On Fri, 19 Feb 2010 09:42:55 +0100, Galder Zamarreno <galder(a)redhat.com&gt;
wrote:

...
 Hi guys,

 As a fyi, I've been looking at a random Infinispan 2LC testsuite failure  
 for a while and based on the logs below, I suspect there might be a bug  
 in PFER.

 In 03, Thread 5 tries to acquire a lock for PFER and successfully  
 acquires it in 12, 5ms after the acquisition attempt is logged. PFER is  
 0 lock timeout, so that, in theory, should have failed.

 I'm looking into this.

 In case you wanna check the logs as well, http://drop.io/ypmes4v  
 contains the logs.

 Cheers,

 ------- Forwarded message -------
 From: "Galder Zamarreno" <galder(a)redhat.com&gt;
 To: "Brian Stansberry" <brian.stansberry(a)redhat.com&gt;
 Cc:
 Subject: Re: Random ConcurrentWriteTest.testManyUsers() failure
 Date: Thu, 18 Feb 2010 12:28:42 +0100

 Hi again,

 I added more logging. It appears that PutFromLoadValidator is not 100%
 fail proof. This is what is happening in hibernate.log attached (I'll use
 the absolute times that are shorter):

 Note: @ signs are followed by system identity hashcodes to differentiate
 instances.

 01. 20780: Thread 5: PutFromLoadValidator: Checks whether put is valid  
 and
 it's valid since it's in pending puts
 02. 20781: Thread 5: Invokes PFER for Customer.contacts#3 with
 CollectionCacheEntry[]
 03. 20781: Thread 5: Attempting to lock

MarshalledValue(instance=org.hibernate.test.cache.infinispan.functional.Customer.contacts#3.
 04. 20784: Thread 3: PutFromLoadValidator: Record recent removal
 05. 20784: Thread 3: Within GlobalTransaction:<eq-35632>:7276, invoked
 with command RemoveCommand for Customer.contacts#3
 06. 20784: Thread 3: Attempting to lock

MarshalledValue(instance=org.hibernate.test.cache.infinispan.functional.Customer.contacts#3.
 07. 20784: Thread 3: Successfully acquired lock!
 08. 20784: Thread 3: Executes RemoveCommand on Customer.contacts#3
 09. 20785: Thread 3: At commit time, Customer.contacts#3 is updated with
 removed=true valid=false changed=true created=false ...
 10. 20785: Thread 3: Attempting to unlock Customer.contacts#3
 12. 20786: Thread 5: Successfully acquired lock!
 13. 20786: Thread 5: Executes PFER and updates Customer.contacts#3 with
 MarshalledValue(instance=CollectionCacheEntry[]; cachedHashCode=0;
 serialized=false)@c63847
 14. 20794: Thread 3: Within GlobalTransaction:<eq-35632>:7280, calls get
 on Customer.contacts#3
 15. 20794: Thread 3: Fetches from container and returns
 MarshalledValue(instance=CollectionCacheEntry[]; cachedHashCode=0;
 serialized=false)@c63847

 So, as you can see, the PFER and remove calls interleaved in such way  
 that
 putFromLoad() is allowed to get in while there's a collection  
 modification
 going on. Lock attempts/releases look ok. Now, can we record a  
 forthcoming
 removal as a result of a collection update earlier than where is  
 happening
 in the log? Maybe an option would be to instead of recording a removal of
 the collection cache, record when an entity is updated? We could do a
 pre-invalidation, similar to what happens with update timestamps after an
 entity is modified. This would reduce the chances of a situation like the
 one above happening.

 WDYT?

 On Thu, 18 Feb 2010 09:51:53 +0100, Galder Zamarreno <galder(a)redhat.com&gt;
 wrote:

> Hmmmm, looks very similar actually, if not exactly the same scenario.  
> If I look more closely in the logs, I see:
>
> ...
> 12:22:12,152: Thread 3 insets new Contact#73 but JDBC connection is not  
> flushed yet!!!
> 12:22:12,152: Thread 3 calls put on entity cache for Contact#73
> 12:22:12,164: Thread 3 calls remove on Customer.contacts#3 // so that  
> removes collection
> ...
> 12:22:12,189: Thread 5 updates Customer.contacts#3 to  
> CollectionCacheEntry[] (via a PFER call)
> ...
>
> Now, if read correctly, the validator should be stopping Thread 5 from  
> loading an empty collection, since the invalidation period is 10  
> seconds.
>
> I'll add more logging around this to see if I can see why the validator  
> is not working correctly cos I did port the validator and the tests  
> back in Sept/Oct
>
> Thanks for the pointer!
>
> On Wed, 17 Feb 2010 17:47:42 +0100, Brian Stansberry  
> <brian.stansberry(a)redhat.com&gt; wrote:
>
>> Sorry; still haven't been able to really dig in on this. Have you  
>> looked at  
>> http://opensource.atlassian.com/projects/hibernate/browse/HHH-3817 ?
>>
>> On 02/17/2010 06:45 AM, Galder Zamarreno wrote:
>>> Hi Brian,
>>>
>>> So, I've removed the assertions from the 2nd transaction, which is the
>>> one that checks that the contact exists after addition, or the contact
>>> does not exist after the removal. However, the test does still  
>>> randomly
>>> fail. Let me explain:
>>>
>>> 12:22:12,098: Thread 3 removed Contact#64 from Customer#3  
>>> successfully.
>>> Customer#3 has no Contacts
>>> 12:22:12,144: Thread 3 starts process to add new Contact to Customer#3
>>> 12:22:12,152: Thread 3 insets new Contact#73 but JDBC connection is  
>>> not
>>> flushed yet!!!
>>> 12:22:12,157: Thread 5 starts reading all Customer's 1st contact
>>> 12:22:12,160: Thread 5 checks the Customer.contacts#3 collection cache
>>> and finds nothing.
>>> 12:22:12,160: Thread 5 checks Contacts table in database and finds  
>>> nothing
>>> 12:22:12,165: Thread 3 flushes JDBC connection and Contact#73 is now
>>> seen by the rest
>>> 12:22:12,189: Thread 5 updates Customer.contacts#3 to
>>> CollectionCacheEntry[] (via a PFER call)
>>> 12:22:12,196: Thread 3 uses the Session to get Customer#3's contact  
>>> but
>>> gets nothing, since CollectionCacheEntry[] is returned by the  
>>> collection
>>> cache.
>>> 12:22:12,262: Thread 3 tries to remove Customer#3's contact, but since
>>> there's none, it fails.
>>>
>>> The culprit would appear to me to be line in 12:22:12,189. If it  
>>> doesn't
>>> find any contacts, it initialises to CollectionCacheEntry[], which is
>>> different to not storing anything at all. If nothing had been stored,
>>> 12:22:12,196 wouldn't have returned empty but instead would have gone  
>>> to
>>> the database and retrieved it from there. However, we, in the 2LC, we
>>> don't control what's stored, this comes from whatever Hibernate
passes
>>> us, via CollectionLoadContext. Not sure how this can be fixed...
>>> Anything that u can think of?
>>>
>>> Cheers,
>>>
>>> On Thu, 28 Jan 2010 18:34:24 +0100, Brian Stansberry
>>> <brian.stansberry(a)redhat.com&gt; wrote:
>>>
>>>> Have fun on holidays! Better ping me about this when you get back.
>>>>
>>>> On 01/28/2010 10:01 AM, Galder Zamarreno wrote:
>>>>>
>>>>>
>>>>> On 01/28/2010 03:46 PM, Brian Stansberry wrote:
>>>>>> Quick response; I'll look more later: if that scenario is
what is
>>>>>> happening, the test isn't correct. There's no reason
Thread 2 *in a
>>>>>> second tx* should have any expectations.
>>>>>
>>>>> Thanks for the quick reply. There's no rush at all on this.
Remember
>>>>> that I'll be away on PTO next week. Back on the 8th.
>>>>>
>>>>> FYI, the expectations are checked in the contactExists() method that
>>>>> gets called after removing.
>>>>>
>>>>>>
>>>>>> On 01/28/2010 06:46 AM, Galder Zamarreno wrote:
>>>>>>> Hi Brian,
>>>>>>>
>>>>>>> I've been looking at a random failure of
>>>>>>> ConcurrentWriteTest.testManyUsers() in the Infinispan cache
>>>>>>> provider and
>>>>>>> I wanted to pass my suspicions through you to see what you
think.
>>>>>>> There's no immediate rush on this.
>>>>>>>
>>>>>>> Some notes:
>>>>>>> Customer.contacts#2 refers to Customer 2's contacts
collection  
>>>>>>> cache.
>>>>>>>
>>>>>>> Sometimes the test can fail because of this:
>>>>>>>
>>>>>>> 1. 12:21:40,434: Thread 1 calls readEveryonesFirstContact
>>>>>>> 2. 12:21:40,434: Thread 1 starts tx1
>>>>>>> 3. 12:21:40,454: Thread 1 reads Contact#302 from Customer#2
from  
>>>>>>> db
>>>>>>> 4. 12:21:40,483: Thread 2 calls removeContact for Customer#2
>>>>>>> 5. 12:21:40,483: Thread 2 starts tx2
>>>>>>> 6. 12:21:40,485: Thread 2 deletes Contact#302 from Customer#2
in  
>>>>>>> db.
>>>>>>> 4. 12:21:40,491: Thread 2 acquires lock for updating
>>>>>>> Customer.contacts#2
>>>>>>> collection cache.
>>>>>>> 5. 12:21:40,492: Thread 2 deletes Contact#302 from  
>>>>>>> Customer.contacts#2
>>>>>>> collection cache.
>>>>>>> 6. 12:21:40,492: Thread 2 releases lock
>>>>>>> 7. 12:21:40,492: Thread 2 commits tx2
>>>>>>> 7. 12:21:40,493: Thread 1 acquires lock for updating
>>>>>>> Customer.contacts#2
>>>>>>> 8. 12:21:40,493: Thread 1 updates Customer.contacts#2
collection  
>>>>>>> cache
>>>>>>> with Contact#302.
>>>>>>> 9. 12:21:40,493: Thread 1 commits tx1.
>>>>>>> 10. 12:21:40,499: Thread 2 starts a tx and checks whether any

>>>>>>> Contacts
>>>>>>> exist for Customer#2.
>>>>>>> 10. 12:21:40,542: Thread 2 fails because it expects not to be
any
>>>>>>> Contacts but finds one in the cache that Thread 1 had put.
>>>>>>>
>>>>>>> It looks to me that this is a failure of the test itself. The
fact
>>>>>>> that
>>>>>>> the remove and check is done in different transactions allows

>>>>>>> another
>>>>>>> thread to come in the middle and load some data into the
cache.
>>>>>>>
>>>>>>> WDYT? Do you think I'm reading this correctly?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> p.s. I've attached the full log in case you wanna inspect
it.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

-- 
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009