As a FYI. We've detected that sometimes PFER lock acquisition succeed
after > 0ms, which is odd. We're looking into this.
------- Forwarded message -------
From: "Galder Zamarreno" <galder(a)redhat.com>
To: "Galder Zamarreno" <galder(a)redhat.com>
Cc:
Subject: Re: Fwd: Re: Random ConcurrentWriteTest.testManyUsers() failure
Date: Fri, 19 Feb 2010 09:59:46 +0100
As a fyi as well, OwnableReentrantStripedLockContainer is in use here
hence the lock impl is OwnableReentrantLock.
On Fri, 19 Feb 2010 09:42:55 +0100, Galder Zamarreno <galder(a)redhat.com>
wrote:
Hi guys,
As a fyi, I've been looking at a random Infinispan 2LC testsuite failure
for a while and based on the logs below, I suspect there might be a bug
in PFER.
In 03, Thread 5 tries to acquire a lock for PFER and successfully
acquires it in 12, 5ms after the acquisition attempt is logged. PFER is
0 lock timeout, so that, in theory, should have failed.
I'm looking into this.
In case you wanna check the logs as well,
http://drop.io/ypmes4v
contains the logs.
Cheers,
------- Forwarded message -------
From: "Galder Zamarreno" <galder(a)redhat.com>
To: "Brian Stansberry" <brian.stansberry(a)redhat.com>
Cc:
Subject: Re: Random ConcurrentWriteTest.testManyUsers() failure
Date: Thu, 18 Feb 2010 12:28:42 +0100
Hi again,
I added more logging. It appears that PutFromLoadValidator is not 100%
fail proof. This is what is happening in hibernate.log attached (I'll use
the absolute times that are shorter):
Note: @ signs are followed by system identity hashcodes to differentiate
instances.
01. 20780: Thread 5: PutFromLoadValidator: Checks whether put is valid
and
it's valid since it's in pending puts
02. 20781: Thread 5: Invokes PFER for Customer.contacts#3 with
CollectionCacheEntry[]
03. 20781: Thread 5: Attempting to lock
MarshalledValue(instance=org.hibernate.test.cache.infinispan.functional.Customer.contacts#3.
04. 20784: Thread 3: PutFromLoadValidator: Record recent removal
05. 20784: Thread 3: Within GlobalTransaction:<eq-35632>:7276, invoked
with command RemoveCommand for Customer.contacts#3
06. 20784: Thread 3: Attempting to lock
MarshalledValue(instance=org.hibernate.test.cache.infinispan.functional.Customer.contacts#3.
07. 20784: Thread 3: Successfully acquired lock!
08. 20784: Thread 3: Executes RemoveCommand on Customer.contacts#3
09. 20785: Thread 3: At commit time, Customer.contacts#3 is updated with
removed=true valid=false changed=true created=false ...
10. 20785: Thread 3: Attempting to unlock Customer.contacts#3
12. 20786: Thread 5: Successfully acquired lock!
13. 20786: Thread 5: Executes PFER and updates Customer.contacts#3 with
MarshalledValue(instance=CollectionCacheEntry[]; cachedHashCode=0;
serialized=false)@c63847
14. 20794: Thread 3: Within GlobalTransaction:<eq-35632>:7280, calls get
on Customer.contacts#3
15. 20794: Thread 3: Fetches from container and returns
MarshalledValue(instance=CollectionCacheEntry[]; cachedHashCode=0;
serialized=false)@c63847
So, as you can see, the PFER and remove calls interleaved in such way
that
putFromLoad() is allowed to get in while there's a collection
modification
going on. Lock attempts/releases look ok. Now, can we record a
forthcoming
removal as a result of a collection update earlier than where is
happening
in the log? Maybe an option would be to instead of recording a removal of
the collection cache, record when an entity is updated? We could do a
pre-invalidation, similar to what happens with update timestamps after an
entity is modified. This would reduce the chances of a situation like the
one above happening.
WDYT?
On Thu, 18 Feb 2010 09:51:53 +0100, Galder Zamarreno <galder(a)redhat.com>
wrote:
> Hmmmm, looks very similar actually, if not exactly the same scenario.
> If I look more closely in the logs, I see:
>
> ...
> 12:22:12,152: Thread 3 insets new Contact#73 but JDBC connection is not
> flushed yet!!!
> 12:22:12,152: Thread 3 calls put on entity cache for Contact#73
> 12:22:12,164: Thread 3 calls remove on Customer.contacts#3 // so that
> removes collection
> ...
> 12:22:12,189: Thread 5 updates Customer.contacts#3 to
> CollectionCacheEntry[] (via a PFER call)
> ...
>
> Now, if read correctly, the validator should be stopping Thread 5 from
> loading an empty collection, since the invalidation period is 10
> seconds.
>
> I'll add more logging around this to see if I can see why the validator
> is not working correctly cos I did port the validator and the tests
> back in Sept/Oct
>
> Thanks for the pointer!
>
> On Wed, 17 Feb 2010 17:47:42 +0100, Brian Stansberry
> <brian.stansberry(a)redhat.com> wrote:
>
>> Sorry; still haven't been able to really dig in on this. Have you
>> looked at
>>
http://opensource.atlassian.com/projects/hibernate/browse/HHH-3817 ?
>>
>> On 02/17/2010 06:45 AM, Galder Zamarreno wrote:
>>> Hi Brian,
>>>
>>> So, I've removed the assertions from the 2nd transaction, which is the
>>> one that checks that the contact exists after addition, or the contact
>>> does not exist after the removal. However, the test does still
>>> randomly
>>> fail. Let me explain:
>>>
>>> 12:22:12,098: Thread 3 removed Contact#64 from Customer#3
>>> successfully.
>>> Customer#3 has no Contacts
>>> 12:22:12,144: Thread 3 starts process to add new Contact to Customer#3
>>> 12:22:12,152: Thread 3 insets new Contact#73 but JDBC connection is
>>> not
>>> flushed yet!!!
>>> 12:22:12,157: Thread 5 starts reading all Customer's 1st contact
>>> 12:22:12,160: Thread 5 checks the Customer.contacts#3 collection cache
>>> and finds nothing.
>>> 12:22:12,160: Thread 5 checks Contacts table in database and finds
>>> nothing
>>> 12:22:12,165: Thread 3 flushes JDBC connection and Contact#73 is now
>>> seen by the rest
>>> 12:22:12,189: Thread 5 updates Customer.contacts#3 to
>>> CollectionCacheEntry[] (via a PFER call)
>>> 12:22:12,196: Thread 3 uses the Session to get Customer#3's contact
>>> but
>>> gets nothing, since CollectionCacheEntry[] is returned by the
>>> collection
>>> cache.
>>> 12:22:12,262: Thread 3 tries to remove Customer#3's contact, but since
>>> there's none, it fails.
>>>
>>> The culprit would appear to me to be line in 12:22:12,189. If it
>>> doesn't
>>> find any contacts, it initialises to CollectionCacheEntry[], which is
>>> different to not storing anything at all. If nothing had been stored,
>>> 12:22:12,196 wouldn't have returned empty but instead would have gone
>>> to
>>> the database and retrieved it from there. However, we, in the 2LC, we
>>> don't control what's stored, this comes from whatever Hibernate
passes
>>> us, via CollectionLoadContext. Not sure how this can be fixed...
>>> Anything that u can think of?
>>>
>>> Cheers,
>>>
>>> On Thu, 28 Jan 2010 18:34:24 +0100, Brian Stansberry
>>> <brian.stansberry(a)redhat.com> wrote:
>>>
>>>> Have fun on holidays! Better ping me about this when you get back.
>>>>
>>>> On 01/28/2010 10:01 AM, Galder Zamarreno wrote:
>>>>>
>>>>>
>>>>> On 01/28/2010 03:46 PM, Brian Stansberry wrote:
>>>>>> Quick response; I'll look more later: if that scenario is
what is
>>>>>> happening, the test isn't correct. There's no reason
Thread 2 *in a
>>>>>> second tx* should have any expectations.
>>>>>
>>>>> Thanks for the quick reply. There's no rush at all on this.
Remember
>>>>> that I'll be away on PTO next week. Back on the 8th.
>>>>>
>>>>> FYI, the expectations are checked in the contactExists() method that
>>>>> gets called after removing.
>>>>>
>>>>>>
>>>>>> On 01/28/2010 06:46 AM, Galder Zamarreno wrote:
>>>>>>> Hi Brian,
>>>>>>>
>>>>>>> I've been looking at a random failure of
>>>>>>> ConcurrentWriteTest.testManyUsers() in the Infinispan cache
>>>>>>> provider and
>>>>>>> I wanted to pass my suspicions through you to see what you
think.
>>>>>>> There's no immediate rush on this.
>>>>>>>
>>>>>>> Some notes:
>>>>>>> Customer.contacts#2 refers to Customer 2's contacts
collection
>>>>>>> cache.
>>>>>>>
>>>>>>> Sometimes the test can fail because of this:
>>>>>>>
>>>>>>> 1. 12:21:40,434: Thread 1 calls readEveryonesFirstContact
>>>>>>> 2. 12:21:40,434: Thread 1 starts tx1
>>>>>>> 3. 12:21:40,454: Thread 1 reads Contact#302 from Customer#2
from
>>>>>>> db
>>>>>>> 4. 12:21:40,483: Thread 2 calls removeContact for Customer#2
>>>>>>> 5. 12:21:40,483: Thread 2 starts tx2
>>>>>>> 6. 12:21:40,485: Thread 2 deletes Contact#302 from Customer#2
in
>>>>>>> db.
>>>>>>> 4. 12:21:40,491: Thread 2 acquires lock for updating
>>>>>>> Customer.contacts#2
>>>>>>> collection cache.
>>>>>>> 5. 12:21:40,492: Thread 2 deletes Contact#302 from
>>>>>>> Customer.contacts#2
>>>>>>> collection cache.
>>>>>>> 6. 12:21:40,492: Thread 2 releases lock
>>>>>>> 7. 12:21:40,492: Thread 2 commits tx2
>>>>>>> 7. 12:21:40,493: Thread 1 acquires lock for updating
>>>>>>> Customer.contacts#2
>>>>>>> 8. 12:21:40,493: Thread 1 updates Customer.contacts#2
collection
>>>>>>> cache
>>>>>>> with Contact#302.
>>>>>>> 9. 12:21:40,493: Thread 1 commits tx1.
>>>>>>> 10. 12:21:40,499: Thread 2 starts a tx and checks whether any
>>>>>>> Contacts
>>>>>>> exist for Customer#2.
>>>>>>> 10. 12:21:40,542: Thread 2 fails because it expects not to be
any
>>>>>>> Contacts but finds one in the cache that Thread 1 had put.
>>>>>>>
>>>>>>> It looks to me that this is a failure of the test itself. The
fact
>>>>>>> that
>>>>>>> the remove and check is done in different transactions allows
>>>>>>> another
>>>>>>> thread to come in the middle and load some data into the
cache.
>>>>>>>
>>>>>>> WDYT? Do you think I'm reading this correctly?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> p.s. I've attached the full log in case you wanna inspect
it.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
--
Galder ZamarreƱo
Sr. Software Engineer
Infinispan, JBoss Cache