[infinispan-dev] state transfer exceptions at REPL

Wed Feb 8 05:17:07 EST 2012

Sanne,

I was able to run LiveRunningTest as well after I removed
TestableJGroupsTransport from the Infinispan configuration, and I
disabled queueing in the SHARED_LOOPBACK OOB thread pool:

   <SHARED_LOOPBACK
         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="30"
         thread_pool.keep_alive_time="60000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="Discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="2"
         oob_thread_pool.max_threads="30"
         oob_thread_pool.keep_alive_time="60000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="Discard"
         />

I think the test fails with queueing enabled and a core thread pool
size of 2 because the coordinator sends a PREPARE_VIEW command and
several APPLY_STATE commands (at least one for each cache) at
approximately the same time. If two APPLY_STATE commands get to the
other node before the PREPARE_VIEW command, they will occupy both core
threads and be stuck waiting for state transfer to start. With
queueing enabled the pool only grows past its core size once the queue
is full, so PREPARE_VIEW just sits in the queue behind them.
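
To make the scenario concrete, here is a minimal sketch in plain Java.
It assumes the OOB pool behaves like a java.util.concurrent
ThreadPoolExecutor backed by a bounded LinkedBlockingQueue when
queueing is enabled and by a SynchronousQueue when it is disabled. The
class and method names are made up for the example, and a
CountDownLatch stands in for "state transfer has started".

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Infinispan or JGroups.
class OobPoolDemo {

    // Returns true if a PREPARE_VIEW task submitted after two blocked
    // APPLY_STATE tasks gets a thread within 500 ms.
    static boolean prepareViewRuns(boolean queueEnabled) {
        // Assumption: this mirrors how the transport picks its work queue.
        BlockingQueue<Runnable> queue = queueEnabled
                ? new LinkedBlockingQueue<Runnable>(100)
                : new SynchronousQueue<Runnable>();
        // Same sizes as the configuration above: min 2, max 30, keep-alive 60 s.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 30, 60000, TimeUnit.MILLISECONDS, queue,
                new ThreadPoolExecutor.DiscardPolicy());
        CountDownLatch stateTransferStarted = new CountDownLatch(1);

        // APPLY_STATE: blocks its thread until state transfer has started.
        Runnable applyState = () -> {
            try {
                stateTransferStarted.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        try {
            pool.execute(applyState);   // takes core thread 1
            pool.execute(applyState);   // takes core thread 2
            // PREPARE_VIEW: starts state transfer, if it ever gets a thread.
            pool.execute(stateTransferStarted::countDown);
            return stateTransferStarted.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();         // interrupts any stuck APPLY_STATE tasks
        }
    }

    public static void main(String[] args) {
        System.out.println("queue enabled:  " + prepareViewRuns(true));
        System.out.println("queue disabled: " + prepareViewRuns(false));
    }
}
```

With the queue enabled the third task is queued and never runs, so the
method returns false. With the queue disabled the executor cannot hand
the task to an idle thread, so it starts a third thread and the method
returns true.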

FD also sends messages using OOB, so if the OOB thread pool stops
processing messages, FD on other members will soon suspect the stuck
member and kick it out of the cluster.

For now I think increasing the number of available threads is the only
solution. For 5.2 I'm thinking of moving both the sending of state and
the handling of state to a separate thread, so that OOB threads won't
have to block waiting for the state transfer to start.
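
A rough sketch of that idea, again in plain Java: the OOB thread only
hands APPLY_STATE over to a dedicated executor and returns, so
PREPARE_VIEW can still get an OOB thread even with queueing enabled
and 2 core threads. Everything here (class name, the single-threaded
stateTransferExecutor, the latch) is hypothetical and only illustrates
the hand-off, not how 5.2 would actually implement it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of Infinispan or JGroups.
class StateTransferOffloadDemo {

    // Returns how many APPLY_STATE commands were applied.
    static int appliedStates() {
        // OOB pool with queueing enabled and 2 core threads, as in the failing setup.
        ExecutorService oobPool = new ThreadPoolExecutor(
                2, 30, 60000, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(100));
        // Assumed dedicated thread for state handling.
        ExecutorService stateTransferExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch stateTransferStarted = new CountDownLatch(1);
        AtomicInteger applied = new AtomicInteger();

        // APPLY_STATE handler: the OOB thread only schedules the work and returns.
        Runnable applyState = () -> stateTransferExecutor.execute(() -> {
            try {
                stateTransferStarted.await();   // blocks the dedicated thread only
                applied.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        oobPool.execute(applyState);
        oobPool.execute(applyState);
        oobPool.execute(stateTransferStarted::countDown);   // PREPARE_VIEW

        try {
            // Let the OOB tasks finish first so both hand-offs have happened.
            oobPool.shutdown();
            oobPool.awaitTermination(5, TimeUnit.SECONDS);
            stateTransferExecutor.shutdown();
            stateTransferExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return applied.get();
    }

    public static void main(String[] args) {
        System.out.println("applied states: " + appliedStates());
    }
}
```

Here both APPLY_STATE commands end up applied, because the OOB threads
are free again by the time PREPARE_VIEW is dequeued.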

Cheers
Dan

On Wed, Feb 8, 2012 at 9:59 AM, Dan Berindei <dan.berindei at gmail.com> wrote:
> Hi Sanne
>
> I got the sources and even TwoNodesTest hangs for me every time.
>
> I think the problem is that your TestableJGroupsTransport is trying to
> modify the cluster name during startup - which is no longer supported.
>
> I have also created https://issues.jboss.org/browse/ISPN-1852 to fix
> startup so that after an error like this another getCache() call
> doesn't block forever. Ideally it should report the same error,
> whether we attempt to start the component again or we save the
> exception somewhere.
>
> Cheers
> Dan
>
>
> On Tue, Feb 7, 2012 at 6:15 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
>> Dan,
>> you can easily checkout Hibernate Search, it's a Maven project and you
>> should be able to set it up in your IDE quickly.
>>
>> git clone git://github.com/Sanne/hibernate-search.git
>> git checkout componentsUpdates
>>
>> Then the failing test is in the module "hibernate-search-infinispan",
>> which is just a couple of classes.
>>
>> Sanne
>>
>>
>>
>> On 7 February 2012 16:10, Dan Berindei <dan.berindei at gmail.com> wrote:
>>> Rado, is there a specific test in the AS7 test suite that is failing?
>>> Is it only in Jenkins or on your machine as well?
>>>
>>> I only know about https://issues.jboss.org/browse/ISPN-1806, but Paul
>>> said that he doesn't see it any more in CI runs (he never managed to
>>> reproduce it on his machine).
>>>
>>> Cheers
>>> Dan
>>>
>>>
>>> On Tue, Feb 7, 2012 at 3:13 PM, Radoslav Husar <rhusar at redhat.com> wrote:
>>>> I am also seeing this/similar exception in AS7 during session
>>>> replication even with 5.1.1.FINAL :-(
>>>>
>>>> On 02/07/2012 01:54 PM, Dan Berindei wrote:
>>>>> Sanne, this sounds very similar to
>>>>> https://issues.jboss.org/browse/ISPN-1814, but I thought I had fixed
>>>>> that for 5.1.1.FINAL.
>>>>>
>>>>> I see CacheViewsManagerImpl is trying to install a view with 6 nodes,
>>>>> should there be 6 nodes in the cluster or should there be fewer nodes?
>>>>> Do you have DEBUG logs for org.infinispan and org.jgroups?
>>>>>
>>>>> Cheers
>>>>> Dan
>>>>>
>>>>>
>>>>> On Tue, Feb 7, 2012 at 12:58 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
>>>>>> Can anyone explain this error?
>>>>>>
>>>>>> I'm updating Hibernate Search, and have a simple test which does the
>>>>>> following in a loop:
>>>>>>
>>>>>> - write to shared index
>>>>>> - add a node / remove a node
>>>>>> - wait for joins
>>>>>> - verify index state
>>>>>>
>>>>>> This is expected to work, as it already did with all previous
>>>>>> Infinispan versions.
>>>>>>
>>>>>> Using Infinispan 5.1.1.FINAL and JGroups 3.0.5.Final.
>>>>>>
>>>>>> 2012-02-07 10:42:38,668 WARN  [CacheViewControlCommand]
>>>>>> (OOB-4,sanne-20017) ISPN000071: Caught exception when handling command
>>>>>> CacheViewControlCommand{cache=LuceneIndexesMetadata,
>>>>>> type=PREPARE_VIEW, sender=sanne-3158, newViewId=8,
>>>>>> newMembers=[sanne-3158, sanne-63971, sanne-20017, sanne-2794,
>>>>>> sanne-25511, sanne-30075], oldViewId=7, oldMembers=[sanne-3158,
>>>>>> sanne-63971, sanne-20017, sanne-2794, sanne-25511]}
>>>>>> java.util.concurrent.ExecutionException:
>>>>>> org.infinispan.remoting.transport.jgroups.SuspectException: One or
>>>>>> more nodes have left the cluster while replicating command
>>>>>> StateTransferControlCommand{cache=LuceneIndexesMetadata,
>>>>>> type=APPLY_STATE, sender=sanne-20017, viewId=8, state=4}
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev