[infinispan-issues] [JBoss JIRA] (ISPN-2986) Intermittent failure to start new nodes during heavy write load
Dan Berindei (JIRA)
jira-events at lists.jboss.org
Mon Apr 8 09:00:43 EDT 2013
[ https://issues.jboss.org/browse/ISPN-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dan Berindei resolved ISPN-2986.
--------------------------------
Resolution: Partially Completed
Increasing the number of OOB threads fixed the issue. ISPN-2849 will improve the situation with a lower number of OOB threads.
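For reference, the OOB pool is configured on the JGroups transport protocol. A minimal sketch of the kind of change that worked around the problem, assuming the stock JGroups 3.x TCP transport attribute names (the values below are illustrative, not taken from the attached test-jgroups.xml):

    <TCP bind_port="7800"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="20"
         oob_thread_pool.max_threads="300"
         oob_thread_pool.keep_alive_time="60000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.rejection_policy="discard"/>

With a larger pool it is less likely that every OOB thread ends up blocked handling writes while the joining node is still waiting for state transfer and topology messages.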
> Intermittent failure to start new nodes during heavy write load
> ---------------------------------------------------------------
>
> Key: ISPN-2986
> URL: https://issues.jboss.org/browse/ISPN-2986
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Cache, Server
> Affects Versions: 5.2.5.Final
> Environment: 4 servers running Linux 2.6.32-220.13.1.el6.x86_64 with 2x quad-core 2.4 GHz CPUs
> Gigabit ethernet, same switch.
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> Reporter: Marc Bridner
> Assignee: Tristan Tarrant
> Attachments: logs.zip, test-infinispan.xml, test-jgroups.xml, test.infinispan.zip
>
>
> When the cluster is under heavy write load from a HotRod client with 64+ threads and a new node is started, the new node will sometimes fail to start, eventually reporting state transfer timeouts and finally terminating. During the time it takes to time out (~10 minutes), the HotRod client is completely blocked.
> Setup is as follows:
> 3 servers, 1 client
> * dl380x2385, 10.64.106.21, client
> * dl380x2384, 10.64.106.20, first node
> * dl380x2383, 10.64.106.19, second node
> * dl380x2382, 10.64.106.18, third node
> 2 caches, initial state transfer off, transactions on; the config is attached (the relevant settings are sketched below).
> Small app that triggers the problem is also attached.
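> Roughly, the attached test-infinispan.xml boils down to something like the following sketch (assuming the 5.2 library-style schema; the cache names here are placeholders, not the real ones):
>
>     <infinispan xmlns="urn:infinispan:config:5.2">
>        <global>
>           <transport clusterName="test-cluster"/>
>        </global>
>        <namedCache name="cacheA">
>           <clustering mode="distribution">
>              <sync/>
>              <stateTransfer fetchInMemoryState="false"/>
>           </clustering>
>           <transaction transactionMode="TRANSACTIONAL"/>
>        </namedCache>
>        <namedCache name="cacheB">
>           <clustering mode="distribution">
>              <sync/>
>              <stateTransfer fetchInMemoryState="false"/>
>           </clustering>
>           <transaction transactionMode="TRANSACTIONAL"/>
>        </namedCache>
>     </infinispan>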
> Steps to reproduce:
> 1. Start first node
> 2. Start client, wait for counter to reach 50000 (in client)
> 3. Start second node. 10% chance it'll fail.
> 4. Wait for counter to reach 100000 in client.
> 5. Start third node. 50% chance it'll fail.
> If it doesn't fail, terminate everything and start over.
> I realize this may be hard to reproduce, so if any more logs or tests are needed, let me know.
> I've been unable to reproduce it on a single physical machine, and it only occurs when using more than 64 client threads. Changing the ratio of writes between the caches also seems to make it not occur. I was unable to reproduce it with TRACE log level on (too slow), but if you can specify some packages that you want traces of, that might work.
> Turning transactions off makes it worse: a 90% chance of failure on the second node. Funnily enough, disabling the concurrent GC lowers the failure rate to 10% on the third node. I'm guessing there's a race condition somewhere; it may be similar to ISPN-2982.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira