[infinispan-issues] [JBoss JIRA] (ISPN-1995) Uneven request balancing after node restore

RH Bugzilla Integration (JIRA) jira-events at lists.jboss.org
Thu May 10 04:59:18 EDT 2012


    [ https://issues.jboss.org/browse/ISPN-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12691720#comment-12691720 ] 

RH Bugzilla Integration commented on ISPN-1995:
-----------------------------------------------

JBoss JIRA Server <jira-update at redhat.com> made a comment on [bug 809631|https://bugzilla.redhat.com/show_bug.cgi?id=809631]

Galder Zamarreño <galder.zamarreno at redhat.com> made a comment on jira ISPN-1995

@Michal, thanks for generating the logs. I see two main issues here:

1. Even after nodes have been killed, some servers might still send old views. This can only happen when a client sends a view id different from the one in the server, and the logs seem to confirm the following sequence: Client-333 sends a view id, let's say 1, and the server has already been updated to view id 2, so it sends back the new view. Client-333 then goes to a different node where the view has not yet been updated, and sends view id 2. That server still has view id 1 and, since it considers the id 'different', it sends back the old view, which still points the client at the killed servers it cannot connect to.
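In other words, the bouncing comes from the server treating any mismatching view id as stale, regardless of direction. A minimal sketch of that kind of check, with made-up names (this is not the actual decoder code):

{code}
// Roughly the current behaviour: any difference triggers a topology
// response, even when the client's view id is newer than the server's,
// so a lagging server happily pushes its old view back to the client.
if (clientViewId != serverViewId) {
   writeTopologyUpdate(response, serverTopology);
}
{code}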

Removing crashed members happens in an async view change listener, which works on the local cache only. It is done this way to avoid issues with the transport view id changing before the cache has been updated; see ISPN-1654 for more info.
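A minimal sketch of what such an async listener could look like, assuming a hypothetical address cache mapping each member to its endpoint address (the class and cache here are illustrative, not the actual server code):

{code}
import java.util.HashSet;
import java.util.Set;

import org.infinispan.Cache;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;
import org.infinispan.remoting.transport.Address;

// sync = false makes the callback run asynchronously, so the transport
// view id can change without waiting for the cache update to finish
@Listener(sync = false)
public class CrashedMemberRemover {

   private final Cache<Address, String> addressCache; // hypothetical topology cache

   public CrashedMemberRemover(Cache<Address, String> addressCache) {
      this.addressCache = addressCache;
   }

   @ViewChanged
   public void onViewChange(ViewChangedEvent e) {
      // Members present in the old view but missing from the new one
      // have crashed or left; drop them from the local topology cache.
      Set<Address> gone = new HashSet<Address>(e.getOldMembers());
      gone.removeAll(e.getNewMembers());
      for (Address crashed : gone)
         addressCache.remove(crashed);
   }
}
{code}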

So this particular issue can easily be solved by having servers skip sending a view change when the client has a 'newer' view id, where newer means a strictly higher view id.
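A rough sketch of that check (again with made-up names, the real decoder code will differ):

{code}
// Only push a new topology when the server genuinely knows more than the
// client: if the client's view id is equal or strictly higher, it has
// already seen a topology at least as recent as ours, so stay quiet
// instead of bouncing it back to a stale view.
private boolean shouldSendTopologyUpdate(int clientViewId, int serverViewId) {
   return serverViewId > clientViewId;
}
{code}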

"New topology..." info message could also with adding what the view id is, plus which node this topology comes from.

2. The second problem is a weirder one. The following doesn't make much sense:

{code}02:56:03,620 INFO  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Client-333) ISPN004014: New server added(/192.168.87.92:11222), adding to the pool.
...
02:56:06,621 WARN  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Client-333) ISPN004008: Could not connect to server: /192.168.87.91:11222{code}

Assuming that 192.168.87.91 is node0002, and node0002 has been restarted, it would appear that either:
a) the cluster formed and the server view in another node was updated before the endpoint was fully started in .91,
b) or the view was updated around the cluster, but the server endpoint failed to start.

Looking at the server.log for .91, the problem is a):

{code}02:55:56,711 INFO  [stdout] (pool-2-thread-1) GMS: address=node0002/default, cluster=default, physical address=192.168.87.91:55200
02:55:56,711 INFO  [stdout] (pool-2-thread-1) -------------------------------------------------------------------
02:55:56,913 INFO  [com.jboss.datagrid] (MSC service thread 1-16) REST Server starting
02:55:56,914 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-15) MemcachedServer starting
02:55:56,914 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-5) HotRodServer starting 
...
02:56:10,146 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-5) HotRodServer started in 13232ms
{code}

What's happening is that cache view installation, state transfer, etc. are delaying the endpoint start. At first glance, moving the registration of the node to after the transport has started could work, but what if a request comes in while the view id is not yet set? There's a second problem here too: the view id is set by a view listener, so if the transport is started before the view listener is installed, we'd miss view ids. So this needs a bit more thinking.
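One possible ordering, sketched with hypothetical names (EndpointBootstrap, registerInTopologyCache and friends are illustrative, not the actual server bootstrap code):

{code}
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

public class EndpointBootstrap {

   private volatile Integer currentViewId; // null until the first view arrives

   public void start(EmbeddedCacheManager cacheManager) {
      // 1. Install the view listener *before* the transport starts, so no
      //    view id can slip through in between.
      cacheManager.addListener(new ViewIdListener());

      // 2. Starting the cache manager triggers view installation and state
      //    transfer, which is what delays the endpoint in the logs above.
      cacheManager.start();

      // 3. Only now advertise this node in the topology, so clients do not
      //    route requests here before the endpoint can actually serve them.
      registerInTopologyCache();
   }

   @Listener
   public class ViewIdListener {
      @ViewChanged
      public void onViewChange(ViewChangedEvent e) {
         currentViewId = e.getViewId();
      }
   }

   // Guard for the remaining window: if a request arrives before any view
   // id is known, skip the topology header rather than send a bogus one.
   boolean canSendTopology() {
      return currentViewId != null;
   }

   private void registerInTopologyCache() {
      // placeholder for writing this node's endpoint address into the
      // replicated topology cache
   }
}
{code}

Even with this ordering there is still a window between steps 2 and 3 where a request can arrive with no view id set, hence the guard, and hence the 'needs more thinking'.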
                
> Uneven request balancing after node restore
> -------------------------------------------
>
>                 Key: ISPN-1995
>                 URL: https://issues.jboss.org/browse/ISPN-1995
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Cache Server
>    Affects Versions: 5.1.4.CR1
>            Reporter: Tristan Tarrant
>            Assignee: Galder Zamarreño
>             Fix For: 5.1.x, 5.2.0.ALPHA1, 5.2.0.FINAL
>
>
> After a node crashes and rejoins the cluster, it does not receive client load at the same level as the other nodes.
> This issue does not affect data integrity and distribution in the cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       


