[infinispan-issues] [JBoss JIRA] (ISPN-1995) Uneven request balancing after node restore

Wednesday, 9 May 2012

    [ https://issues.jboss.org/browse/ISPN-1995?page=com.atlassian.jira.plugin.... ]

Galder Zamarreño commented on ISPN-1995:
----------------------------------------

@Michal, thx for generating the logs. I see two main issues here:

1. Even after nodes have been killed, some servers might still send old views. This can
only happen if clients send a view id different from the one on the server. The only
sequence I can see producing this (and the logs seem to confirm it) is the following.
Client-333 sends a view id, say 1, to a server that has already been updated to view id 2,
so the server sends back the new view. Client-333 then goes to a different node where the
view has not yet been updated and sends view id 2. That server still has view id 1 and,
because it considers the ids 'different', sends back the old view. The client then cannot
connect to the servers in that old view that have been killed.

Removing crashed members happens in an async view change listener, which works on the
local cache only. It is done this way to avoid issues with the transport view id changing
before the cache has been updated; see ISPN-1654 for more info.

So this particular issue can be solved easily by having servers skip sending view changes
when clients have a 'newer' view id, where 'newer' means a strictly higher view id.
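
A minimal sketch of that check, in Java for illustration only. The class and method names
are hypothetical and are not taken from the Infinispan server code; the sketch only
contrasts the "any difference" comparison described above with the "strictly higher"
rule.

```java
// Hypothetical sketch; names are illustrative, not Infinispan's actual server code.
final class TopologyViewCheck {
    private TopologyViewCheck() {}

    /** Behaviour described above: any mismatch in view id triggers a topology response. */
    static boolean sendsTopologyOnMismatch(int clientViewId, int serverViewId) {
        return clientViewId != serverViewId;
    }

    /** Suggested behaviour: skip the response when the client's view id is strictly higher. */
    static boolean sendsTopologyIfServerNewer(int clientViewId, int serverViewId) {
        return clientViewId < serverViewId;
    }
}
```

With the scenario above (client at view id 2, server still at view id 1), the first check
would send back the old view, while the second would skip it.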

The "New topology..." info message could also do with including the view id, plus
which node the topology comes from.

2. The second problem is a stranger one. The following doesn't make much sense:

{code}02:56:03,620 INFO 
[org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Client-333)
ISPN004014: New server added(/192.168.87.92:11222), adding to the pool.
...
02:56:06,621 WARN  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport]
(Client-333) ISPN004008: Could not connect to server: /192.168.87.91:11222{code}

Assuming that 192.168.87.91 is node0002, and node0002 has been restarted, it would appear
that either:
a) the cluster formed and the server view on another node was updated before the endpoint
was fully started on .91, or
b) the view was updated around the cluster, but the server endpoint failed to start.

Looking at the server.log for .91, the problem is a):

{code}02:55:56,711 INFO  [stdout] (pool-2-thread-1) GMS: address=node0002/default,
cluster=default, physical address=192.168.87.91:55200
02:55:56,711 INFO  [stdout] (pool-2-thread-1)
-------------------------------------------------------------------
02:55:56,913 INFO  [com.jboss.datagrid] (MSC service thread 1-16) REST Server starting
02:55:56,914 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread
1-15) MemcachedServer starting
02:55:56,914 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread
1-5) HotRodServer starting 
...
02:56:10,146 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread
1-5) HotRodServer started in 13232ms
{code}

What's happening is that cache view installation, state transfer, etc. are delaying the
endpoint start. At first glance, moving the registration of the node to after the
transport has started could work, but what if a request comes in and the view id is not
yet set? There's a second problem here: the view id is set by a view listener, so if the
transport is started before the view listener is installed, we'd miss view ids. So this
needs a bit more thinking.
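
The missed view id concern can be illustrated with a toy, single-threaded model. Everything
in it (FakeTransport, ViewIdTracker, MissedViewDemo, and -1 as the "not set" marker) is a
hypothetical stand-in invented for this sketch, not Infinispan or JGroups API; it only
shows the ordering problem and does not propose a fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical stand-in for a transport that notifies listeners of new view ids.
final class FakeTransport {
    private final List<IntConsumer> listeners = new ArrayList<>();

    void addViewListener(IntConsumer listener) {
        listeners.add(listener);
    }

    // Installing a view only notifies the listeners registered at that moment.
    void installView(int viewId) {
        for (IntConsumer listener : listeners) {
            listener.accept(viewId);
        }
    }
}

// Hypothetical holder for the view id the endpoint would report to clients.
final class ViewIdTracker {
    private int viewId = -1; // -1 stands for "view id not set yet"

    void onViewChange(int newViewId) {
        viewId = newViewId;
    }

    int viewId() {
        return viewId;
    }
}

final class MissedViewDemo {
    private MissedViewDemo() {}

    /** Transport installs a view before the listener exists: the view id is missed. */
    static int listenerInstalledLate() {
        FakeTransport transport = new FakeTransport();
        ViewIdTracker tracker = new ViewIdTracker();
        transport.installView(3);
        transport.addViewListener(tracker::onViewChange);
        return tracker.viewId();
    }

    /** Listener is installed first, so the view id is recorded. */
    static int listenerInstalledFirst() {
        FakeTransport transport = new FakeTransport();
        ViewIdTracker tracker = new ViewIdTracker();
        transport.addViewListener(tracker::onViewChange);
        transport.installView(3);
        return tracker.viewId();
    }
}
```

In this model, listenerInstalledLate() leaves the tracker at the "not set" marker, which is
the situation a request arriving at that moment would run into.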

...
 Uneven request balancing after node restore
 -------------------------------------------

                 Key: ISPN-1995
                 URL: https://issues.jboss.org/browse/ISPN-1995
             Project: Infinispan
          Issue Type: Bug
          Components: Cache Server
    Affects Versions: 5.1.4.CR1
            Reporter: Tristan Tarrant
            Assignee: Galder Zamarreño
             Fix For: 5.1.x, 5.2.0.ALPHA1, 5.2.0.FINAL

 After a node crashes and rejoins the cluster, it does not receive client load at the same
level as the other nodes.
 This issue does not affect data integrity and distribution in the cluster. 
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
