[infinispan-issues] [JBoss JIRA] (ISPN-1995) Uneven request balancing after node restore

Fri May 18 08:06:18 EDT 2012

    [ https://issues.jboss.org/browse/ISPN-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693950#comment-12693950 ] 

Galder Zamarreño commented on ISPN-1995:
----------------------------------------

The cache is still filling up after the view is set; that is what's actually happening here. In detail:

a) Before ISPN-1995, the order of startup was:
1. Add node to topology
2. Start cache manager + JGroups transport
3. Start server endpoint.

The problem with this was that other nodes could find nodes in the topology that had not yet started their endpoint, resulting in connection refused exceptions.

b) The fix I implemented changed this around to be:
1. Start cache manager + JGroups transport
2. Start server endpoint.
3. Add node to topology

The problem now is that the JGroups transport starts before the node is added to the topology. If all nodes have already formed the cluster, the topology id (which is updated when a node is added to the cache) can be updated before the other nodes have added themselves to the topology cache, resulting in missing nodes.
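
To make the race concrete, here is a minimal, self-contained model of it. It is only a sketch: the class, its fields and methods, and the host:port strings are hypothetical and are not Infinispan or JGroups APIs. It assumes the id is taken from the view as soon as the view is installed, regardless of what the topology cache contains at that moment.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the race; none of these names are Infinispan APIs.
public class StartupOrderRace {
    // Stand-in for the topology (address) cache: node name -> endpoint address.
    final Map<String, String> topologyCache = new LinkedHashMap<>();
    // Id that is bumped when a new cluster view is installed.
    int topologyId = 0;
    // Snapshot of the topology cache taken at the moment the id was bumped.
    List<String> membersSeenAtUpdate = List.of();

    // Called when the transport installs a new cluster view.
    void onViewChange(int viewId) {
        topologyId = viewId;
        membersSeenAtUpdate = List.copyOf(topologyCache.keySet());
    }

    // Step 3 of the new startup order: the node registers its endpoint.
    void addNodeToTopology(String node, String endpoint) {
        topologyCache.put(node, endpoint);
    }

    public static void main(String[] args) {
        StartupOrderRace cluster = new StartupOrderRace();
        cluster.addNodeToTopology("A", "hostA:11222");
        cluster.addNodeToTopology("B", "hostB:11222");
        // The transports of C and D have joined, so the view already has
        // 4 members, but C and D have not reached step 3 yet.
        cluster.onViewChange(2);
        cluster.addNodeToTopology("C", "hostC:11222");
        cluster.addNodeToTopology("D", "hostD:11222");
        System.out.println("id " + cluster.topologyId
                + " was set with members " + cluster.membersSeenAtUpdate);
    }
}
```

In this model the id is set to 2 while only A and B are registered, which is the "missing nodes" situation described above.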

I think the order we have now is fine. I can see two ways to fix this:
1. Make ViewIdUpdater only update the view id when the view matches all the nodes in the cache. In other words, in the above example, the view id would only be set to 2 once we've verified that the address cache contains all 4 nodes in the JGroups view. The problem here is that the verification would need to be done in ViewIdUpdater, and I'm not sure the cache will have all the entries at that point (annoyances of cache listeners).
2. Instead of relying on the transport view id, use the cache view id for the Hot Rod cache. Need to check with Dan, but so far this is my preferred option.
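
For reference, the check in option 1 could look roughly like the sketch below. This is not the actual ViewIdUpdater code: the class name, method names and the use of plain string sets for the view members and address cache keys are assumptions made for illustration. It also does not solve the listener concern mentioned in option 1; the caller would have to retry the check whenever the address cache changes.

```java
import java.util.Set;

// Hypothetical sketch of option 1; not the actual ViewIdUpdater code.
public class GuardedViewIdUpdater {
    private int publishedViewId = 0;

    // Sets the new view id only if every member of the view is already
    // present in the address cache. Returns true when the id was updated.
    public boolean onViewChange(int newViewId,
                                Set<String> viewMembers,
                                Set<String> addressCacheKeys) {
        if (!addressCacheKeys.containsAll(viewMembers)) {
            return false; // some node has not registered its endpoint yet
        }
        publishedViewId = newViewId;
        return true;
    }

    public int publishedViewId() {
        return publishedViewId;
    }

    public static void main(String[] args) {
        GuardedViewIdUpdater updater = new GuardedViewIdUpdater();
        Set<String> view = Set.of("A", "B", "C", "D");
        // Only two of the four view members are in the address cache: skipped.
        System.out.println(updater.onViewChange(2, view, Set.of("A", "B")));
        // All four members are registered: the view id is set to 2.
        System.out.println(updater.onViewChange(2, view, Set.of("A", "B", "C", "D")));
        System.out.println(updater.publishedViewId());
    }
}
```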

> Uneven request balancing after node restore
> -------------------------------------------
>
>                 Key: ISPN-1995
>                 URL: https://issues.jboss.org/browse/ISPN-1995
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Cache Server
>    Affects Versions: 5.1.4.CR1
>            Reporter: Tristan Tarrant
>            Assignee: Galder Zamarreño
>             Fix For: 5.1.5.CR1, 5.2.0.ALPHA1, 5.2.0.FINAL
>
>
> After a node crashes and rejoins the cluster, it does not receive client load at the same level as the other nodes.
> This issue does not affect data integrity and distribution in the cluster.
