[jboss-jira] [JBoss JIRA] Commented: (JGRP-1171) Address cache in TP protocol never removes inactive members, which causes enormous delays sending multicast messages using TCP

Bela Ban (JIRA) jira-events at lists.jboss.org
Wed Mar 31 07:22:37 EDT 2010


    [ https://jira.jboss.org/jira/browse/JGRP-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12523058#action_12523058 ] 

Bela Ban commented on JGRP-1171:
--------------------------------

See my comments on JGRP-1147 for more details.

> Address cache in TP protocol never removes inactive members, which causes enormous delays sending multicast messages using TCP
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: JGRP-1171
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1171
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9
>            Reporter: Fedor Cherepanov
>            Assignee: Bela Ban
>             Fix For: 2.10
>
>
> org.jgroups.blocks.LazyRemovalCache, used in org.jgroups.protocols.TP, removes marked cache items only when its size exceeds max_elements, which is set to 20 in TP.
> I'm using JGroups (tried 2.8 and 2.9) with JBoss Cache 3.2.1 over the TCP protocol. I investigated why replication time increases by about a second whenever a node leaves the cluster (it is around 50ms initially).
> Here's what I found:
> When a node leaves the cluster and the view changes:
> 1. TP calls logical_addr_cache.retainAll(members);
> 2. LazyRemovalCache.retainAll updates the map, setting the removable flag to true on those members that are not in the view.
> 3. LazyRemovalCache.checkMaxSizeExceeded NEVER removes them from the cache because its size is always less than max_elements, which is 20 (see the first sketch below).
> Then, when a multicast message is sent:
> 1. BasicTCP.sendMulticast calls TP.sendToAllPhysicalAddresses
> 2. TP.sendToAllPhysicalAddresses iterates through all values in logical_addr_cache, calling sendUnicast for each.
> 3. logical_addr_cache still contains all the nodes, including those killed, and the transport tries to connect to each of them, which causes enormous delays (see the second sketch below).
> As a result, replication time increases by one connection timeout for every node removed from the cluster.
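
To make the first list concrete, here is a minimal sketch of the cache behavior it describes. The names are simplified and hypothetical, not the actual JGroups source: retainAll() only marks entries of departed members as removable, and eviction is gated on the size exceeding max_elements, so a small cluster never purges anything.

import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the reported behavior, not the real LazyRemovalCache.
public class LazyCacheSketch<K, V> {

    private static class Entry<V> {
        final V val;
        volatile boolean removable; // flipped when the member leaves the view
        Entry(V val) { this.val = val; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<K, Entry<V>>();
    private final int max_elements; // 20 by default in TP, per the report

    public LazyCacheSketch(int max_elements) { this.max_elements = max_elements; }

    public void add(K key, V val) { map.put(key, new Entry<V>(val)); }

    // Called on a view change: members no longer in the view are only
    // *marked* removable here, never actually removed.
    public void retainAll(Set<K> current_members) {
        for (Map.Entry<K, Entry<V>> e : map.entrySet())
            if (!current_members.contains(e.getKey()))
                e.getValue().removable = true;
        checkMaxSizeExceeded();
    }

    // Eviction only happens above max_elements. A five-node cluster with
    // three dead entries has size 5, well under 20, so nothing is purged.
    private void checkMaxSizeExceeded() {
        if (map.size() <= max_elements)
            return; // never true for a small cluster: stale entries stay forever
        for (Iterator<Map.Entry<K, Entry<V>>> it = map.entrySet().iterator(); it.hasNext(); )
            if (it.next().getValue().removable)
                it.remove();
    }
}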
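
The second list is the send path; here is an illustrative sketch of its cost, assuming a 1000 ms connect timeout and plain sockets (the real transport pools connections, but the per-dead-node cost is the same): a cluster-wide send fans out as one unicast per cached physical address, and every stale address blocks for a full connect timeout.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Collection;

// Illustrative only: conceptually BasicTCP.sendMulticast ->
// TP.sendToAllPhysicalAddresses, one unicast per cached address.
public class SendPathSketch {

    static final int CONNECT_TIMEOUT_MS = 1000; // assumed timeout, for illustration

    static void sendToAllPhysicalAddresses(Collection<InetSocketAddress> cached_addrs,
                                           byte[] payload) {
        for (InetSocketAddress addr : cached_addrs) {
            long start = System.currentTimeMillis();
            Socket sock = new Socket();
            try {
                sock.connect(addr, CONNECT_TIMEOUT_MS); // blocks up to 1s for a dead node
                sock.getOutputStream().write(payload);
            } catch (IOException e) {
                System.out.println("unreachable " + addr + ", wasted "
                        + (System.currentTimeMillis() - start) + " ms");
            } finally {
                try { sock.close(); } catch (IOException ignored) { }
            }
        }
    }
}

With three departed nodes still cached, every replication would block for roughly 3 x 1000 ms on top of the normal 50 ms, which matches the roughly one second of extra delay per removed node reported above.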

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
