[jboss-jira] [JBoss JIRA] (JGRP-1957) S3_PING: Nodes never removed from .list file

Tue Apr 12 17:00:01 EDT 2016

    [ https://issues.jboss.org/browse/JGRP-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190836#comment-13190836 ] 

Mitchell Ackerman commented on JGRP-1957:
-----------------------------------------

Unfortunately, I seem to be running into the same or a similar issue, even though I've updated to JGroups 3.6.8 and am using the settings you suggested in this (and other) posts.

I'm running in AWS with S3_PING, JDK 1.8.0_66, JGroups 3.6.8, and Tomcat 8.0.28.

After terminating servers, mostly non-coordinators, I'm left with an S3 bucket full of zombie entries, even though there are only 2 active members. Below is the .list file after the system has been stable for over an hour, followed by my JGroups config file. Any suggestions?

thanks, Mitchell

ip-10-89-1-26-8729 	72597f74-8a10-04fb-b397-22a3ed35da84 	10.89.1.26:7800 	F
ip-10-89-0-18-38996 	a5325932-e9cd-b281-b367-e2d86845aa75 	10.89.0.18:7800 	F
ip-10-89-1-62-4868 	ef73921a-2265-50a8-95d4-ebb8cae96944 	10.89.1.62:7800 	T
ip-10-89-1-27-11915 	5a0b4a26-b542-56f2-801a-420b5d7dbf34 	10.89.1.27:7800 	F
ip-10-89-1-19-2542 	c30c294d-69b0-b6ca-7010-bf89d1eb8f6f 	10.89.1.19:7800 	F
ip-10-89-0-62-56914 	fa2262c3-9097-7101-b225-24d8a52d905e 	10.89.0.62:7800 	F
ip-10-89-0-28-32680 	5d03124f-b061-becb-d793-6067bf0d7945 	10.89.0.28:7800 	F
ip-10-89-1-26-51248 	07cc18aa-381b-fb5d-0ad6-0612f7a5e9bb 	10.89.1.26:7800 	F
ip-10-89-1-27-39755 	1f9be940-2228-2181-ef80-4a83d319a2b3 	10.89.1.27:7800 	F
ip-10-89-0-28-41919 	4ab543f9-712e-645d-2f20-05304c98a23b 	10.89.0.28:7800 	F
ip-10-89-1-27-10428 	d5b0cb38-75e0-b3e1-c053-66b053b0fb05 	10.89.1.27:7800 	F
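
Rows like the ones above can be inspected with a small helper. The following is only a sketch: the class and method names are hypothetical (they are not part of JGroups), and it assumes each row has four whitespace-separated columns (logical name, UUID, physical address, and a trailing T/F that is taken here to be the coordinator flag).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for inspecting .list rows; not part of JGroups. */
final class ListFileEntries {

    /** One row of the file, using the four-column layout assumed above. */
    static final class Entry {
        final String logicalName;
        final String uuid;
        final String physicalAddress;
        final boolean coordinator; // assumption: trailing T/F is the coordinator flag

        Entry(String logicalName, String uuid, String physicalAddress, boolean coordinator) {
            this.logicalName = logicalName;
            this.uuid = uuid;
            this.physicalAddress = physicalAddress;
            this.coordinator = coordinator;
        }
    }

    /** Parses the file contents, skipping blank rows and rows without exactly four columns. */
    static List<Entry> parse(String contents) {
        List<Entry> entries = new ArrayList<>();
        for (String line : contents.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 4) {
                continue; // does not match the assumed layout
            }
            entries.add(new Entry(cols[0], cols[1], cols[2], "T".equals(cols[3])));
        }
        return entries;
    }

    /** Counts rows whose trailing flag is T. */
    static int countCoordinators(List<Entry> entries) {
        int n = 0;
        for (Entry e : entries) {
            if (e.coordinator) {
                n++;
            }
        }
        return n;
    }

    /** Returns physical addresses that appear in more than one row. */
    static Set<String> duplicatedAddresses(List<Entry> entries) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicated = new HashSet<>();
        for (Entry e : entries) {
            if (!seen.add(e.physicalAddress)) {
                duplicated.add(e.physicalAddress);
            }
        }
        return duplicated;
    }

    public static void main(String[] args) {
        // Three rows copied from the listing above.
        String sample =
              "ip-10-89-1-26-8729 \t72597f74-8a10-04fb-b397-22a3ed35da84 \t10.89.1.26:7800 \tF\n"
            + "ip-10-89-1-62-4868 \tef73921a-2265-50a8-95d4-ebb8cae96944 \t10.89.1.62:7800 \tT\n"
            + "ip-10-89-1-26-51248 \t07cc18aa-381b-fb5d-0ad6-0612f7a5e9bb \t10.89.1.26:7800 \tF\n";
        List<Entry> rows = parse(sample);
        System.out.println("rows: " + rows.size());
        System.out.println("coordinators: " + countCoordinators(rows));
        System.out.println("duplicated addresses: " + duplicatedAddresses(rows));
    }
}
```

A physical address that shows up in several rows with different UUIDs presumably belongs to a host whose node was restarted, so at most one of those rows can describe a live member. In the listing above this applies to 10.89.1.26:7800, 10.89.1.27:7800, and 10.89.0.28:7800.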

My JGroups config file is:

<?xml version="1.0" encoding="UTF-8"?>
<config>
   <TCP
        bind_port="7800"
        port_range="30"
        recv_buf_size="20000000"
        send_buf_size="1000000"
        max_bundle_size="64000"
        max_bundle_timeout="1000"
        sock_conn_timeout="2000"
        enable_diagnostics="false"

        timer_type="new"
        timer.min_threads="4"
        timer.max_threads="10"
        timer.keep_alive_time="3000"
        timer.queue_max_size="1000"
        timer.wheel_size="200"
        timer.tick_time="50"

        thread_pool.enabled="true"
        thread_pool.min_threads="2"
        thread_pool.max_threads="100"
        thread_pool.keep_alive_time="60000"
        thread_pool.queue_enabled="true"
        thread_pool.queue_max_size="100000"
        thread_pool.rejection_policy="discard"

        oob_thread_pool.enabled="true"
        oob_thread_pool.min_threads="10"
        oob_thread_pool.max_threads="100"
        oob_thread_pool.keep_alive_time="60000"
        oob_thread_pool.queue_enabled="false"
        oob_thread_pool.queue_max_size="100"
        oob_thread_pool.rejection_policy="discard"   

        logical_addr_cache_expiration="1000"
        logical_addr_cache_reaper_interval="10000"
         />

   <S3_PING location="bob-s3-ping-dev" remove_all_files_on_view_change="true" remove_old_coords_on_view_change="true"/>

   <MERGE3 max_interval="60000" min_interval="30000"/>

   <FD_SOCK/>

   <FD timeout="3000" max_tries="5"/>

   <VERIFY_SUSPECT timeout="2000"/>

   <pbcast.NAKACK use_mcast_xmit="false" retransmit_timeout="300,600,1200,2400,4800" discard_delivered_msgs="true"/>

   <UNICAST3/>

   <pbcast.STABLE stability_delay="1500" desired_avg_gossip="50000" max_bytes="2m"/>

   <pbcast.GMS print_local_addr="false" join_timeout="2500" max_bundling_time="50" view_bundling="true" max_join_attempts="${jgroups_max_join_attempts}"/>

   <pbcast.STATE_TRANSFER  />

   <!-- top -->
   <!-- /\ down -->
   <!-- \/ up -->

</config>

> S3_PING: Nodes never removed from .list file
> --------------------------------------------
>
>                 Key: JGRP-1957
>                 URL: https://issues.jboss.org/browse/JGRP-1957
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.6.4
>         Environment: JGroups client running on Mac OS X - Yosemite
> JDK 1.7.71
>            Reporter: Nick Sawadsky
>            Assignee: Bela Ban
>            Priority: Minor
>             Fix For: 3.6.6
>
>
> I'm not 100% sure, but it seems like there might be a defect here.
> I'm using TCP, S3_PING, and MERGE3. 
> I've set logical_addr_cache_max_size to 2 for testing purposes, although I don't think the value of this setting affects my test results.
> I start a single node, node A. Then I start a second node, node B.
> I then repeatedly shut down and restart node B.
> Each time node B starts, a new row is added to the .list file stored in S3. 
> But even if I continue this process for 15 minutes, old rows are never removed from the .list file, so it continues to grow in size.
> I've read the docs and mailing list threads, so I'm aware that the list is not updated immediately when a member leaves. But I was expecting that when a view change occurs, nodes no longer in the view would be marked for removal (line 2193 of TP.java). Then, once logical_addr_cache_expiration has elapsed and the reaper has run, the expired cache entries would be purged from the file the next time a new node joins.
> I dug into the code a bit, and what seems to be happening is that the MERGE3 protocol periodically generates a FIND_MBRS event. S3_PING retrieves the membership from the .list file, which includes expired nodes. All of these members are then re-added to the logical address cache (line 157 of S3_PING.java, line 533 of Discovery.java, line 2263 of TP.java).
> So expired nodes are continually re-added to the logical address cache, preventing them from ever being reaped.
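
The loop described in the report can be illustrated with a toy model. This is a sketch under stated assumptions, not JGroups code: every class and method name below is hypothetical, the timings are illustrative, and the model assumes that re-adding a member clears its removal mark and that the list file is rewritten from the cache when a new member joins.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Toy model only; none of these names are JGroups classes or methods. */
final class LogicalAddressCacheModel {
    private static final long NOT_MARKED = -1L;

    // member name -> time it was marked removable, or NOT_MARKED while considered live
    private final Map<String, Long> cache = new HashMap<>();
    // stand-in for the rows of the .list file in the bucket
    private Set<String> listFile = new HashSet<>();

    /** Adds or re-adds a member; in this model a re-add clears any removal mark (assumption). */
    void add(String member) {
        cache.put(member, NOT_MARKED);
    }

    /** View change: members missing from the new view are marked removable at time now. */
    void onViewChange(Set<String> view, long now) {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            if (!view.contains(e.getKey()) && e.getValue() == NOT_MARKED) {
                e.setValue(now);
            }
        }
    }

    /** Discovery: every row in the list file is put back into the cache. */
    void discoverFromListFile() {
        for (String member : listFile) {
            add(member);
        }
    }

    /** Reaper: drops entries whose removal mark is at least expirationMs old. */
    void reap(long now, long expirationMs) {
        Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
        while (it.hasNext()) {
            long markedAt = it.next().getValue();
            if (markedAt != NOT_MARKED && now - markedAt >= expirationMs) {
                it.remove();
            }
        }
    }

    /** Rewrites the list file from the cache, as assumed to happen when a new member joins. */
    void writeListFile() {
        listFile = new HashSet<>(cache.keySet());
    }

    /** Returns the number of rows left in the list file after B1 has left the view. */
    static int simulate(boolean rediscover) {
        LogicalAddressCacheModel m = new LogicalAddressCacheModel();
        m.add("A");
        m.add("B1");
        m.writeListFile();

        Set<String> view = new HashSet<>();
        view.add("A");
        m.onViewChange(view, 0L);      // B1 is marked removable

        if (rediscover) {
            m.discoverFromListFile();  // B1 is re-added and its mark is cleared
        }
        m.reap(10_000L, 1_000L);       // reaper runs after the expiration has elapsed
        m.writeListFile();
        return m.listFile.size();
    }

    public static void main(String[] args) {
        System.out.println("rows without rediscovery: " + simulate(false));
        System.out.println("rows with rediscovery:    " + simulate(true));
    }
}
```

In this model the stale row for B1 disappears only when discovery does not run between the view change and the reaper. With rediscovery, the row survives the reaper and is written back to the file, which matches the growth of the .list file described in the report.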
