[jboss-jira] [JBoss JIRA] (JGRP-2245) JGroup JDBC_PING is not clearing the crashed members

Thu Jan 18 17:25:00 EST 2018

     [ https://issues.jboss.org/browse/JGRP-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sibin Karnavar updated JGRP-2245:
---------------------------------
    Description: 
1) In AWS cloud environments, when a node crashes and a replacement cluster node is created, the new node gets a different IP address.
2) In this situation, JGroups does not clear logical_addr_cache, and it gets confused when we restart the cluster nodes.
3) logical_addr_cache_max_size and the eviction did not work, because the cache keeps getting updated again from the ping (discovery) data, so the stale entries are never marked as removable.
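Point 3 can be made concrete with a simplified model of a cache with lazy removal. It is a sketch under stated assumptions, not JGroups code, and the class LazyCacheModel and its methods are hypothetical:

- An entry is marked removable when its member leaves the view.
- Once the maximum size is exceeded, eviction only drops entries that are still marked.
- If discovery re-adds the stale mapping (for example from a row read back from the DB), the mark is reset and the entry is never evicted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a lazy-removal address cache.
// It is NOT JGroups' own cache class; names and behavior are illustrative assumptions.
class LazyCacheModel {
    // A cached physical address plus a flag saying "may be evicted".
    static final class Entry {
        final String physicalAddr;
        boolean removable;            // false for freshly added entries
        Entry(String physicalAddr) { this.physicalAddr = physicalAddr; }
    }

    private final Map<String, Entry> map = new HashMap<>();
    private final int maxSize;

    LazyCacheModel(int maxSize) { this.maxSize = maxSize; }

    // Called whenever discovery reports a mapping. Re-adding replaces the
    // entry and therefore clears its removable flag.
    void add(String logicalAddr, String physicalAddr) {
        map.put(logicalAddr, new Entry(physicalAddr));
        if (map.size() > maxSize)
            evictRemovable();
    }

    // Marks an entry as removable, e.g. when the member leaves the view.
    void markRemovable(String logicalAddr) {
        Entry e = map.get(logicalAddr);
        if (e != null)
            e.removable = true;
    }

    // Evicts only entries that are still marked removable.
    void evictRemovable() {
        map.values().removeIf(e -> e.removable);
    }

    boolean contains(String logicalAddr) { return map.containsKey(logicalAddr); }
    int size() { return map.size(); }

    public static void main(String[] args) {
        LazyCacheModel cache = new LazyCacheModel(2);
        cache.add("A", "10.0.0.1:7800");
        cache.add("B", "10.0.0.2:7800");   // B crashes afterwards
        cache.markRemovable("B");          // B is dropped from the view
        cache.add("B", "10.0.0.2:7800");   // stale mapping re-learned via ping
        cache.add("B2", "10.0.0.3:7800");  // replacement node with a new IP
        System.out.println(cache.contains("B") + " " + cache.size()); // prints: true 3
    }
}
```

The model shows two effects. The stale entry survives, and the cache grows past its configured maximum because nothing is marked removable when eviction runs.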

I think the issue is the following:

The handleView() method always rewrites the entire cache to the DB on a view change. So even if we clear the table using the flags remove_all_data_on_view_change or remove_old_coords_on_view_change, the entries get written back to the table.

    // remove all files which are not from the current members
    protected void handleView(View new_view, View old_view, boolean coord_changed) {
        if(is_coord) {
            if(coord_changed) {
                if(remove_all_data_on_view_change)
                    removeAll(cluster_name);
                else if(remove_old_coords_on_view_change) {
                    Address old_coord=old_view != null? old_view.getCreator() : null;
                    if(old_coord != null)
                        remove(cluster_name, old_coord);
                }
            }
            if(coord_changed || View.diff(old_view, new_view)[1].length > 0) {
                writeAll();
                if(remove_all_data_on_view_change || remove_old_coords_on_view_change)
                    startInfoWriter();
            }
        }
        else if(coord_changed) // I'm no longer the coordinator
            remove(cluster_name, local_addr);
    }
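The reported sequence can be modelled as follows. This is a hypothetical sketch, not JGroups code:

- The JDBC_PING table, the address cache and the view are plain Java collections.
- handleViewFiltered() is only one possible mitigation, which writes just the current view members. It is not necessarily how the issue was resolved upstream.

In the excerpt above, writeAll() runs when the coordinator changed or when View.diff(old_view, new_view)[1] is non-empty. As far as I can tell, index [1] holds the members that left, so the rewrite happens right after a crash.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the coordinator's view-change handling described in
// this report. None of these types are JGroups classes.
class PingStoreModel {
    final Map<String, String> table = new HashMap<>();     // rows in the discovery table
    final Map<String, String> addrCache = new HashMap<>(); // logical -> physical address

    // Mirrors the reported behavior: clear the table, then write every cache
    // entry back. The view is ignored, so stale entries return immediately.
    void handleViewAsReported(Set<String> view) {
        table.clear();            // like remove_all_data_on_view_change
        table.putAll(addrCache);  // like writeAll() writing the whole cache
    }

    // Assumed mitigation (illustrative only): write back only the entries
    // whose logical address is a member of the current view.
    void handleViewFiltered(Set<String> view) {
        table.clear();
        for (Map.Entry<String, String> e : addrCache.entrySet())
            if (view.contains(e.getKey()))
                table.put(e.getKey(), e.getValue());
    }

    public static void main(String[] args) {
        PingStoreModel m = new PingStoreModel();
        m.addrCache.put("A", "10.0.0.1:7800");
        m.addrCache.put("B", "10.0.0.2:7800");   // crashed, but still cached
        m.addrCache.put("B2", "10.0.0.3:7800");  // replacement node
        Set<String> view = new HashSet<>(Set.of("A", "B2"));

        m.handleViewAsReported(view);
        System.out.println(m.table.containsKey("B")); // prints: true
        m.handleViewFiltered(view);
        System.out.println(m.table.containsKey("B")); // prints: false
    }
}
```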

4) Because the crashed members' IP addresses no longer exist, sendToMembers() in TP keeps trying to send messages to the old crashed members and writes error logs.
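Point 4 can be illustrated with a simplified model. It assumes that the sender fans each message out to every mapping still held in the cache. Under that assumption, every send produces one error per stale address, so the error log grows with message traffic. FanOutModel, fanOut() and the reachable set are hypothetical stand-ins for the transport and the network, not JGroups APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the symptom in point 4. "reachable" stands in for
// the network: physical addresses that still exist.
class FanOutModel {
    // Sends one message to every cached mapping; each unreachable physical
    // address adds one entry to errorLog. Returns the members reached.
    static List<String> fanOut(Map<String, String> addrCache,
                               Set<String> reachable,
                               List<String> errorLog) {
        List<String> delivered = new ArrayList<>();
        for (Map.Entry<String, String> e : addrCache.entrySet()) {
            String phys = e.getValue();
            if (reachable.contains(phys))
                delivered.add(e.getKey());
            else
                errorLog.add("failed sending to " + e.getKey() + " at " + phys);
        }
        return delivered;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new LinkedHashMap<>();
        cache.put("A", "10.0.0.1:7800");
        cache.put("B", "10.0.0.2:7800");   // crashed node; this IP no longer exists
        cache.put("B2", "10.0.0.3:7800");  // replacement node
        Set<String> reachable = Set.of("10.0.0.1:7800", "10.0.0.3:7800");

        List<String> errors = new ArrayList<>();
        for (int i = 0; i < 3; i++)        // three messages
            fanOut(cache, reachable, errors);
        System.out.println(errors.size()); // prints: 3 (one error per message)
    }
}
```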

  was:
1) In AWS cloud environments, when a node crashes and a new node is created, the new node gets a different IP address.
2) In this situation, JGroups does not clear logical_addr_cache, and it gets confused when other nodes restart.
3) logical_addr_cache_max_size and the eviction did not work, because the cache keeps getting updated again from the ping (discovery) data, so the stale entries are never marked as removable.

I think the issue is the following:

The handleView() method always rewrites the entire cache to the DB on a view change. So even if we clear the table using the flags remove_all_data_on_view_change or remove_old_coords_on_view_change, the entries get written back to the table.

    // remove all files which are not from the current members
    protected void handleView(View new_view, View old_view, boolean coord_changed) {
        if(is_coord) {
            if(coord_changed) {
                if(remove_all_data_on_view_change)
                    removeAll(cluster_name);
                else if(remove_old_coords_on_view_change) {
                    Address old_coord=old_view != null? old_view.getCreator() : null;
                    if(old_coord != null)
                        remove(cluster_name, old_coord);
                }
            }
            if(coord_changed || View.diff(old_view, new_view)[1].length > 0) {
                writeAll();
                if(remove_all_data_on_view_change || remove_old_coords_on_view_change)
                    startInfoWriter();
            }
        }
        else if(coord_changed) // I'm no longer the coordinator
            remove(cluster_name, local_addr);
    }

4) Because the crashed members' IP addresses no longer exist, sendToMembers() in TP keeps trying to send messages to the old crashed members and writes error logs.

> JGroup JDBC_PING is not clearing the crashed members
> ----------------------------------------------------
>
>                 Key: JGRP-2245
>                 URL: https://issues.jboss.org/browse/JGRP-2245
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.0.8
>            Reporter: Sibin Karnavar
>            Assignee: Bela Ban
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)