[jboss-user] [Clustering/JBoss] - Cluster Membership after Network Failure

Fri Sep 22 13:06:25 EDT 2006

I'm using JBoss 4.0.4 and can't get my cluster configuration right.
I have two nodes, each using this TCP config:

  |          <Config>
  |             <TCP bind_addr="X.X.X.1" start_port="7800" loopback="true" conn_expire_time="5000"/>
  |             <TCPPING initial_hosts="X.X.X.1[7800],X.X.X.2[7800]" port_range="1" timeout="3500"
  |                num_initial_members="2" up_thread="true" down_thread="true"/>
  |             <MERGE2 min_interval="5000" max_interval="10000"/>
  |             <FD_SOCK down_thread="false" up_thread="false"/>
  |             <FD timeout="2500" shun="true" max_tries="5" up_thread="false" down_thread="false" />
  |             <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
  |             <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
  |                retransmit_timeout="3000"/>
  |             <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
  |             <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
  |                print_local_addr="true" down_thread="true" up_thread="true"/>
  |             <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
  |          </Config>
  | 

If I pull the network cable from one of the nodes, wait a minute, then plug it back in, the cluster membership is never rebuilt on either node.
At that point farming doesn't work and I have to restart one of the nodes.
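
For reference, here is a rough sketch of how long FD should take to suspect the unplugged node with the settings above. It assumes FD suspects a member after max_tries + 1 missed heartbeats; that is only my reading of the "for 6 times (15000 milliseconds)" line in the JGroups log further down, not something I have confirmed in the JGroups source. The function name is made up for illustration.

```python
def fd_suspect_delay_ms(timeout_ms, max_tries):
    """Estimate the time before FD suspects a silent member.

    Assumption (inferred from the log, not from the JGroups source):
    a member is suspected after max_tries + 1 missed heartbeats,
    each waited on for timeout_ms.
    """
    return (max_tries + 1) * timeout_ms


# Values taken from the <FD .../> element in the config above.
delay = fd_suspect_delay_ms(timeout_ms=2500, max_tries=5)
print(f"FD suspects a silent member after about {delay} ms")
```

If that reading is right, the minute I wait before plugging the cable back in is well past the point where each node has suspected the other.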

Here is a snippet of a consolidated server log:
anonymous wrote : 
  | node-1 2006-09-22 11:18:32,100 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: node-2:7800 (additional data: 17 bytes)
  | node-2 2006-09-22 11:18:32,203 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: node-1:7800 (additional data: 17 bytes)
  | node-2 2006-09-22 11:18:32,212 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 4, delta: -1) : [X.X.X.2:-1]
  | node-2 2006-09-22 11:18:32,216 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (X.X.X.2:-1) received membershipChanged event:
  | node-2 2006-09-22 11:18:32,217 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 1 ([X.X.X.1:-1])
  | node-2 2006-09-22 11:18:32,217 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
  | node-2 2006-09-22 11:18:32,218 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([X.X.X.2:-1])
  | node-1 2006-09-22 11:18:34,633 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 4, delta: -1) : [X.X.X.1:-1]
  | node-1 2006-09-22 11:18:34,634 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (X.X.X.1:-1) received membershipChanged event:
  | node-1 2006-09-22 11:18:34,635 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 1 ([X.X.X.2:-1])
  | node-1 2006-09-22 11:18:34,635 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
  | node-1 2006-09-22 11:18:34,635 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([X.X.X.1:-1])
  | node-2 2006-09-22 11:18:34,892 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [node-2:7810|2] [node-2:7810]
  | node-1 2006-09-22 11:18:36,139 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: node-2:7800 (additional data: 17 bytes)
  | node-1 2006-09-22 11:23:52,531 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [node-1:7810|2] [node-1:7810]
  | node-2 2006-09-22 11:24:05,025 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [node-2:7810|0] [node-2:7810]
  | node-2 2006-09-22 11:24:05,025 INFO  [org.jboss.cache.TreeCache] new cache is null (may be first member in cluster)
  | node-1 2006-09-22 11:24:05,059 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [node-1:7810|0] [node-1:7810]
  | node-1 2006-09-22 11:24:05,059 INFO  [org.jboss.cache.TreeCache] new cache is null (may be first member in cluster)
  | 
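
In case anyone wants to build the same kind of consolidated view, here is a minimal sketch of merging per-node logs by timestamp. It assumes every line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp (continuation lines such as stack traces would need extra handling) and that each node's lines are already in time order. The `consolidate` function and its input shape are hypothetical.

```python
import heapq
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LEN = len("2006-09-22 11:18:32,100")


def parse_ts(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LEN], TS_FORMAT)


def consolidate(logs):
    """Merge per-node logs into one list ordered by timestamp.

    logs maps a node name to its list of log lines, each list
    already sorted by time. Every output line is prefixed with
    the name of the node it came from.
    """
    streams = [
        [(parse_ts(line), node, line) for line in lines]
        for node, lines in logs.items()
    ]
    merged = heapq.merge(*streams)
    return [f"{node} {line}" for _, node, line in merged]
```

Note that the ordering is only as good as the clock synchronisation between the two machines.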

And here is a snippet of the jgroups log on node-1:
anonymous wrote : 
  | 
  |  2006-09-22 11:18:15,537 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7810 (own address=node-1:7810)
  |  2006-09-22 11:18:15,541 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7800 (additional data: 17 bytes) (own address=node-1:7800 (additional data: 17 bytes))
  |  2006-09-22 11:18:15,541 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node-2:7800 (additional data: 17 bytes) (number=0)
  |  2006-09-22 11:18:16,365 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
  |  2006-09-22 11:18:19,149 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 9 (9)], node-2:7810: [0 : 4 (4)]] (num_gossip_runs=1, max_gossip_runs=3)
  |  2006-09-22 11:18:19,150 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task terminating (num_gossip_runs=0, max_gossip_runs=3)
  |  2006-09-22 11:18:25,166 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#19 (19), node-2:7800 (additional data: 17 bytes)#87 (87) from node-1:7800 (additional data: 17 bytes)
  |  2006-09-22 11:18:28,082 DEBUG [org.jgroups.protocols.FD] [node-1:7800 (additional data: 17 bytes)]: received no heartbeat ack from node-2:7800 (additional data: 17 bytes) for 6 times (15000 milliseconds), suspecting it
  |  2006-09-22 11:18:28,082 DEBUG [org.jgroups.protocols.FD] mbr=node-2:7800 (additional data: 17 bytes) (size=1)
  |  2006-09-22 11:18:30,586 DEBUG [org.jgroups.protocols.FD] mbr=node-2:7810 (size=1)
  |  2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7800 (additional data: 17 bytes) (own address=node-1:7800 (additional data: 17 bytes))
  |  2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node-2:7800 (additional data: 17 bytes) (number=0)
  |  2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7800 (additional data: 17 bytes)]] to group
  |  2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:18:30,591 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7800 (additional data: 17 bytes)], from=node-1:7800 (additional data: 17 bytes))]
  |  2006-09-22 11:18:32,098 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7800 (additional data: 17 bytes)
  |  2006-09-22 11:18:32,098 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  |  2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=4, current members=(node-1:7800 (additional data: 17 bytes), node-2:7800 (additional data: 17 bytes)), new_mbrs=(), old_mbrs=(), suspected_mbrs=(node-2:7800 (additional data: 17 bytes))
  |  2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
  |  2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] mcasting view {[node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]} (1 mbrs)
  | 
  |  2006-09-22 11:18:32,099 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7800 (additional data: 17 bytes)
  |  2006-09-22 11:18:33,098 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7800 (additional data: 17 bytes), node-2:7800 (additional data: 17 bytes)], pingable_mbrs=[node-1:7800 (additional data: 17 bytes)], local_addr=node-1:7800 (additional data: 17 bytes)
  |  2006-09-22 11:18:34,631 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
  |  2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7800 (additional data: 17 bytes)] view is [node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
  |  2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  |  2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing node-2:7800 (additional data: 17 bytes) from received_msgs (not member anymore)
  |  2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [node-2:7800 (additional data: 17 bytes)], after adjustment: [], stopped: true 
  |  2006-09-22 11:18:34,633 DEBUG [org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [node-1:7800 (additional data: 17 bytes)] 
  |  2006-09-22 11:18:34,634 DEBUG [org.jgroups.protocols.FD_SOCK] socket to null was reset
  |  2006-09-22 11:18:34,634 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
  |  2006-09-22 11:18:36,138 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr node-2:7800 (additional data: 17 bytes) is not a member !
  |  2006-09-22 11:18:36,139 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7800 (additional data: 17 bytes)
  |  2006-09-22 11:18:38,818 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7800 (additional data: 17 bytes): [0 : 21 (21)]] (num_gossip_runs=3, max_gossip_runs=3)
  |  2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#21 (21) from node-1:7800 (additional data: 17 bytes)
  |  2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#21 (21)
  |  2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 270
  |  2006-09-22 11:18:39,098 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7800 (additional data: 17 bytes)#21]
  |  2006-09-22 11:18:39,099 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
  |  2006-09-22 11:18:39,099 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7800 (additional data: 17 bytes): [-1 : 21 (21)]]
  |  2006-09-22 11:22:58,567 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#9 (9), node-2:7810#4 (4) from node-1:7810
  |  2006-09-22 11:22:58,571 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:22:58,580 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7810
  |  2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  |  2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=2, current members=(node-1:7810, node-2:7810), new_mbrs=(), old_mbrs=(), suspected_mbrs=(node-2:7810)
  |  2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [node-1:7810|2] [node-1:7810]
  |  2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] mcasting view {[node-1:7810|2] [node-1:7810]} (1 mbrs)
  |  2006-09-22 11:23:00,077 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7810
  |  2006-09-22 11:23:01,084 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:23:01,084 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:23:01,084 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:23:04,540 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
  |  2006-09-22 11:23:26,158 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#23 (23) from node-1:7800 (additional data: 17 bytes)
  |  2006-09-22 11:23:26,159 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#23 (23)
  |  2006-09-22 11:23:26,159 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 4955
  |  2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#23 (24) from node-1:7800 (additional data: 17 bytes)
  |  2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#23 (24)
  |  2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=org.jgroups.protocols.pbcast.STABLE$StabilitySendTask at d1ebcd, delay is 5216
  |  2006-09-22 11:23:28,629 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:23:28,629 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:23:28,630 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7800 (additional data: 17 bytes)#23]
  |  2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
  |  2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7800 (additional data: 17 bytes): [-1 : 23 (23)]]
  |  2006-09-22 11:23:33,045 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
  |  2006-09-22 11:23:33,570 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 10 (11)], node-2:7810: [0 : 4 (4)]] (num_gossip_runs=3, max_gossip_runs=3)
  |  2006-09-22 11:23:33,637 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:23:33,637 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:23:33,638 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:23:38,098 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
  |  2006-09-22 11:23:44,754 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
  |  2006-09-22 11:23:46,158 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:23:46,158 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:23:46,158 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:23:48,666 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:23:48,667 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:23:48,667 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:23:49,514 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
  |  2006-09-22 11:23:51,174 WARN  [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
  |  2006-09-22 11:23:51,174 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
  |  2006-09-22 11:23:51,175 DEBUG [org.jgroups.protocols.FD] task done
  |  2006-09-22 11:23:52,443 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:52,530 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node-1:7810|2] [node-1:7810]
  |  2006-09-22 11:23:52,530 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7810] view is [node-1:7810|2] [node-1:7810]
  |  2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  |  2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing node-2:7810 from received_msgs (not member anymore)
  |  2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [node-2:7810], after adjustment: [], stopped: true
  |  2006-09-22 11:23:52,534 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#10 (11), node-2:7810#4 (4) from node-1:7810
  |  2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7810#10 (11), node-2:7810#4 (4)
  |  2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 141
  |  2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
  |  2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7810#10, node-2:7810#4]
  |  2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
  |  2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest (digest=[node-1:7810: [-1 : 10 (11)], node-2:7810: [-1 : 4 (4)]]) which does not match my own digest ([node-1:7810: [-1 : -1]): ignoring digest and re-initializing own digest
  |  2006-09-22 11:23:53,950 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7810
  |  2006-09-22 11:23:53,951 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr node-2:7810 is not a member !
  |  2006-09-22 11:23:53,951 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7810
  |  2006-09-22 11:23:57,443 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
  |  2006-09-22 11:23:59,535 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
  |  2006-09-22 11:24:00,963 DEBUG [org.jgroups.protocols.FD] node-2:7810 is not in [node-1:7810] ! Telling it to leave group
  |  2006-09-22 11:24:00,963 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-1:7810], from=node-2:7810)]
  |  2006-09-22 11:24:00,963 WARN  [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
  |  2006-09-22 11:24:00,976 DEBUG [org.jgroups.protocols.FD] [NOT_MEMBER] I'm being shunned; exiting
  |  2006-09-22 11:24:00,979 WARN  [org.jgroups.protocols.pbcast.NAKACK] [node-1:7810] discarded message from non-member node-2:7810
  |  2006-09-22 11:24:00,980 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node-1:7810:
  | sent_msgs: [0 - 13]
  | received_msgs:
  | node-1:7810: received_msgs: [], delivered_msgs: [0 - 13]
  |  2006-09-22 11:24:01,492 DEBUG [org.jgroups.protocols.pbcast.GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
  |  2006-09-22 11:24:05,055 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are []
  |  2006-09-22 11:24:05,055 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] no initial members discovered: creating group as first member
  |  2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7810] view is [node-1:7810|0] [node-1:7810]
  |  2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  |  2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] node-1:7810 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl
  |  2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] node-1:7810 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl
  |  2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] created group (first member). My view is [node-1:7810|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
  |  2006-09-22 11:24:05,057 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [], after adjustment: [], stopped: true
  |  2006-09-22 11:24:05,058 DEBUG [org.jgroups.protocols.MERGE2] merge task started
  |  2006-09-22 11:24:05,058 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: first member (no state)
  |  2006-09-22 11:24:12,723 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
  |  2006-09-22 11:24:18,476 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
  |  2006-09-22 11:24:25,524 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
  |  2006-09-22 11:24:28,436 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
  |  2006-09-22 11:24:35,345 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
  |  2006-09-22 11:24:38,985 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 0] (num_gossip_runs=3, max_gossip_runs=3)
  |  2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
  |  2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#0 (-1) from node-1:7810
  |  2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7810#0 (-1)
  |  2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 502
  |  2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7810#0]
  |  2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
  |  2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7810: [-1 : 0]
  | 
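
To make the MERGE2 activity in the log above easier to follow, here is a small sketch that pulls out the `initial_mbrs=` lines and lists which coordinators each discovery round found. The regular expressions are based only on the line format shown above and expect raw log lines (without the forum's `|` quote prefix); the function name is hypothetical.

```python
import re

# Matches a MERGE2 discovery line and captures the timestamp and the member list.
MERGE2_RE = re.compile(
    r"^\s*(\S+ \S+) DEBUG \[org\.jgroups\.protocols\.MERGE2\] initial_mbrs=(.*)$"
)
# Captures the address that follows coord_addr= (host:port only).
COORD_RE = re.compile(r"coord_addr=([^\s,\]]+)")


def merge2_discoveries(lines):
    """Return (timestamp, [coordinator addresses]) for each MERGE2 round."""
    results = []
    for line in lines:
        match = MERGE2_RE.match(line)
        if match:
            results.append((match.group(1), COORD_RE.findall(match.group(2))))
    return results
```

Run against the node-1 log, this should show rounds that find nothing alternating with rounds that find node-2 as coordinator of its own group, which is the part I can't explain: node-2 is discovered, yet the two groups never seem to merge.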

I have tried the FD config both with and without shun; neither option results in the cluster membership being updated.
Any ideas on what I'm doing wrong?
Thanks.

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3973608#3973608
