[jboss-user] [JBossCache] - Problem when a node tries to join the cluster

Thu Nov 16 10:13:42 EST 2006

Hello,

I have some problems when try to bring up a new node to an existing cluster. It seems the new node
waits for pretty long time before it receives the join response. I set the PING timeout  to be 2 seconds,
the new node always timed out and thought itself to be the first member in the group, and thus, cannot
fetch the in-memory state from other nodes during startup. Until after sometime, it joined the group,  then
everything works fine. But before that,  system does not work properly because the cache state is not
synchronized in the new node. (I use JBOSS TreeCache)

I changed the PING timeout to be 20 seconds, in most cases, the
new node can join the existing group, but there are still some cases the new node timed out and
thought itself to be the first member in the group. I use Fedora core 6 linux box, use JBoss TreeCache
and tomcat. The cluster includes only two nodes. I just wonder why it takes so long for the new node
to join the existing group?

The trace information and JBoss Cache configuration are attached. Any hints are welcome.

Thanks in advance,

Jian Fang

--------------------------------
Trace information:
--------------------------------
11:	class org.jgroups.protocols.PerfHeader
12:	class org.jgroups.protocols.PIGGYBACK$PiggybackHeader
13:	class org.jgroups.protocols.PingHeader
--

2006-11-15 13:03:26,292 (main) org.jgroups.protocols.AUTOCONF.senseMaxFragSize
  DEBUG: frag_size=64000

2006-11-15 13:03:26,361 (main) org.jgroups.protocols.pbcast.GMS.setImpl
  DEBUG: changed role to org.jgroups.protocols.pbcast.ClientGmsImpl

2006-11-15 13:03:26,415 (DownHandler (STABLE)) org.jgroups.protocols.pbcast.STABLE.startStableTask
  DEBUG: stable task started

2006-11-15 13:03:26,420 (DownHandler (UDP)) org.jgroups.protocols.UDP.start
  DEBUG: creating sockets and starting threads

2006-11-15 13:03:26,421 (DownHandler (UDP)) org.jgroups.protocols.UDP.createSockets
  INFO : sockets will use interface 10.20.2.37

2006-11-15 13:03:26,426 (DownHandler (UDP)) org.jgroups.protocols.UDP.createSockets
--
mcast_send_sock: bound to 10.20.2.37:33043, send buffer size=320000, receive buffer size=320000

2006-11-15 13:03:26,433 (DownHandler (UDP)) org.jgroups.protocols.UDP.bindToInterfaces
  DEBUG: joined /224.0.0.75:7500 on eth0

2006-11-15 13:03:26,435 (DownHandler (UDP)) org.jgroups.protocols.UDP.bindToInterfaces
  DEBUG: joined /224.0.0.75:7500 on lo

2006-11-15 13:03:26,438 (DownHandler (UDP)) org.jgroups.protocols.UDP.startThreads
  DEBUG: created unicast receiver thread

2006-11-15 13:03:26,440 (UpHandler (STATE_TRANSFER)) org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.passUp
  DEBUG: setting local_addr (null) to 10.20.2.37:33042

2006-11-15 13:03:26,442 (PingSender) org.jgroups.protocols.PingSender.run
  DEBUG: sending GET_MBRS_REQ

2006-11-15 13:03:26,442 (PingWaiter) org.jgroups.protocols.PingWaiter.findInitialMembers
  DEBUG: waiting for initial members: time_to_wait=2000, got 0 rsps

2006-11-15 13:03:26,447 (DownHandler (UDP)) org.jgroups.protocols.UDP.down
  DEBUG: sending msg to null (src=10.20.2.37:33042), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[channel_name=JTVCORE-Cluster]}

2006-11-15 13:03:26,450 (UDP mcast receiver) org.jgroups.protocols.UDP.receive
  DEBUG: received (mcast)57 bytes from 10.20.2.37:33043

2006-11-15 13:03:26,454 (UDP mcast receiver) org.jgroups.protocols.UDP.handleIncomingMessage
  DEBUG: message is [dst: 228.8.8.8:45566, src: 10.20.2.37:33042 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=JTVCORE-Cluster], PING=[PING: type=GET_MBRS_REQ, arg=null]}

2006-11-15 13:03:26,455 (UpHandler (PING)) org.jgroups.protocols.PING.up
  DEBUG: discarded my own discovery request

2006-11-15 13:03:27,445 (PingSender) org.jgroups.protocols.PingSender.run
  DEBUG: sending GET_MBRS_REQ

2006-11-15 13:03:27,447 (DownHandler (UDP)) org.jgroups.protocols.UDP.down
  DEBUG: sending msg to null (src=10.20.2.37:33042), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[channel_name=JTVCORE-Cluster]}

2006-11-15 13:03:27,448 (UDP mcast receiver) org.jgroups.protocols.UDP.receive
  DEBUG: received (mcast)57 bytes from 10.20.2.37:33043

2006-11-15 13:03:27,449 (UDP mcast receiver) org.jgroups.protocols.UDP.handleIncomingMessage
  DEBUG: message is [dst: 228.8.8.8:45566, src: 10.20.2.37:33042 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=JTVCORE-Cluster], PING=[PING: type=GET_MBRS_REQ, arg=null]}

2006-11-15 13:03:27,450 (UpHandler (PING)) org.jgroups.protocols.PING.up
  DEBUG: discarded my own discovery request

2006-11-15 13:03:27,557 (UDP mcast receiver) org.jgroups.protocols.UDP.receive
  DEBUG: received (mcast)57 bytes from 10.20.2.38:32789

2006-11-15 13:03:27,558 (UDP mcast receiver) org.jgroups.protocols.UDP.handleIncomingMessage
  DEBUG: message is [dst: 228.8.8.8:45566, src: 10.20.2.38:32788 (2 headers), size = 0 bytes], headers are {UDP=[channel_name=JTVCORE-Cluster], PING=[PING: type=GET_MBRS_REQ, arg=null]}

2006-11-15 13:03:27,559 (UpHandler (PING)) org.jgroups.protocols.PING.up
  DEBUG: received GET_MBRS_REQ from 10.20.2.38:32788, sending response [PING: type=GET_MBRS_RSP, arg=[own_addr=10.20.2.37:33042, coord_addr=10.20.2.37:33042, is_server=false]]

2006-11-15 13:03:27,560 (DownHandler (UDP)) org.jgroups.protocols.UDP.down
  DEBUG: sending msg to 10.20.2.38:32788 (src=10.20.2.37:33042), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=10.20.2.37:33042, coord_addr=10.20.2.37:33042, is_server=false]], UDP=[channel_name=JTVCORE-Cluster]}

2006-11-15 13:03:28,445 (PingWaiter) org.jgroups.protocols.PingWaiter.findInitialMembers
  DEBUG: initial mbrs are []

2006-11-15 13:03:28,446 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.join
  DEBUG: initial_mbrs are []

2006-11-15 13:03:28,447 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.join
  DEBUG: no initial members discovered: creating group as first member

2006-11-15 13:03:28,448 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.installView
  DEBUG: [local_addr=10.20.2.37:33042] view is [10.20.2.37:33042|0] [10.20.2.37:33042]

2006-11-15 13:03:28,456 (main) org.jboss.cache.TreeCache.startService
  INFO : TreeCache local address is 10.20.2.37:33042

2006-11-15 13:03:28,459 (UpHandler (STATE_TRANSFER)) org.jboss.cache.TreeCache.viewAccepted
  INFO : viewAccepted(): [10.20.2.37:33042|0] [10.20.2.37:33042]

2006-11-15 13:03:28,463 (DownHandler (STATE_TRANSFER)) org.jgroups.protocols.pbcast.STATE_TRANSFER.down
  DEBUG: GET_STATE: first member (no state)

2006-11-15 13:03:28,466 (DownHandler (STABLE)) org.jgroups.protocols.pbcast.STABLE.resetDigest
  DEBUG: resetting digest from NAKACK: [10.20.2.37:33042#-1]

2006-11-15 13:03:28,469 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.setImpl
  DEBUG: 10.20.2.37:33042 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl

2006-11-15 13:03:28,469 (DownHandler (GMS)) org.jgroups.protocols.pbcast.GMS.becomeSingletonMember
  DEBUG: created group (first member). My view is [10.20.2.37:33042|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl

2006-11-15 13:03:28,483 (main) org.jboss.cache.TreeCache.fetchStateOnStartup
  INFO : State could not be retrieved (we are the first member in group)

----------------------
configuration
----------------------

		jboss:service=Naming

		jboss:service=TransactionManager

		<!-- Configure the TransactionManager -->
		<!-- org.jboss.cache.DummyTransactionManagerLookup-->
		com.jtv.core.resource.impl.spring.cache.JBossCacheTransactionManagerLookup

		<!--
            Node locking scheme:
                OPTIMISTIC
                PESSIMISTIC (default)
		-->     
       <!--  PESSIMISTIC -->

		<!--
		Node locking level : SERIALIZABLE
		REPEATABLE_READ (default)
		READ_COMMITTED
		READ_UNCOMMITTED
		NONE
		-->
		REPEATABLE_READ

		<!--
		Valid modes are LOCAL
		REPL_ASYNC
		REPL_SYNC
		INVALIDATION_ASYNC
		INVALIDATION_SYNC
		-->
		REPL_SYNC

		<!-- Name of cluster. Needs to be the same for all clusters, in order
		to find each other
		-->
		JTVCORE-Cluster

		<!-- JGroups protocol stack properties. Can also be a URL,
		e.g. file:/home/bela/default.xml

		-->

				<UDP mcast_send_buf_size="320000"
					mcast_port="45566"
					ucast_recv_buf_size="320000"
					mcast_addr="228.8.8.8"
					bind_addr="10.20.2.37"
					loopback="false"
					mcast_recv_buf_size="320000"
					max_bundle_size="60000"
					max_bundle_timeout="30"
					use_incoming_packet_handler="false"
					use_outgoing_packet_handler="false"
					ucast_send_buf_size="320000"
					ip_ttl="32"
					enable_bundling="false"/>
				<PING timeout="20000"
					num_initial_members="2"
                                	up_thread="true"
					down_thread="true"/>
				<MERGE2 max_interval="10000"
					min_interval="5000"/>
				<FD timeout="2000"
					max_tries="3"
					shun="true" 
					up_thread="true" 
					down_thread="true"/>
				<VERIFY_SUSPECT timeout="1500"/>
				<pbcast.NAKACK max_xmit_size="8192"
					use_mcast_xmit="false"
					gc_lag="50"
					retransmit_timeout="600,1200,2400,4800"/>

				<!--
				- desired_avg_gossip: periodically sends STABLE messages around. 0 disables this
				- max_bytes: max number of bytes received from anyone until a STABLE message is sent. Use either this or
							 desired_avg_gossip, but not both ! 0 disables it.
				- stability_delay: range (number of milliseconds) that we wait until sending a STABILITY message.
				  This prevents STABILITY multicast storms. If max_bytes is used, this should be set to a low value (> 0 though !).
				-->
				<pbcast.STABLE stability_delay="1000"
					desired_avg_gossip="20000"
					max_bytes="0"/>
				<FRAG frag_size="8192"
					down_thread="true"
					up_thread="true"/>
				<VIEW_SYNC avg_send_interval="20000" down_thread="true" up_thread="true" />
				<pbcast.GMS print_local_addr="true"
					join_timeout="3000"
					join_retry_timeout="2000"
					shun="true"/>
				<pbcast.STATE_TRANSFER 
					down_thread="true" 
					up_thread="true"/>		

        <!--
         Whether or not to fetch state on joining a cluster
         NOTE this used to be called FetchStateOnStartup and has been renamed to be more descriptive.
        -->
		<!-- -->
		true
		<!-- -->

		<!--
		The max amount of time (in milliseconds) we wait until the
		initial state (ie. the contents of the cache) are retrieved from
		existing members in a clustered environment
		-->
		20000

		<!--
		Number of milliseconds to wait until all responses for a
		synchronous call have been received.
		-->
		20000

		<!-- Max number of milliseconds to wait for a lock acquisition -->
		15000

		<!-- Name of the eviction policy class. -->
		<!--attribute name="EvictionPolicyClass"></attribute-->

		<!-- 
		org.jboss.cache.eviction.LRUPolicy
		-->

		<!-- Specific eviction policy configurations. This is LRU -->

				5
				<!-- Cache wide default /com/jtv/core-->

					5000
					0

					2000
					0

					10000
					0

		<!--
		Indicate whether to use region based marshalling or not. Set this to true if you are running under a scoped
		class loader, e.g., inside an application server. Default is "false".
		-->
		false	

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3986551#3986551

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3986551