In our production environment, all hosts have redundant network links, which are intended
to protect against a single link failure. Does anyone have an example or best practices for
configuring JGroups so that it keeps working in such an environment despite a single link
failure?
We built a prototype, but it failed; details are below.
Thank you in advance.
Kind regards
Mariusz
Version: JBossCache 1.4.1 SP3, JGroups 2.4.1
Environment: a LAN consisting of two hosts, each with two NICs (eth0, eth1). The hosts are
connected directly (eth0-to-eth0, eth1-to-eth1) and configured as a single IPv4 subnet.
JGroups is intended to communicate on both interfaces and to use multicast (see
Configuration details below).
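Before looking at the JGroups stack itself, it may be worth confirming that the JVM sees both NICs as up and multicast-capable. The following is only a minimal sketch using the standard java.net.NetworkInterface API (isUp() and supportsMulticast() require Java 6 or later). The class and method names are made up for illustration, and the interface names eth0/eth1 are taken from the environment described above.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of JGroups or JBossCache.
class MulticastInterfaceCheck {

    /** Returns the names of all interfaces that are up and report multicast support. */
    static List<String> multicastCapableInterfaces() {
        List<String> names = new ArrayList<String>();
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) {
                return names; // no interfaces reported on this platform
            }
            for (NetworkInterface nic : Collections.list(all)) {
                if (nic.isUp() && nic.supportsMulticast()) {
                    names.add(nic.getName());
                }
            }
        } catch (SocketException e) {
            // Treat an I/O error while querying interfaces as "nothing usable found so far".
            System.err.println("Could not query interfaces: " + e.getMessage());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> usable = multicastCapableInterfaces();
        // eth0 and eth1 are the interface names assumed from the post.
        for (String wanted : new String[] {"eth0", "eth1"}) {
            System.out.println(wanted + " usable for multicast: " + usable.contains(wanted));
        }
    }
}
```

Running this on each host with one link unplugged at a time would show whether the JVM's view of the interfaces changes when a cable is pulled. It does not say anything about how JGroups itself reacts.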
Test description:
- both links are connected
- one JBossCache instance is started on each node
- replication works correctly
- link eth1-to-eth1 is disconnected
- replication works correctly
- link eth1-to-eth1 is reconnected, link eth0-to-eth0 is disconnected
- replication works correctly
! after a short time (around 5 sec) both instances report an exception (see below) to one
another and then break because the exception is not caught
I don't know whether simply catching the exception is enough. At a high level, it looks
like JGroups/JBossCache has a problem with this configuration.
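For reference, "simply catching the exception" could look roughly like the sketch below: the replicated call is wrapped and retried a few times before giving up. This is an assumption-laden illustration, not a fix. ReplicationFailure is a hypothetical stand-in defined here (it is not org.jboss.cache.ReplicationException), and withRetry, maxAttempts and pauseMillis are invented names. Whether retrying is sufficient when the underlying link problem persists is exactly the open question.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from JBossCache or JGroups.
class ReplicationRetrySketch {

    /** Stand-in for a replication timeout raised by the cache. */
    static class ReplicationFailure extends RuntimeException {
        ReplicationFailure(String message) {
            super(message);
        }
    }

    /** Runs op, retrying on ReplicationFailure until maxAttempts calls have been made. */
    static <T> T withRetry(Callable<T> op, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ReplicationFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (ReplicationFailure e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // preserve interrupt status
                        throw last;
                    }
                }
            } catch (RuntimeException e) {
                throw e; // unrelated failures are not retried
            } catch (Exception e) {
                throw new RuntimeException(e); // wrap checked exceptions from the callable
            }
        }
        throw last; // all attempts failed
    }
}
```

In the test program, the body of the Callable would be the cache operation that currently fails (the remove call visible in the stack trace below), with the real exception type substituted for the stand-in.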
Configuration details:
<UDP mcast_addr="228.8.8.8" mcast_port="45566"
ip_ttl="64" ip_mcast="true"
mcast_send_buf_size="150000"
mcast_recv_buf_size="80000"
ucast_send_buf_size="150000"
ucast_recv_buf_size="80000"
loopback="false"
receive_on_all_interfaces="true"
send_on_all_interfaces="true"
receive_interfaces="eth0,eth1"
send_interfaces="eth0,eth1"/>
<PING timeout="2000" num_initial_members="3"
up_thread="false" down_thread="false"/>
<MERGE2 min_interval="10000"
max_interval="20000"/>
<!-- <FD shun="true" up_thread="true"
own_thread="true" />-->
<FD_SOCK/>
<VERIFY_SUSPECT timeout="1500" up_thread="false"
down_thread="false"/>
<pbcast.NAKACK gc_lag="50"
retransmit_timeout="600,1200,2400,4800"
max_xmit_size="8192" up_thread="false"
down_thread="false"/>
<UNICAST timeout="600,1200,2400" window_size="100"
min_threshold="10"
down_thread="false"/>
<pbcast.STABLE desired_avg_gossip="20000"
up_thread="false" down_thread="false"/>
<pbcast.GMS join_timeout="5000"
join_retry_timeout="2000"
shun="true" print_local_addr="true"/>
<FC max_credits="2000000" down_thread="false"
up_thread="false"
min_threshold="0.20"/>
<FRAG frag_size="8192" down_thread="false"
up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true"
down_thread="true"/>
Logs with exception:
[2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.GroupRequest(execute:195)]: call
did not execute correctly, request is [GroupRequest:
req_id=1188480009786
caller=10.10.0.2:32781
10.10.0.1:32781: sender=10.10.0.1:32781, retval=null, received=false, suspected=false
request_msg: [dst: , src: 10.10.0.2:32781 (3 headers), size = 34 bytes]
rsp_mode: GET_ALL
done: false
timeout: 20000
expected_mbrs: 0
]
[2007-08-30 15:20:29,796|DEBUG|main;
|org.jgroups.blocks.RpcDispatcher(callRemoteMethods:193)]: responses:
[sender=10.10.0.1:32781, retval=null, received=false, suspected=false]
[2007-08-30 15:20:29,797|DEBUG|main; |org.jboss.cache.TreeCache(callRemoteMethods:4405)]:
(10.10.0.2:32781): responses for method _replicate:
[sender=10.10.0.1:32781, retval=null, received=false, suspected=false]
[2007-08-30 15:20:29,798|DEBUG|main;
|org.jboss.cache.interceptors.BaseRpcInterceptor(replicateCall:118)]:
responses=[org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null,
received=false, suspected=false]
[2007-08-30 15:20:29,800|DEBUG|main;
|org.jboss.cache.interceptors.BaseRpcInterceptor(checkResponses:79)]: Received Throwable
from remote node
org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null,
received=false, suspected=false
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4422)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4344)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4455)
at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:365)
at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:183)
at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5863)
at org.jboss.cache.TreeCache.remove(TreeCache.java:3929)
at org.jboss.cache.TreeCache.remove(TreeCache.java:3915)
at test.jbcache.DistributedTree.remove(DistributedTree.java:41)
at test.jbcache.DistributedTest.handleSession(DistributedTest.java:46)
at test.jbcache.DistributedTest.main(DistributedTest.java:78)
Caused by: org.jboss.cache.lock.TimeoutException: Response timed out:
sender=10.10.0.1:32781, retval=null, received=false, suspected=false
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4420)
... 17 more