[infinispan-dev] ISPN-83 (FLUSH removal) causing state transfer timeouts

Paul Ferraro paul.ferraro at redhat.com
Wed Feb 23 11:18:24 EST 2011


After testing AS6 with Infinispan 4.2.1.CR3, I started seeing a slew of
testsuite failures, all due to state transfer timeouts.  These are
REPL_SYNC tests with state transfer enabled; the caches look roughly
like the sketch below.
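A minimal sketch of the cache setup, using the Infinispan 4.2
programmatic API (the timeout value here is my reconstruction to match
the 10s timeout in the log, not the exact AS6 testsuite configuration):

  import org.infinispan.config.Configuration;
  import org.infinispan.config.GlobalConfiguration;
  import org.infinispan.manager.DefaultCacheManager;

  // Clustered defaults: UDP transport via the bundled jgroups-udp.xml
  GlobalConfiguration global = GlobalConfiguration.getClusteredDefault();

  Configuration config = new Configuration();
  // Synchronous replication, as in the failing tests
  config.setCacheMode(Configuration.CacheMode.REPL_SYNC);
  // State transfer enabled; 10s matches the timeout seen below
  config.setFetchInMemoryState(true);
  config.setStateRetrievalTimeout(10000);

  DefaultCacheManager manager = new DefaultCacheManager(global, config);

A representative failure looks like this: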

2011-02-23 10:59:46,771 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Initiating state transfer process
2011-02-23 10:59:46,771 INFO  [org.infinispan.remoting.rpc.RpcManagerImpl] Trying to fetch state from T510-29294
2011-02-23 10:59:46,773 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] GET_STATE: asking T510-29294 for state, passing down a SUSPEND_STABLE event, timeout=10000
2011-02-23 10:59:46,773 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
2011-02-23 10:59:46,773 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=11000
2011-02-23 10:59:46,778 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] StateProviderThreadSpawner listening at 192.168.0.3:37173...
2011-02-23 10:59:46,778 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Responding to state requester T510-23375 with address 192.168.0.3:37173 and digest T510-23375: [0 : 0 (0)], T510-29294: [0 : 1 (1)]
2011-02-23 10:59:46,781 DEBUG [org.jgroups.protocols.pbcast.NAKACK]
[overwriteDigest()]
existing digest:  T510-23375: [0 : 0 (0)], T510-29294: [0 : 1 (1)]
new digest:       T510-29294: [0 : 1 (1)], T510-23375: [0 : 0 (0)]
resulting digest: T510-23375: [0 : 0 (0)], T510-29294: [0 : 1 (1)]
2011-02-23 10:59:46,781 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Connecting to state provider /192.168.0.3:37173, original buffer size was 43690 and was reset to 8192
2011-02-23 10:59:46,782 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Connected to state provider, my end of the socket is /192.168.0.3:38305 passing inputstream up...
2011-02-23 10:59:46,783 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] StateProviderThreadSpawner listening at 192.168.0.3:37173...
2011-02-23 10:59:46,783 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Accepted request for state transfer from /192.168.0.3:38305 handing of to PooledExecutor thread
2011-02-23 10:59:46,786 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Running on Thread[STREAMING_STATE_TRANSFER-sender-1,T510-29294,5,JGroups]. Accepted request for state transfer from /192.168.0.3:38305, original buffer size was 25434 and was reset to 8192, passing outputstream up...
2011-02-23 10:59:46,789 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Applying state
2011-02-23 10:59:46,789 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Generating state.  Can provide? true
2011-02-23 10:59:46,822 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Writing 0 StoredEntries to stream
2011-02-23 10:59:46,833 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] State generated, closing object stream
2011-02-23 10:59:46,834 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] State writer is closing the socket
2011-02-23 10:59:56,774 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
2011-02-23 10:59:56,834 WARN  [org.infinispan.remoting.rpc.RpcManagerImpl] Could not find available peer for state, backing off and retrying
2011-02-23 10:59:56,837 ERROR [org.infinispan.remoting.transport.jgroups.JGroupsTransport] Caught while requesting or applying state
org.infinispan.statetransfer.StateTransferException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from T510-29294
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:332)
        at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:199)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:582)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:712)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
        at org.jgroups.JChannel.up(JChannel.java:1422)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:525)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:464)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:225)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:190)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:419)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:401)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
        at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:615)
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:295)
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
        at org.jgroups.protocols.BARRIER.up(BARRIER.java:120)
        at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:169)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:269)
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
        at org.jgroups.protocols.Discovery.up(Discovery.java:292)
        at org.jgroups.protocols.PING.up(PING.java:67)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1093)
        at org.jgroups.protocols.TP.access$100(TP.java:56)
        at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1633)
        at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1615)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from T510-29294
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:267)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:116)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:394)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:102)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:126)
        at org.infinispan.statetransfer.StateTransferManagerImpl.mimicPartialFlushViaRPC(StateTransferManagerImpl.java:300)
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyTransactionLog(StateTransferManagerImpl.java:253)
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:321)
        ... 30 more

The problematic tests use the default JGroups configuration
(jgroups-udp.xml from the infinispan-core jar).  If I override the
default JGroups config and add FLUSH, these tests pass; the override is
sketched after this paragraph.
Note that there is no cache activity on the first cache before the
second node joins and starts its cache.
Does the cache need to be configured differently if FLUSH is not used?
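Concretely, the workaround looks roughly like this; the file name is
mine, and jgroups-udp-flush.xml is simply a copy of the bundled
jgroups-udp.xml with <pbcast.FLUSH timeout="0"/> added at the top of
the protocol stack:

  import java.util.Properties;

  import org.infinispan.config.GlobalConfiguration;

  GlobalConfiguration global = GlobalConfiguration.getClusteredDefault();
  Properties props = new Properties();
  // "configurationFile" tells Infinispan's JGroupsTransport to load a
  // custom stack definition from the classpath instead of the default
  props.setProperty("configurationFile", "jgroups-udp-flush.xml");
  global.setTransportProperties(props);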

Thoughts?


