[infinispan-dev] ISPN-83 (FLUSH removal) causing state transfer timeouts

Galder Zamarreño galder at redhat.com
Thu Feb 24 05:29:33 EST 2011


Interesting stuff. Michal has reported a very similar issue: https://issues.jboss.org/browse/ISPN-937

I had initially thought it might be related to the rehashing changes, which also touched some of the state transfer code.

I've asked Michal to try with a UDP stack that contains FLUSH.
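For anyone who wants to try the same thing: FLUSH sits at the very top of the protocol stack, and JGroups config files list protocols bottom-up, so re-adding it means appending it as the last element in a copy of jgroups-udp.xml. A rough sketch of the tail of such a config (the surrounding protocols and attribute values here are illustrative only; check your actual jgroups-udp.xml for the real ones):

```xml
<!-- Tail of a modified jgroups-udp.xml: protocols are listed bottom-up,
     so FLUSH, being the topmost protocol, goes last. -->
<pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
<FRAG2 frag_size="60000"/>
<pbcast.STREAMING_STATE_TRANSFER/>
<!-- Re-added: coordinates a cluster-wide flush around view changes
     and state transfer -->
<pbcast.FLUSH timeout="0"/>
```

The modified file can then be pointed to from the Infinispan global configuration via the transport's configurationFile property instead of the bundled default.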

On Feb 23, 2011, at 7:09 PM, Paul Ferraro wrote:

> CR2 was unusable for me because of ISPN-934.
> 
> On Wed, 2011-02-23 at 11:54 -0500, Vladimir Blagojevic wrote:
>> Paul,
>> 
>> How is the testsuite tolerating 4.2.1.CR2? I removed FLUSH in the CR2
>> release [1].
>> 
>> Vladimir
>> 
>> [1] https://issues.jboss.org/browse/ISPN-83
>> 
>> 
>> On 11-02-23 11:18 AM, Paul Ferraro wrote:
>>> After testing AS6 with Infinispan 4.2.1.CR3, I started seeing a slew of
>>> testsuite failures, all due to state transfer timeouts.  These are
>>> REPL_SYNC tests with state transfer enabled:
>>> e.g.
>>> 
>>> 2011-02-23 10:59:46,771 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Initiating state transfer process
>>> 2011-02-23 10:59:46,771 INFO  [org.infinispan.remoting.rpc.RpcManagerImpl] Trying to fetch state from T510-29294
>>> 2011-02-23 10:59:46,773 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] GET_STATE: asking T510-29294 for state, passing down a SUSPEND_STABLE event, timeout=10000
>>> 2011-02-23 10:59:46,773 DEBUG [org.jgroups.protocols.pbcast.STABLE] suspending message garbage collection
>>> 2011-02-23 10:59:46,773 DEBUG [org.jgroups.protocols.pbcast.STABLE] resume task started, max_suspend_time=11000
>>> 2011-02-23 10:59:46,778 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] StateProviderThreadSpawner listening at 192.168.0.3:37173...
>>> 2011-02-23 10:59:46,778 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Responding to state requester T510-23375 with address 192.168.0.3:37173 and digest T510-23375: [0 : 0 (0)], T510-29294: [0 : 1 (1)]
>>> 2011-02-23 10:59:46,781 DEBUG [org.jgroups.protocols.pbcast.NAKACK]
>>> [overwriteDigest()]
>>> existing digest:  T510-23375: [0 : 0 (0)], T510-29294: [0 : 1 (1)]
>>> new digest:       T510-29294: [0 : 1 (1)], T510-23375: [0 : 0 (0)]
>>> resulting digest: T510-23375: [0 : 0 (0)], T510-29294: [0 : 1 (1)]
>>> 2011-02-23 10:59:46,781 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Connecting to state provider /192.168.0.3:37173, original buffer size was 43690 and was reset to 8192
>>> 2011-02-23 10:59:46,782 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Connected to state provider, my end of the socket is /192.168.0.3:38305 passing inputstream up...
>>> 2011-02-23 10:59:46,783 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] StateProviderThreadSpawner listening at 192.168.0.3:37173...
>>> 2011-02-23 10:59:46,783 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Accepted request for state transfer from /192.168.0.3:38305 handing of to PooledExecutor thread
>>> 2011-02-23 10:59:46,786 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] Running on Thread[STREAMING_STATE_TRANSFER-sender-1,T510-29294,5,JGroups]. Accepted request for state transfer from /192.168.0.3:38305, original buffer size was 25434 and was reset to 8192, passing outputstream up...
>>> 2011-02-23 10:59:46,789 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Applying state
>>> 2011-02-23 10:59:46,789 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Generating state.  Can provide? true
>>> 2011-02-23 10:59:46,822 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] Writing 0 StoredEntries to stream
>>> 2011-02-23 10:59:46,833 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] State generated, closing object stream
>>> 2011-02-23 10:59:46,834 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] State writer is closing the socket
>>> 2011-02-23 10:59:56,774 DEBUG [org.jgroups.protocols.pbcast.STABLE] resuming message garbage collection
>>> 2011-02-23 10:59:56,834 WARN  [org.infinispan.remoting.rpc.RpcManagerImpl] Could not find available peer for state, backing off and retrying
>>> 2011-02-23 10:59:56,837 ERROR [org.infinispan.remoting.transport.jgroups.JGroupsTransport] Caught while requesting or applying state
>>> org.infinispan.statetransfer.StateTransferException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from T510-29294
>>>         at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:332)
>>>         at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:199)
>>>         at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:582)
>>>         at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:712)
>>>         at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
>>>         at org.jgroups.JChannel.up(JChannel.java:1422)
>>>         at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
>>>         at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:525)
>>>         at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:464)
>>>         at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:225)
>>>         at org.jgroups.protocols.FRAG2.up(FRAG2.java:190)
>>>         at org.jgroups.protocols.FlowControl.up(FlowControl.java:419)
>>>         at org.jgroups.protocols.FlowControl.up(FlowControl.java:401)
>>>         at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
>>>         at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
>>>         at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:615)
>>>         at org.jgroups.protocols.UNICAST.up(UNICAST.java:295)
>>>         at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
>>>         at org.jgroups.protocols.BARRIER.up(BARRIER.java:120)
>>>         at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:169)
>>>         at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:269)
>>>         at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
>>>         at org.jgroups.protocols.Discovery.up(Discovery.java:292)
>>>         at org.jgroups.protocols.PING.up(PING.java:67)
>>>         at org.jgroups.protocols.TP.passMessageUp(TP.java:1093)
>>>         at org.jgroups.protocols.TP.access$100(TP.java:56)
>>>         at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1633)
>>>         at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1615)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>         at java.lang.Thread.run(Thread.java:679)
>>> Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from T510-29294
>>>         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:267)
>>>         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:116)
>>>         at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:394)
>>>         at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:102)
>>>         at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:126)
>>>         at org.infinispan.statetransfer.StateTransferManagerImpl.mimicPartialFlushViaRPC(StateTransferManagerImpl.java:300)
>>>         at org.infinispan.statetransfer.StateTransferManagerImpl.applyTransactionLog(StateTransferManagerImpl.java:253)
>>>         at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:321)
>>>         ... 30 more
>>> 
>>> The problematic tests use the default JGroups configuration
>>> (jgroups-udp.xml from the infinispan-core jar).  If I override the
>>> default JGroups config and add FLUSH, these tests pass.
>>> Note that there is no cache activity on the first cache before the
>>> second node joins and starts its cache.
>>> Does the cache need to be configured differently if FLUSH is not used?
>>> 
>>> Thoughts?
>>> 
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> infinispan-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> 
> 
> 

--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache



