[infinispan-issues] [JBoss JIRA] (ISPN-6388) Spark integration - TimeoutException: Replication timeout on application execution

Fri Apr 22 09:25:00 EDT 2016

    [ https://issues.jboss.org/browse/ISPN-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195777#comment-13195777 ] 

Gustavo Fernandes commented on ISPN-6388:
-----------------------------------------

Here's my theory of what happened in your test.

There were failures during the iteration: either a server went down or it stopped responding for some reason, perhaps due to GC (the exact cause does not matter).
When such a failure occurs, the client retries the iteration with the segments that were not yet completed. According to your logs you were using the Hot Rod client 8.1.0.Final, which is affected by https://issues.jboss.org/browse/ISPN-6234: after a failover, the client would retry with the wrong segments. Because the segments were wrong, the iteration was no longer confined to the server the client had contacted. That server then had to issue remote RPCs to obtain the data for those segments, which ultimately caused a cascade effect resulting in timeouts.
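To illustrate the theory, here is a minimal sketch of a toy model in Python. It is not Hot Rod client code, and everything in it is hypothetical: two servers (`server-A`, `server-B`), eight segments, and the helpers `remote_rpcs`, `retry_segments` and `buggy_retry_segments`. The buggy variant simply requests every segment; it stands in for "retrying with the wrong segments" and is not the actual ISPN-6234 logic. The point is that retrying only the unfinished segments keeps the work local to the contacted server, while a wrong segment set forces that server to fetch data remotely.

```python
# Hypothetical toy model: 8 segments, server-A owns 0-3, server-B owns 4-7.
OWNERS = {seg: ("server-A" if seg < 4 else "server-B") for seg in range(8)}


def remote_rpcs(server: str, segments: set[int]) -> int:
    """Count the segments `server` would have to fetch from other nodes."""
    return sum(1 for seg in segments if OWNERS[seg] != server)


def retry_segments(requested: set[int], completed: set[int]) -> set[int]:
    """Expected failover behaviour: retry only the segments not yet finished."""
    return requested - completed


def buggy_retry_segments(requested: set[int], completed: set[int]) -> set[int]:
    """Hypothetical stand-in for a wrong-segment retry: ask for every segment."""
    return set(OWNERS)


# The client iterates over server-A's segments; 0 and 1 finish before the
# server stops responding, and the retry is sent to server-A again.
requested = {seg for seg, owner in OWNERS.items() if owner == "server-A"}
completed = {0, 1}
good = retry_segments(requested, completed)
bad = buggy_retry_segments(requested, completed)
print(sorted(good), remote_rpcs("server-A", good))  # [2, 3] 0
print(sorted(bad), remote_rpcs("server-A", bad))    # [0, 1, 2, 3, 4, 5, 6, 7] 4
```

In the buggy case the retry both repeats already-completed segments and pulls in segments owned by another node, and in a real cluster every such remote fetch is an extra RPC that can time out under load.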

I believe the timeouts should no longer occur (I was not able to reproduce the issue). Could you test again with Infinispan 8.2.1.Final (both client and server) and the Spark connector 0.3?

> Spark integration - TimeoutException: Replication timeout on application execution 
> -----------------------------------------------------------------------------------
>
>                 Key: ISPN-6388
>                 URL: https://issues.jboss.org/browse/ISPN-6388
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 8.2.0.Final
>            Reporter: Matej Čimbora
>            Assignee: Gustavo Fernandes
>         Attachments: app_0.txt, driver.txt, server.txt
>
>
> The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue occurs).
> Using an 8-node cluster (1 worker/1 ISPN server per physical node), connector v. 0.2.
> Attached sample driver, server, application logs.

--
This message was sent by Atlassian JIRA
(v6.4.11#64026)