[infinispan-issues] [JBoss JIRA] (ISPN-3947) HotRod client keep trying recover connections to a failed cluster

Thu Jan 30 04:33:28 EST 2014

    [ https://issues.jboss.org/browse/ISPN-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12939962#comment-12939962 ] 

Dan Berindei commented on ISPN-3947:
------------------------------------

I don't think configuring a number of retries per server is a good idea: it would mean that the actual timeout increases linearly with the number of servers, just like it does now. I think a total number of retries and/or a total timeout would be much better.
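
To make the scaling argument concrete, here is a rough sketch of the worst-case wait under the two approaches, assuming every attempt blocks for the full read or connect timeout. The class name, method names, and the 60-second timeout are made up for illustration; none of them are existing HotRod client settings.

```java
public class RetryBudget {
    // Worst case when each server gets its own retry allowance
    // (hypothetical "retries per server" setting): grows with cluster size.
    static long worstCasePerServerMillis(int servers, int retriesPerServer, long attemptTimeoutMillis) {
        return (long) servers * retriesPerServer * attemptTimeoutMillis;
    }

    // Worst case with a single cap on attempts (hypothetical "total retries"
    // setting): independent of how many servers are in the cluster.
    static long worstCaseTotalRetriesMillis(int maxTotalRetries, long attemptTimeoutMillis) {
        return (long) maxTotalRetries * attemptTimeoutMillis;
    }

    public static void main(String[] args) {
        long attemptTimeoutMillis = 60_000; // assumed value, for illustration only
        for (int servers : new int[] {2, 4, 8}) {
            long perServer = worstCasePerServerMillis(servers, 3, attemptTimeoutMillis);
            long total = worstCaseTotalRetriesMillis(3, attemptTimeoutMillis);
            System.out.printf("%d servers: per-server cap waits up to %d s, total cap waits up to %d s%n",
                    servers, perServer / 1000, total / 1000);
        }
    }
}
```

With a per-server cap the wait doubles every time the cluster doubles; with a total cap it stays flat. A total timeout would bound the wait even more directly, since it would not depend on the per-attempt timeout either.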

Note that with the default configuration (i.e. `testOnBorrow == false`), only the first attempt is done on the primary owner of the key. All the other attempts use a random server. (Actually it's round-robin, but the state is shared, so for an individual thread it would look random.) A setting named `retriesPerServer` would make me think that it's the number of retries on one server before trying another.
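
A small sketch of why shared round-robin state looks random to one caller. This is not the client's actual balancing code; the class, the server names, and the interleaving pattern are invented for illustration, and the other threads are simulated in a single thread to keep the run deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedRoundRobin {
    private final List<String> servers;
    // One cursor shared by every caller, as with shared round-robin state.
    private final AtomicInteger cursor = new AtomicInteger();

    SharedRoundRobin(List<String> servers) {
        this.servers = servers;
    }

    // Strict round-robin over the server list; floorMod keeps the index
    // valid even after the counter overflows.
    String next() {
        return servers.get(Math.floorMod(cursor.getAndIncrement(), servers.size()));
    }

    public static void main(String[] args) {
        SharedRoundRobin balancer = new SharedRoundRobin(Arrays.asList("A", "B", "C"));
        // Number of calls made by *other* threads between two calls of "our" thread.
        int[] otherCallsBetween = {2, 0, 4, 1};
        StringBuilder seenByOurThread = new StringBuilder();
        for (int others : otherCallsBetween) {
            seenByOurThread.append(balancer.next());
            for (int i = 0; i < others; i++) {
                balancer.next(); // consumed by some other thread
            }
        }
        // Globally the order is A, B, C, A, B, C, ...; our thread sees an irregular subsequence.
        System.out.println("Servers seen by our thread: " + seenByOurThread);
    }
}
```

Because other callers advance the cursor between our calls, a single thread can hit the same server twice in a row or skip servers entirely, so "retries per server" does not map onto what one operation actually experiences.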

Also, I haven't tested this, but with `testOnBorrow == true` I think the pool will catch the timeout in the ping operation and retry `maxActive` times internally, before the HotRod client does its own retrying in `RetryOnFailureOperation`. We can probably ignore it, since enabling `testOnBorrow` would be bad for performance anyway, but we should probably document it.

> HotRod client keep trying recover connections to a failed cluster
> -----------------------------------------------------------------
>
>                 Key: ISPN-3947
>                 URL: https://issues.jboss.org/browse/ISPN-3947
>             Project: Infinispan
>          Issue Type: Feature Request
>          Components: Remote Protocols
>    Affects Versions: 6.0.1.Final, 7.0.0.Alpha1
>            Reporter: Wolf-Dieter Fink
>            Assignee: Galder Zamarreño
>              Labels: hotrod, hotrod-java-client
>
> If an infinispan-server cluster is no longer reachable for some reason, e.g. a network disconnect, the HotRod client tries to re-establish the lost connections.
> The client library retries a fixed number of times, based on the maximum number of connections in the pool, or 10 multiplied by the number of available servers.
> This can lead to a very long delay before the application can continue and react, as it waits for the read or connect timeout on each try.
> To improve this behaviour, there should be a configurable limit of retries per server and/or a total timeout.
> This would give the application the chance to handle a remote-cache failure and reply to the user instead of hanging for minutes (with the default settings).
