[infinispan-issues] [JBoss JIRA] Commented: (ISPN-1016) Hash-aware view update causing lock ups in Hot Rod
Galder Zamarreño (JIRA)
jira-events at lists.jboss.org
Thu Mar 31 06:13:37 EDT 2011
[ https://issues.jboss.org/browse/ISPN-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592906#comment-12592906 ]
Galder Zamarreño commented on ISPN-1016:
----------------------------------------
I think doing the view replace call while encoding is a bad idea. On one side, if multiple requests come in in parallel, they could all be trying to do the same thing. On the other, since we have fixed positions, such a hash recalculation does not make sense for anyone: the only node that might not have a hash id yet is the node that just joined, and that can already be calculated when the node starts up, bearing in mind that all defined caches are started when the server starts. So, there's no reason the hash-aware topology address cannot be calculated on startup, removing a potential deadlock from the response encoding path.
I'm starting to think that starting pre-defined caches and not allowing dynamic ones is a good thing, since it simplifies this in a big way :)
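To make that concrete, here is a minimal Java sketch of the startup-time registration idea. It is only an illustration under assumptions: the actual server code looks different, and the map-based "view" entry, the cache name and the class/method names are made up here.

import java.util.HashMap;
import java.util.Map;
import org.infinispan.Cache;
import org.infinispan.remoting.transport.Address;

// Sketch only: registers this node's hash id in a shared topology view at startup,
// so the Hot Rod response encoder never has to write to the view while encoding.
public final class StartupHashIdRegistration {

   private final Cache<String, Map<Address, Integer>> topologyCache; // "view" -> node -> hash id
   private final Address self;

   public StartupHashIdRegistration(Cache<String, Map<Address, Integer>> topologyCache, Address self) {
      this.topologyCache = topologyCache;
      this.self = self;
   }

   // Called once while the server starts its pre-defined caches, so the hash id is
   // already part of the topology view before any client request is encoded.
   public void register(int hashId) {
      Map<Address, Integer> current = topologyCache.get("view");
      Map<Address, Integer> updated =
            current == null ? new HashMap<Address, Integer>() : new HashMap<Address, Integer>(current);
      updated.put(self, hashId);
      // One conditional write at startup; afterwards the encoder only reads the view.
      if (current == null)
         topologyCache.putIfAbsent("view", updated);
      else
         topologyCache.replace("view", current, updated);
   }
}

With something like this in place the encoding path only needs a plain get("view"), so no cluster-wide RPC is issued while a response is being written.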
> Hash-aware view update causing lock ups in Hot Rod
> --------------------------------------------------
>
> Key: ISPN-1016
> URL: https://issues.jboss.org/browse/ISPN-1016
> Project: Infinispan
> Issue Type: Bug
> Components: Cache Server
> Affects Versions: 4.2.1.FINAL
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Fix For: 4.2.2.BETA1, 5.0.0.BETA1
>
>
> When encoding a Hot Rod response, if the encoder discovers that the client has an old view, it decides that a new topology needs to be sent to the client. Now, when building this view for distributed caches, the encoder checks whether any of the nodes' hash ids have changed in the new view, and if so, it sends a cluster-wide replace with the view containing the new hash ids.
> This seems to cause deadlocks, as shown in JBPAPP-6113, where one node is timing out sending the replace to another node while that node is timing out doing the same. This needs further thinking, but I have some ideas in mind...
> On top of that, it appears that a failure here causes problems for the requests that follow, so some thinking needs to be done to see whether that replace() call can be moved out of there...
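For context, a simplified Java sketch of the pattern described above; the names and the map-based view are invented for illustration, not taken from the real encoder:

import java.util.Map;
import org.infinispan.Cache;
import org.infinispan.remoting.transport.Address;

// Illustration only: a cluster-wide replace issued from the response-encoding path.
final class EncodePathViewUpdate {

   private EncodePathViewUpdate() {}

   // Returns the topology view to write into the Hot Rod response.
   static Map<Address, Integer> viewForResponse(Cache<String, Map<Address, Integer>> topologyCache,
                                                Map<Address, Integer> recalculated) {
      Map<Address, Integer> current = topologyCache.get("view");
      if (current != null && !recalculated.equals(current)) {
         // Synchronous, cluster-replicated write performed while a response is being encoded.
         // If two nodes take this branch at the same time for different clients, each can end
         // up waiting on the other's replication and time out, as seen in JBPAPP-6113.
         topologyCache.replace("view", current, recalculated);
      }
      return recalculated;
   }
}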
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira