[jboss-jira] [JBoss JIRA] (WFLY-5135) Infinispan cluster topology rebalance

Mon Nov 9 11:59:00 EST 2015

     [ https://issues.jboss.org/browse/WFLY-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Ferraro resolved WFLY-5135.
--------------------------------
    Fix Version/s: 10.0.0.CR3
       Resolution: Done


Fixed by https://github.com/wildfly/wildfly/commit/25e62d6c854fb567aa743f3828ef34b8916c9116#diff-ae80be659295f932d6888cb6b5c29320

> Infinispan cluster topology rebalance
> -------------------------------------
>
>                 Key: WFLY-5135
>                 URL: https://issues.jboss.org/browse/WFLY-5135
>             Project: WildFly
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 9.0.1.Final
>         Environment: CentOS 7, Java 8u51, Wildfly 9.0.1.Final, TCP/TCPPING cluster
>            Reporter: Gabor Auth
>            Assignee: Paul Ferraro
>             Fix For: 10.0.0.CR3
>
>
> Configured TCPPING based TCP transport:
> {code}
> <stack name="tcpping">
>   <transport type="TCP" socket-binding="jgroups-tcp"/>
>   <protocol type="TCPPING">
>     <property name="port_range">0</property>
>     <property name="initial_hosts">...</property>
>   </protocol>
>   <protocol type="MERGE2"/>
>   <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
>   <protocol type="FD"/>
>   <protocol type="VERIFY_SUSPECT"/>
>   <protocol type="BARRIER"/>
>   <protocol type="pbcast.NAKACK"/>
>   <protocol type="UNICAST2"/>
>   <protocol type="pbcast.STABLE"/>
>   <protocol type="pbcast.GMS">
>     <property name="join_timeout">3000</property>
>   </protocol>
>   <protocol type="MFC"/>
>   <protocol type="FRAG2"/>
>   <protocol type="RSVP"/>
> </stack>
> {code}
> The cluster works, but after some node restarts one of the nodes tries to acquire the coordinator role and fails (the previous coordinator was dc02-wild01):
> {code}
> 2015-08-19 10:00:41,597 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-16,ee,dc01-wild01) ISPN000094: Received new cluster view for channel gacivsContainer: [dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]
> 2015-08-19 10:00:41,598 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-16,ee,dc01-wild01) ISPN000094: Received new cluster view for channel web: [dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]
> 2015-08-19 10:00:41,598 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-16,ee,dc01-wild01) ISPN000094: Received new cluster view for channel ejb: [dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]
> 2015-08-19 10:00:46,448 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-19,ee,dc01-wild01) ISPN000093: Received new, MERGED cluster view for channel gacivsContainer: MergeView::[dc01-wild01|14] (4) [dc01-wild01, dc02-wild01, dc01-wild02, dc02-wild02], 2 subgroups: [dc02-wild02|12] (3) [dc02-wild02, dc01-wild01, dc01-wild02], [dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]
> 2015-08-19 10:00:46,450 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-19,ee,dc01-wild01) ISPN000093: Received new, MERGED cluster view for channel web: MergeView::[dc01-wild01|14] (4) [dc01-wild01, dc02-wild01, dc01-wild02, dc02-wild02], 2 subgroups: [dc02-wild02|12] (3) [dc02-wild02, dc01-wild01, dc01-wild02], [dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]
> 2015-08-19 10:00:46,450 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-19,ee,dc01-wild01) ISPN000093: Received new, MERGED cluster view for channel ejb: MergeView::[dc01-wild01|14] (4) [dc01-wild01, dc02-wild01, dc01-wild02, dc02-wild02], 2 subgroups: [dc02-wild02|12] (3) [dc02-wild02, dc01-wild01, dc01-wild02], [dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]
> 2015-08-19 10:00:46,637 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p4-t8) ISPN000196: Failed to recover cluster state after the current node became the coordinator: org.infinispan.commons.CacheException: Unsuccessful response received from node dc02-wild01: CacheNotFoundResponse
>         at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:482)
>         at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:350)
>         at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:286)
>         at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:590)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> And then the cluster is gone...
> {code}
> 2015-08-19 10:10:17,312 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread--p5-t49) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key 113481734232139653919 and requestor GlobalTransaction:<dc01-wild01>:628:remote. Lock is held by GlobalTransaction:<dc01-wild01>:602:remote, while request came from dc01-wild01
> [...]
> 2015-08-19 10:13:32,295 WARNING [org.jgroups.protocols.TCP] (TransferQueueBundler,ee,dc01-wild01) JGRP000032: dc01-wild01: no physical address for dc02-wild01, dropping message
> [...]
> 2015-08-19 10:11:32,707 WARN  [org.infinispan.transaction.impl.TransactionTable] (TxCleanupService,dist,dc01-wild02) ISPN000326: Remote transaction GlobalTransaction:<dc02-wild02>:1240:remote timed out. Rolling back after 84623 ms
> {code}
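
The view lines in the log above can be compared mechanically. The following is only a sketch for reading them, not part of the fix: it assumes the `[creator|id] (size) [members]` layout seen in the ISPN000094/ISPN000093 messages and relies on the JGroups convention that the first member of a view acts as coordinator. The names `VIEW_RE` and `parse_views` are made up for this illustration.

```python
import re

# Matches one view as printed in the log lines above:
#   [creator|id] (size) [member, member, ...]
VIEW_RE = re.compile(
    r"\[(?P<creator>[^|\]]+)\|(?P<id>\d+)\] "
    r"\((?P<size>\d+)\) "
    r"\[(?P<members>[^\]]*)\]"
)


def parse_views(line):
    """Return every view found in a log line, in order of appearance."""
    views = []
    for m in VIEW_RE.finditer(line):
        members = [name.strip() for name in m.group("members").split(",")]
        views.append({
            "creator": m.group("creator"),
            "view_id": int(m.group("id")),
            "members": members,
            # Assumption: the first member of a JGroups view is the coordinator.
            "coordinator": members[0],
        })
    return views


# Message text copied from the log excerpt in this issue (channel "web").
plain = (
    "ISPN000094: Received new cluster view for channel web: "
    "[dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]"
)
merged = (
    "ISPN000093: Received new, MERGED cluster view for channel web: "
    "MergeView::[dc01-wild01|14] (4) "
    "[dc01-wild01, dc02-wild01, dc01-wild02, dc02-wild02], 2 subgroups: "
    "[dc02-wild02|12] (3) [dc02-wild02, dc01-wild01, dc01-wild02], "
    "[dc02-wild02|13] (4) [dc02-wild02, dc01-wild01, dc01-wild02, dc02-wild01]"
)

before = parse_views(plain)[0]
after, *subgroups = parse_views(merged)

print("coordinator before merge:", before["coordinator"])
print("coordinator after merge: ", after["coordinator"])
print("merged subgroup view ids:", [v["view_id"] for v in subgroups])
```

Read this way, the merge at 10:00:46 moves the first position in the view from dc02-wild02 to dc01-wild01, which is the node that then logs ISPN000196.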

