[jboss-jira] [JBoss JIRA] (JGRP-2262) "Frozen" coordinator causes the whole cluster to hang
Sibin Karnavar (JIRA)
issues at jboss.org
Tue Apr 24 10:28:01 EDT 2018
[ https://issues.jboss.org/browse/JGRP-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566227#comment-13566227 ]
Sibin Karnavar commented on JGRP-2262:
--------------------------------------
Thanks for creating this JIRA. I was about to create another one but found this similar issue.
I have faced a similar problem, and my thread dump is pasted below. I am using version 4.0.10. The problem is not reproducible every time.
1) I had 3 cluster nodes.
2) I started all my nodes together (this was due to a deployment of my service in Amazon).
3) After the restart, I checked my database table. I could not see the new ping inserts; instead I saw the old coordinator entry in the database. (Usually the entry is cleared, because the new coordinator removes it and writes the new information back. I have remove_all_data_on_view_change=true.)
4) My cluster was still working with 2 cluster nodes, but the third node was stuck waiting to start. My thread dump follows. I think this is similar to the issue described in this JIRA: the node waits at ClientGmsImpl.java:93 until I refresh the DB entry by changing the cluster coordinator. As soon as I change the cluster coordinator, JGroups clears the DB again and rewrites the latest view to it. After this, the hung node resumes starting from where it stopped.
"localhost-startStop-1" #18 daemon prio=5 os_prio=0 tid=0x00007f9268001000 nid=0x4b39 sleeping[0x00007f92a4512000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.jgroups.util.Util.sleep(Util.java:1866)
{color:red} at org.jgroups.protocols.pbcast.ClientGmsImpl.firstOfAllClients(ClientGmsImpl.java:177)
at org.jgroups.protocols.pbcast.ClientGmsImpl.joinInternal(ClientGmsImpl.java:93){color}
at org.jgroups.protocols.pbcast.ClientGmsImpl.join(ClientGmsImpl.java:41)
at org.jgroups.protocols.pbcast.GMS.down(GMS.java:1064)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:206)
at org.jgroups.protocols.FlowControl.down(FlowControl.java:300)
at org.jgroups.protocols.FRAG2.down(FRAG2.java:141)
at org.jgroups.protocols.RSVP.down(RSVP.java:102)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:901)
at org.jgroups.JChannel.down(JChannel.java:668)
at org.jgroups.JChannel._connect(JChannel.java:897)
at org.jgroups.JChannel.connect(JChannel.java:393)
- locked <0x00000007cd00cec8> (a org.jgroups.JChannel)
at org.jgroups.JChannel.connect(JChannel.java:384)
- locked <0x00000007cd00cec8> (a org.jgroups.JChannel)
at com.wellmanage.som.clustermanager.jgroup.AbstractClusterManager.connect(AbstractClusterManager.java:198)
at com.wellmanage.som.clustermanager.jgroup.AbstractClusterManager.start(AbstractClusterManager.java:163)
at com.wellmanage.som.clustermanager.ServiceClusterCoordinator.joinCluster(ServiceClusterCoordinator.java:194)
at com.wellmanage.som.statekeeper.DefaultSOMStateKeeper.start(DefaultSOMStateKeeper.java:250)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeCustomInitMethod(AbstractAutowireCapableBeanFactory.java:1835)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1778)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1706)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:583)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:502)
at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:312)
at org.springframework.beans.factory.support.AbstractBeanFactory$$Lambda$99/1286783232.getObject(Unknown Source)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:228)
- locked <0x00000005f80fc130> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:310)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:200)
at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:367)
at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:110)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1613)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1357)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:582)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:502)
at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:312)
at org.springframework.beans.factory.support.AbstractBeanFactory$$Lambda$99/1286783232.getObject(Unknown Source)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:228)
- locked <0x00000005f80fc130> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:310)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:200)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:368)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAutowireCapableBeanFactory.java:1250)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1099)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:545)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:502)
at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:312)
at org.springframework.beans.factory.support.AbstractBeanFactory$$Lambda$99/1286783232.getObject(Unknown Source)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:228)
- locked <0x00000005f80fc130> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:310)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:205)
at org.springframework.boot.web.servlet.ServletContextInitializerBeans.getOrderedBeansOfType(ServletContextInitializerBeans.java:226)
at org.springframework.boot.web.servlet.ServletContextInitializerBeans.getOrderedBeansOfType(ServletContextInitializerBeans.java:214)
at org.springframework.boot.web.servlet.ServletContextInitializerBeans.addServletContextInitializerBeans(ServletContextInitializerBeans.java:91)
at org.springframework.boot.web.servlet.ServletContextInitializerBeans.<init>(ServletContextInitializerBeans.java:79)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.getServletContextInitializerBeans(ServletWebServerApplicationContext.java:250)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.selfInitialize(ServletWebServerApplicationContext.java:237)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext$$Lambda$254/225344427.onStartup(Unknown Source)
at org.springframework.boot.web.embedded.tomcat.TomcatStarter.onStartup(TomcatStarter.java:54)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5204)
- locked <0x00000005f80f8200> (a org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedContext)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
- locked <0x00000005f80f8200> (a org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedContext)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1419)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1409)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00000005f80ff838> (a java.util.concurrent.ThreadPoolExecutor$Worker)
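The behaviour seen in the dump (a joiner sleeping and retrying while a stale coordinator entry stays in the ping table) can be modelled with a minimal sketch. This is not the JGroups code: the class JoinLoopSketch, the join method, the JoinResult enum and the maxAttempts bound are all hypothetical names introduced only for illustration, and discovery and the join request are replaced by plain Java callbacks.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical model of a join loop; NOT the actual ClientGmsImpl logic.
public class JoinLoopSketch {

    public enum JoinResult { JOINED, BECAME_COORDINATOR, GAVE_UP }

    /**
     * @param discovery   returns the coordinator addresses currently listed
     *                    (stands in for reading the ping table)
     * @param joinRequest returns true if the given coordinator answered the join
     * @param maxAttempts assumed bound on retries; without such a bound, a stale
     *                    entry that never answers keeps the loop running forever
     */
    public static JoinResult join(Supplier<List<String>> discovery,
                                  Predicate<String> joinRequest,
                                  int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<String> coordinators = discovery.get();
            if (coordinators.isEmpty()) {
                // Nobody is listed: this node may start the cluster itself.
                return JoinResult.BECAME_COORDINATOR;
            }
            String coordinator = coordinators.get(0);
            if (joinRequest.test(coordinator)) {
                return JoinResult.JOINED;
            }
            // The listed coordinator did not answer (frozen or stale entry).
            // An unbounded loop would simply retry from here.
        }
        return JoinResult.GAVE_UP;
    }

    public static void main(String[] args) {
        // Stale entry: discovery always returns a coordinator that never answers.
        System.out.println(join(() -> Collections.singletonList("stale-coordinator"),
                                c -> false, 3));
        // Empty table: the joiner becomes coordinator.
        System.out.println(join(Collections::emptyList, c -> true, 3));
        // Healthy coordinator: the join succeeds on the first attempt.
        System.out.println(join(() -> Collections.singletonList("live-coordinator"),
                                c -> true, 3));
    }
}
```

In this model the first call only terminates because of the assumed maxAttempts bound; once the stale entry is removed from what discovery returns, the same joiner proceeds, which matches what I observed after changing the coordinator.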
Thanks,
Sibin Karnavar
> "Frozen" coordinator causes the whole cluster to hang
> -----------------------------------------------------
>
> Key: JGRP-2262
> URL: https://issues.jboss.org/browse/JGRP-2262
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.6.7
> Reporter: Pietro Paolini
> Assignee: Bela Ban
> Fix For: 4.0.12
>
> Attachments: jdbc_test.xml, jgroup.zip
>
>
> This is the result of an investigation I carried out for a problem we have experienced within our
> application; the scenario has been re-created by pausing the JVM using a debugger.
> The discovery mechanism is JDBC_PING.
> If the coordinator's JVM gets frozen (for whatever reason) before the coordinator sets itself as the cluster coordinator, and another node is started after that, the new node will be unable to join the cluster and will hang indefinitely.
> This seems to be caused by the "continue" statement at
> https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/pbcast/ClientGmsImpl.java:92
> I have prepared a simple application which can help in replicating the problem.
> To replicate the problem:
> 1) Make sure the JGROUPSPING table is empty
> 2) Run the application from an IDE with a debugger attached so that the JVM is
> paused at line Main.java:67, and wait for it.
> 3) Run the application in non-debug mode, or with gradle using "gradle run", and it will
> hang indefinitely
> Depending on the UUID/IP address generated/assigned, this may not happen every time, but it happened quite often in my local tests.
>
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)