[ https://issues.redhat.com/browse/ISPN-12652?page=com.atlassian.jira.plugi... ]
Dmitry Kruglikov updated ISPN-12652:
------------------------------------
Description:
We have 3 nodes in the cluster: app1, app2 and app3. App1 was shut down ungracefully because
of a hardware issue. After that, app2 and app3 started to fail with errors like the following:
{noformat}
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p23-t1) ISPN000136: Error executing command RemoveCommand on Cache 'fs.war', writing keys [SessionCreationMetaDataKey(PGARVVdjGKfifzrVfyd7HAllbrwaRG7wLhKha1On)]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 422657 from app1
    at org.infinispan@9.4.14.Final//org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
    at org.infinispan@9.4.14.Final//org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
    at org.infinispan@9.4.14.Final//org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
As a result, these two nodes (app2 and app3) could not serve user requests until app1
recovered. Is this expected behavior? Shouldn't Infinispan detect that one of the nodes is
down, remove it from the cluster view, and notify app2 and app3? I know JGroups has a
VERIFY_SUSPECT protocol for this, but it did not seem to kick in.
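For context, here is a minimal sketch of the configuration knobs involved, assuming the Infinispan 9.4 embedded (programmatic) API; the cluster name, JGroups stack file name and timeout values are illustrative placeholders, not our production settings. The ISPN000476 error is governed by the cache's remote-timeout, while detection and removal of a crashed member is handled by the JGroups failure-detection protocols (FD_ALL/FD_SOCK followed by VERIFY_SUSPECT) defined in the transport's stack file:

{code:java}
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.partitionhandling.PartitionHandling;

public class ClusterConfigSketch {
    public static void main(String[] args) {
        // Transport: "jgroups-stack.xml" is a placeholder stack file. Its FD_ALL/FD_SOCK
        // and VERIFY_SUSPECT timeouts decide how quickly a crashed member (app1) is
        // suspected, verified and removed from the cluster view.
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
        global.transport()
              .clusterName("app-cluster")                        // illustrative name
              .addProperty("configurationFile", "jgroups-stack.xml");

        // Cache: remoteTimeout is the budget behind ISPN000476. It should be larger than
        // the total failure-detection time, so that requests targeting a dead node fail
        // over to a new topology instead of timing out.
        ConfigurationBuilder cache = new ConfigurationBuilder();
        cache.clustering()
             .cacheMode(CacheMode.DIST_SYNC)
             .remoteTimeout(15_000)                              // ms, example value
             .partitionHandling()
             .whenSplit(PartitionHandling.ALLOW_READ_WRITES);    // availability over consistency

        try (DefaultCacheManager cm = new DefaultCacheManager(global.build())) {
            cm.defineConfiguration("fs.war", cache.build());
            cm.getCache("fs.war").put("probe", "ok");            // simple smoke test
        }
    }
}
{code}

As far as I understand the usual tuning guidance, the sum of the FD and VERIFY_SUSPECT timeouts in the JGroups stack should stay well below the cache's remote-timeout, so that a crashed node like app1 is suspected, verified and dropped from the view before RPCs to it start timing out.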
Cluster is broken after one node is down
----------------------------------------
Key: ISPN-12652
URL: https://issues.redhat.com/browse/ISPN-12652
Project: Infinispan
Issue Type: Bug
Affects Versions: 9.4.14.Final
Reporter: Dmitry Kruglikov
Priority: Blocker
--
This message was sent by Atlassian Jira
(v8.13.1#813001)