[JBoss JIRA] (ISPN-7489) org.jgroups.protocols.TCP emits errors when node leaves the cluster
by Sebastian Łaskawiec (JIRA)
[ https://issues.jboss.org/browse/ISPN-7489?page=com.atlassian.jira.plugin.... ]
Sebastian Łaskawiec commented on ISPN-7489:
-------------------------------------------
According to [Kubernetes documentation|https://kubernetes.io/docs/user-guide/pods/#termination-of-pods]:
{quote}
(simultaneous with 3), Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly can continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
{quote}
So Kubernetes doesn't cut the pod off from the network. Perhaps we're doing that ourselves?
> org.jgroups.protocols.TCP emits errors when node leaves the cluster
> -------------------------------------------------------------------
>
> Key: ISPN-7489
> URL: https://issues.jboss.org/browse/ISPN-7489
> Project: Infinispan
> Issue Type: Bug
> Components: Cloud Integrations, Core
> Affects Versions: 9.0.0.CR1
> Environment: * OpenShift {{v1.5.0-alpha.2+e4b43ee}}
> * Custom Infinispan Server build (based on [these instructions|https://github.com/slaskawi/infinispan-1/tree/custom_image]). SHA1 {{2b0731b21649a88a75ed71d21b9cc06ba365e947}}
> Reporter: Sebastian Łaskawiec
>
> When I was performing [Spring Session and Kubernetes Rolling Update demo|https://bluejeans.com/s/pYKUg/] I encountered a couple of problems.
> One of them is this:
> {noformat}
> [transactions-repository-1-04x09] 18:09:12,193 ERROR [org.jgroups.protocols.TCP] (jgroups-30,transactions-repository-1-04x09) JGRP000029: transactions-repository-1-04x09: failed sending message to transactions-repository-1-4z05w (71 bytes): java.net.SocketTimeoutException: connect timed out, headers: GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=5262, TP: [cluster_name=cluster]
> [transactions-repository-1-1f8dx] 18:09:12,310 ERROR [org.jgroups.protocols.TCP] (jgroups-16,transactions-repository-1-1f8dx) JGRP000029: transactions-repository-1-1f8dx: failed sending message to transactions-repository-1-4z05w (71 bytes): java.net.SocketTimeoutException: connect timed out, headers: GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=6259, TP: [cluster_name=cluster]
> [transactions-repository-1-04x09] 18:09:12,997 ERROR [org.jgroups.protocols.TCP] (jgroups-22,transactions-repository-1-04x09) JGRP000029: transactions-repository-1-04x09: failed sending message to transactions-repository-1-4z05w (71 bytes): java.net.SocketTimeoutException: connect timed out, headers: GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=5262, TP: [cluster_name=cluster]
> [transactions-repository-1-1f8dx] 18:09:13,113 ERROR [org.jgroups.protocols.TCP] (jgroups-16,transactions-repository-1-1f8dx) JGRP000029: transactions-repository-1-1f8dx: failed sending message to transactions-repository-1-4z05w (71 bytes): java.net.SocketTimeoutException: connect timed out, headers: GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=6259, TP: [cluster_name=cluster]
> {noformat}
> Full logs from the Rolling Update process can be found here: https://gist.github.com/slaskawi/530241bb695f1f490bcb25eabaf9d676
> Steps to reproduce:
> * Start a local OpenShift cluster
> * Invoke `./init_infrastructure.sh` from https://github.com/slaskawi/presentations/tree/ISPN-7487-reproducer
> * Invoke `cd transaction-creator && mvn fabric8:run`
> * Do the rolling update: `oc deploy transactions-repository --latest -n myproject`
> * Observe the logs: `kubetail -l environment=infrastructure`
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
9 years, 1 month
[JBoss JIRA] (ISPN-7489) org.jgroups.protocols.TCP emits errors when node leaves the cluster
by Sebastian Łaskawiec (JIRA)
[ https://issues.jboss.org/browse/ISPN-7489?page=com.atlassian.jira.plugin.... ]
Sebastian Łaskawiec commented on ISPN-7489:
-------------------------------------------
[~belaban] explained to me what is causing these problems:
{quote}
you're trying to connect to a server but are not able to do so within TCP.sock_conn_timeout ms (default: 2000). The SocketTimeoutExceptions are all on socket.connect().
{quote}
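To make the quoted explanation concrete, here is a minimal sketch of a connect attempt with a timeout, using plain {{java.net.Socket}} rather than JGroups itself. The class name {{ConnectProbe}}, the {{Result}} enum, and the default port 7600 are assumptions made for illustration; only the 2000 ms value comes from the quote above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, not part of JGroups or Infinispan.
public class ConnectProbe {
    public enum Result { CONNECTED, TIMED_OUT, FAILED }

    // Tries to open a TCP connection, waiting at most timeoutMs for connect() to complete.
    public static Result probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return Result.CONNECTED;
        } catch (SocketTimeoutException e) {
            // Same exception type as in the JGRP000029 log lines ("connect timed out").
            return Result.TIMED_OUT;
        } catch (IOException e) {
            // Any other connect failure, e.g. connection refused or no route to host.
            return Result.FAILED;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // 7600 is an assumed port for this sketch; pass the real one as the second argument.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7600;
        // 2000 ms mirrors the sock_conn_timeout default mentioned in the quote.
        System.out.println(host + ":" + port + " -> " + probe(host, port, 2000));
    }
}
```

Run from a surviving pod against the address of a terminating one, a {{TIMED_OUT}} result would suggest the connect attempt is being silently dropped, whereas {{FAILED}} would point to an active refusal.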
So it seems the old nodes (which are meant to be killed) are still in the cluster view. Further investigation shows that something is wrong with the node shutdown procedure:
{code}
[transactions-repository-1-5zjqt] *** JBossAS process (84) received TERM signal ***
[transactions-repository-1-40mc9] [GC (Allocation Failure) 619101K->324299K(1013632K), 0.0197678 secs]
[transactions-repository-1-40mc9] 12:58:16,981 ERROR [org.jgroups.protocols.TCP] (jgroups-25,transactions-repository-1-40mc9) JGRP000029: transactions-repository-1-40mc9: failed sending message to transactions-repository-1-5zjqt (71 bytes): java.net.SocketTimeoutException: connect timed out, headers: GMS: GmsHeader[VIEW_ACK], UNICAST3: DATA, seqno=3994, TP: [cluster_name=cluster]
{code}
Node {{transactions-repository-1-5zjqt}} receives the TERM signal (Kubernetes asks it to shut down gracefully), and from then on all messages sent to that node fail.
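One way to check that the failures really concentrate on the terminated node is to tally the JGRP000029 lines by target. The sketch below is a hypothetical helper ({{SendFailureTally}} is not an existing tool) and assumes the log format shown in the excerpts in this issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for analysing the collected pod logs.
public class SendFailureTally {
    // Captures the sender (group 1) and the target (group 2) of a failed send.
    private static final Pattern FAILED_SEND =
        Pattern.compile("JGRP000029: (\\S+): failed sending message to (\\S+) \\(");

    // Counts failed sends per target node; lines that do not match are ignored.
    public static Map<String, Integer> countByTarget(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = FAILED_SEND.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Lines taken from the log excerpt above.
        List<String> lines = List.of(
            "[transactions-repository-1-5zjqt] *** JBossAS process (84) received TERM signal ***",
            "[transactions-repository-1-40mc9] [GC (Allocation Failure) 619101K->324299K(1013632K), 0.0197678 secs]",
            "[transactions-repository-1-40mc9] 12:58:16,981 ERROR [org.jgroups.protocols.TCP] "
                + "(jgroups-25,transactions-repository-1-40mc9) JGRP000029: transactions-repository-1-40mc9: "
                + "failed sending message to transactions-repository-1-5zjqt (71 bytes): "
                + "java.net.SocketTimeoutException: connect timed out");
        System.out.println(countByTarget(lines));
    }
}
```

If every counted target turns out to be a pod that has already received the TERM signal, that would support the reading that the old nodes stay in the cluster view after they stop accepting connections.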