[jboss-jira] [JBoss JIRA] (WFWIP-176) Pod restarted because of failing liveness/readiness Probe

Tue Aug 20 06:01:01 EDT 2019

    [ https://issues.jboss.org/browse/WFWIP-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772868#comment-13772868 ] 

Martin Choma commented on WFWIP-176:
------------------------------------

[~brian.stansberry]

It is very likely related, but I still do not fully understand it. In the test the configuration is bad, so livenessProbe.sh cannot pass. What I still wonder is why I cannot see the failing livenessProbe.sh in the events. It seems the response of livenessProbe.sh is not processed by OpenShift at all. After 6 minutes `context deadline exceeded` shows up in the events, and after 10 minutes the pod is killed by the deploy pod.

Events from CD16
{code}
11:58:33 AM 	Normal 	Killing  	Killing container with id docker://weirdusername:Need to kill Pod
11:58:32 AM 	Warning 	Unhealthy  	Liveness probe failed:
11:58:32 AM 	Warning 	Unhealthy  	Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped
11:58:32 AM 	Warning 	Unhealthy  	Readiness probe failed:
11:54:34 AM 	Warning 	Unhealthy  	Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
11:54:34 AM 	Warning 	Unhealthy  	Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
11:48:30 AM 	Normal 	Started  	Started container
11:48:29 AM 	Normal 	Created  	Created container
11:48:29 AM 	Normal 	Pulled  	Successfully pulled image "docker-registry.default.svc:5000/mchoma-xtf-builds/ha-servlet-counter-eap-cd@sha256:7f697f40699251c2d9df69c6622255dde212554ad22d7705842895fee204ffef"
11:48:29 AM 	Normal 	Pulling  	pulling image "docker-registry.default.svc:5000/mchoma-xtf-builds/ha-servlet-counter-eap-cd@sha256:7f697f40699251c2d9df69c6622255dde212554ad22d7705842895fee204ffef"
11:48:11 AM 	Normal 	Scheduled  	Successfully assigned mchoma/weirdusername-1-g5jkk to multinode-nfs-mchoma-001-node-2
{code}
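
The gaps mentioned above (about 6 minutes until `context deadline exceeded`, about 10 minutes until the pod is killed) can be read directly off the event timestamps. Below is a minimal Python sketch of that arithmetic. The helper names `parse` and `elapsed_since_start` are hypothetical, the event messages are abbreviated, and all timestamps are assumed to fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the CD16 event list above (same day assumed).
EVENTS = [
    ("11:48:30 AM", "Started container"),
    ("11:54:34 AM", "Liveness probe errored: context deadline exceeded"),
    ("11:58:32 AM", "Liveness probe failed"),
    ("11:58:33 AM", "Killing container"),
]


def parse(ts):
    """Parse a 12-hour clock time such as '11:48:30 AM'."""
    return datetime.strptime(ts, "%I:%M:%S %p")


def elapsed_since_start(events):
    """Return (seconds since the first event, message) for each event."""
    start = parse(events[0][0])
    return [(int((parse(ts) - start).total_seconds()), msg) for ts, msg in events]


for seconds, msg in elapsed_since_start(EVENTS):
    print(f"+{seconds // 60:02d}m{seconds % 60:02d}s  {msg}")
```

The first probe error appears 364 seconds after the container started, and the kill follows at 603 seconds. The second figure would be consistent with a 600-second deployment timeout, assuming the deployment strategy used that value; the events alone do not confirm it.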

User impact is minimal: in both cases the user is unable to deploy the app, so I think we can ignore it. I will just change the test behaviour.

> Pod restarted because of failing liveness/readiness Probe
> ---------------------------------------------------------
>
>                 Key: WFWIP-176
>                 URL: https://issues.jboss.org/browse/WFWIP-176
>             Project: WildFly WIP
>          Issue Type: Bug
>          Components: OpenShift
>            Reporter: Martin Choma
>            Assignee: Ken Wills
>            Priority: Major
>
> While testing the 73 image I came across a test that exercises a real corner case [0].
> The test does not use templates for deployment.
> In the tested scenario the liveness/readiness probe fails. In CD 17 and EAP 73 the pod is restarted. In CD 16, however, there were no liveness/readiness failures in the events, and the pod was not restarted.
> I don't see any differences in the pod YAML. This is the CD 16 case
> {code}
>       livenessProbe:
>         exec:
>           command:
>             - /bin/bash
>             - '-c'
>             - /opt/eap/bin/livenessProbe.sh
>         failureThreshold: 3
>         periodSeconds: 10
>         successThreshold: 1
>         timeoutSeconds: 1
>       name: weirdusername
>       readinessProbe:
>         exec:
>           command:
>             - /bin/bash
>             - '-c'
>             - /opt/eap/bin/readinessProbe.sh
>         failureThreshold: 3
>         periodSeconds: 10
>         successThreshold: 1
>         timeoutSeconds: 1
> {code}
> and CD 17 case
> {code}
>       livenessProbe:
>         exec:
>           command:
>             - /bin/bash
>             - '-c'
>             - /opt/eap/bin/livenessProbe.sh
>         failureThreshold: 3
>         periodSeconds: 10
>         successThreshold: 1
>         timeoutSeconds: 1
>       name: weirdusername
>       readinessProbe:
>         exec:
>           command:
>             - /bin/bash
>             - '-c'
>             - /opt/eap/bin/readinessProbe.sh
>         failureThreshold: 3
>         periodSeconds: 10
>         successThreshold: 1
>         timeoutSeconds: 1
> {code}
> What could cause this behaviour change? 
> [0] https://issues.jboss.org/browse/CLOUD-1988
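
For reference, the probe settings quoted above are the same for both images. Under those settings a probe that fails on every run would nominally reach the failure threshold after roughly failureThreshold × periodSeconds = 30 seconds, far sooner than the 6 minutes before the first probe event on CD 16. A rough Python sketch of that check follows. The `Probe` class and `nominal_failure_window` helper are hypothetical, and the estimate ignores any initial delay and the time each probe takes to run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    # Field names mirror the pod YAML keys quoted above.
    failure_threshold: int
    period_seconds: int
    success_threshold: int
    timeout_seconds: int


def nominal_failure_window(probe):
    """Approximate seconds of consecutive failures before the threshold is hit.

    Simplification: one probe run per period, each failure reported promptly.
    """
    return probe.failure_threshold * probe.period_seconds


# Values taken from the CD 16 and CD 17 pod YAML in the issue description.
cd16 = Probe(failure_threshold=3, period_seconds=10, success_threshold=1, timeout_seconds=1)
cd17 = Probe(failure_threshold=3, period_seconds=10, success_threshold=1, timeout_seconds=1)

print("identical settings:", cd16 == cd17)
print("nominal window (s):", nominal_failure_window(cd16))
```

Since the settings compare equal, the difference in behaviour would have to come from somewhere other than the probe definition, for example how the probe script responds inside each image.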

--
This message was sent by Atlassian Jira
(v7.12.1#712002)