[JBoss JIRA] (SWSQE-107) Openshift Cluster Allocation
by Guilherme Baufaker Rêgo (JIRA)
[ https://issues.jboss.org/browse/SWSQE-107?page=com.atlassian.jira.plugin.... ]
Guilherme Baufaker Rêgo commented on SWSQE-107:
-----------------------------------------------
I think we would go for b21 for production - testing won't be allowed on this cluster, just running production PRs, then would have our openshift cluster for testing.
I have created b11 and b12 based on the premise that "having more openshift cluster would be better than having more performant clusters, in other …
[View More]words, huge clusters)".
I don't know if that premise changed and the for running the testing we would bigger clusters than minimal configuration 1 blade = 1 cluster.
Other options are OSOOS (right now, we are limited by 3 small clusters - I think) and Foreman VMS (we would need to have at least 16GB, 4 cores and 50GB of ram available in order to run a small cluster on foreman - (openshift cluster 2 is running on foreman for quite some time - https://sws-openshift.bc.jonqe.lab.eng.bos.redhat.com:8443/)
> Openshift Cluster Allocation
> ----------------------------
>
> Key: SWSQE-107
> URL: https://issues.jboss.org/browse/SWSQE-107
> Project: Kiali QE
> Issue Type: QE Task
> Reporter: Matt Mahoney
> Assignee: Matt Mahoney
>
> We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
> On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing + Overnight-Testing).
> b21 -> Currently general usage for pipeline automation and manual testing
> b12 -> Newly created, and unused
> b11 -> Newly created, and unused
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[View Less]
6 years, 11 months
[JBoss JIRA] (SWSQE-107) Openshift Cluster Allocation
by Matt Mahoney (JIRA)
[ https://issues.jboss.org/browse/SWSQE-107?page=com.atlassian.jira.plugin.... ]
Matt Mahoney updated SWSQE-107:
-------------------------------
Description:
We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing+Overnight-Testing).
b21 -> Currently general usage for pipeline automation and manual testing
b12 -&…
[View More]gt; Newly created, and unused
b11 -> Newly created, and unused
was:
We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing).
b21 -> Currently general usage for pipeline automation and manual testing
b12 -> Newly created, and unused
b11 -> Newly created, and unused
> Openshift Cluster Allocation
> ----------------------------
>
> Key: SWSQE-107
> URL: https://issues.jboss.org/browse/SWSQE-107
> Project: Swift Sunshine QE
> Issue Type: QE Task
> Reporter: Matt Mahoney
> Assignee: Matt Mahoney
>
> We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
> On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing+Overnight-Testing).
> b21 -> Currently general usage for pipeline automation and manual testing
> b12 -> Newly created, and unused
> b11 -> Newly created, and unused
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[View Less]
6 years, 11 months
[JBoss JIRA] (SWSQE-107) Openshift Cluster Allocation
by Matt Mahoney (JIRA)
[ https://issues.jboss.org/browse/SWSQE-107?page=com.atlassian.jira.plugin.... ]
Matt Mahoney updated SWSQE-107:
-------------------------------
Description:
We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing + Overnight-Testing).
b21 -> Currently general usage for pipeline automation and manual testing
b12 -…
[View More]> Newly created, and unused
b11 -> Newly created, and unused
was:
We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing+Overnight-Testing).
b21 -> Currently general usage for pipeline automation and manual testing
b12 -> Newly created, and unused
b11 -> Newly created, and unused
> Openshift Cluster Allocation
> ----------------------------
>
> Key: SWSQE-107
> URL: https://issues.jboss.org/browse/SWSQE-107
> Project: Swift Sunshine QE
> Issue Type: QE Task
> Reporter: Matt Mahoney
> Assignee: Matt Mahoney
>
> We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
> On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing + Overnight-Testing).
> b21 -> Currently general usage for pipeline automation and manual testing
> b12 -> Newly created, and unused
> b11 -> Newly created, and unused
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[View Less]
6 years, 11 months
[JBoss JIRA] (SWSQE-107) Openshift Cluster Allocation
by Matt Mahoney (JIRA)
[ https://issues.jboss.org/browse/SWSQE-107?page=com.atlassian.jira.plugin.... ]
Matt Mahoney updated SWSQE-107:
-------------------------------
Description:
We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing).
b21 -> Currently general usage for pipeline automation and manual testing
b12 -> Newly created, …
[View More]and unused
b11 -> Newly created, and unused
was:
We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
On one of these clusters, we will want to have our 'Production' CI pipeline.
b21 -> Currently general usage for pipeline automation and manual testing
b12 -> Newly created, and unused
b11 -> Newly created, and unused
> Openshift Cluster Allocation
> ----------------------------
>
> Key: SWSQE-107
> URL: https://issues.jboss.org/browse/SWSQE-107
> Project: Swift Sunshine QE
> Issue Type: QE Task
> Reporter: Matt Mahoney
> Assignee: Matt Mahoney
>
> We now have (3) Openshift clusters, for which we will want to define use allocation for each (aka: how will these cluster be used).
> On one of these clusters, we will want to have our 'Production' CI pipeline (Production=PR-Testing).
> b21 -> Currently general usage for pipeline automation and manual testing
> b12 -> Newly created, and unused
> b11 -> Newly created, and unused
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[View Less]
6 years, 11 months
[JBoss JIRA] (JGRP-2260) UNICAST3 doesn't remove dead nodes from its tables
by Rich DiCroce (JIRA)
[ https://issues.jboss.org/browse/JGRP-2260?page=com.atlassian.jira.plugin.... ]
Rich DiCroce commented on JGRP-2260:
------------------------------------
I've found the problem.
It stems from the conn_expiry_timeout being set to 0. When that is the case, max_retransmit_time is not applied because the node gets removed from the AgeOutCache when the view changes, and isn't re-added by getSenderEntry(). So the connection is never expired, which means it never goes into CLOSING state and can …
[View More]never transition to CLOSED. I'm not sure if that's a bug or not. On one hand, it effectively creates a memory leak when a node leaves the cluster. On the other hand, a value of 0 is supposed to disable connection reaping, which is exactly what happens. I can give you a modified version of your test that demonstrates all of that, if you want to take a closer look.
But that discovery raised another question, because conn_expiry_timeout doesn't default to 0 and I'm not setting it. Turns out WildFly is "helpfully" doing that for me. Digging through Git blame and JIRA tickets points to ISPN-2805 as the culprit. It looks like Infinispan started setting it to 0 to deal with deficiencies in the first two versions of UNICAST, and WildFly applied the same fix. ISPN-2805 points to JGRP-1586, in which you left a comment saying the problem would be handled by UNICAST3.
So, is UNICAST3 immune to the problem described in JGRP-1586? If yes, it should be safe for me to set conn_expiry_timeout to a non-zero value, and the WildFly and Infinispan teams should probably be made aware of this. Otherwise, I'm not sure what to do about this.
> UNICAST3 doesn't remove dead nodes from its tables
> --------------------------------------------------
>
> Key: JGRP-2260
> URL: https://issues.jboss.org/browse/JGRP-2260
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 4.0.10
> Environment: WildFly 12.0.0.Final
> Reporter: Rich DiCroce
> Assignee: Bela Ban
>
> Scenario: 2 WildFly instances clustered together. A ForkChannel is defined, with a MessageDispatcher on top. I start both nodes, then stop the second one. 6-7 minutes after stopping the second node, I start getting log spam on the first node:
> {quote}
> 12:47:04,519 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> 12:47:06,522 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> 12:47:08,524 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> {quote}
> After some debugging, I discovered that the reason is because UNICAST3 is still trying to retransmit to the dead node. Its send_table still contains an entry for the dead node with state OPEN.
> After looking at the source code for UNICAST3, I have a theory about what's happening.
> * When a node leaves the cluster, down(Event) gets invoked with a view change, which calls closeConnection(Address) for each node that left. That sets the connection state to CLOSING.
> * Suppose that immediately after the view change is handled, a message with the dead node as its destination gets passed to down(Message). That invokes getSenderEntry(Address), which finds the connection... and sets the state back to OPEN.
> Consequently, the connection is never closed or removed from the table, so retransmit attempts continue forever even though they will never succeed.
> This issue is easily reproducible for me, although unfortunately I can't give you the application in question. But if you have fixes you want to try, I'm happy to drop in a patched JAR and see if the issue still happens.
> This is my JGroups subsystem configuration:
> {code:xml}
> <subsystem xmlns="urn:jboss:domain:jgroups:6.0">
> <channels default="ee">
> <channel name="ee" stack="main">
> <fork name="shared-dispatcher"/>
> <fork name="group-topology"/>
> </channel>
> </channels>
> <stacks>
> <stack name="main">
> <transport type="UDP" socket-binding="jgroups" site="${gp.site:DEFAULT}"/>
> <protocol type="PING"/>
> <protocol type="MERGE3">
> <property name="min_interval">
> 1000
> </property>
> <property name="max_interval">
> 5000
> </property>
> </protocol>
> <protocol type="FD_SOCK"/>
> <protocol type="FD_ALL2">
> <property name="interval">
> 3000
> </property>
> <property name="timeout">
> 8000
> </property>
> </protocol>
> <protocol type="VERIFY_SUSPECT"/>
> <protocol type="pbcast.NAKACK2"/>
> <protocol type="UNICAST3"/>
> <protocol type="pbcast.STABLE"/>
> <protocol type="pbcast.GMS">
> <property name="join_timeout">
> 100
> </property>
> </protocol>
> <protocol type="UFC"/>
> <protocol type="MFC"/>
> <protocol type="FRAG3"/>
> </stack>
> </stacks>
> </subsystem>
> {code}
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
[View Less]
6 years, 11 months