[wildfly-dev] Re: WildFly Cloud Tests

Friday, 6 September 2024

Nice!

If we ever revisit moving these from wildfly-extras, it occurs to me that
while that would add 3 more relatively long-running jobs per PR, what
you've done here eliminates one: the current job that does the polling. So
it's a net add of only 2 jobs, not 3.

On Fri, Sep 6, 2024 at 8:07 AM Kabir Khan <kkhan(a)redhat.com> wrote:

...
 I have improved the error reporting mechanism now.

 I did this in two phases:
 * https://github.com/wildfly/wildfly/pull/18172 +
 https://github.com/wildfly-extras/wildfly-cloud-tests/pull/194
 introduced a new mechanism
 * https://github.com/wildfly/wildfly/pull/18174 +
 https://github.com/wildfly-extras/wildfly-cloud-tests/pull/195 removed
 the old mechanism

 Previously the Cloud Tests Trigger workflow triggered by WildFly PRs would
 wait until the test on the wildfly-cloud-tests side had completed. The
 remote job would communicate the status of the job via a push to a branch
 that the trigger was polling and monitoring for the commit with the status.
 I was never happy with this approach, and came across another way while
 adding CI somewhere else.

 What happens now is the Cloud Tests Trigger issues a repository dispatch
 against the WildFly Cloud Tests repository. This is the same as before, but
 now it returns immediately after the dispatch.
 The dispatch is done by the cloud-test-pr-trigger.yml workflow

<https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>,
 and the *trigger-cloud-tests-pr* event is handled on the cloud tests side
 by the wildfly-pull-request-runner.yml workflow

<https://github.com/wildfly-extras/wildfly-cloud-tests/blob/main/.github/w...
 .
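
As a rough illustration of the dispatch step described above, this Python sketch builds (but does not send) the request GitHub's REST API expects for a repository dispatch: a POST to /repos/{owner}/{repo}/dispatches carrying an event_type and an optional client_payload. The event type trigger-cloud-tests-pr is the one named above. The payload keys (pr, sha), their values and the token handling are assumptions for illustration, not the actual contents of cloud-test-pr-trigger.yml.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, event_type, payload, token):
    """Build (without sending) a repository dispatch request."""
    body = json.dumps({"event_type": event_type, "client_payload": payload})
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/dispatches",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical payload: the real workflow may pass different fields.
request = build_dispatch_request(
    owner="wildfly-extras",
    repo="wildfly-cloud-tests",
    event_type="trigger-cloud-tests-pr",
    payload={"pr": 12345, "sha": "0123abc"},
    token="<token with access to the target repository>",
)
# urllib.request.urlopen(request) would send it; the caller does not wait
# for the remote workflow, which is why the trigger can return immediately.
```

Because the dispatch call returns as soon as GitHub has accepted the event, the triggering job has nothing left to poll for.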

 The first thing that the wildfly-pull-request-runner.yml workflow does is
 a repository dispatch back to the wildfly repository to set the status
 of the job as pending

<https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>.
 The *report-cloud-tests-pr-pending* event type is handled on the
 wildfly side by cloud-test-pr-reporter.yml

<https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>,
 which executes a call to add the status.
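
The "call to add the status" is not spelled out above. Assuming it uses GitHub's commit status endpoint (POST /repos/{owner}/{repo}/statuses/{sha}), a minimal Python sketch of the pending report could look like the following. The context name 'Cloud Tests Remote Run' comes from this thread; the description text, the sha and the run id are placeholders.

```python
import json

# States accepted by GitHub's commit status endpoint.
VALID_STATES = {"error", "failure", "pending", "success"}


def build_status_call(owner, repo, sha, state, run_id):
    """Return (path, json_body) for a commit status call; nothing is sent."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported commit status state: {state}")
    path = f"/repos/{owner}/{repo}/statuses/{sha}"
    body = {
        "state": state,
        "context": "Cloud Tests Remote Run",
        # Placeholder wording, not the text used by the real workflow.
        "description": "Remote run in wildfly-cloud-tests",
        # The 'Details' link on the PR points at the remote workflow run.
        "target_url": (
            "https://github.com/wildfly-extras/wildfly-cloud-tests"
            f"/actions/runs/{run_id}"
        ),
    }
    return path, json.dumps(body)


path, body = build_status_call("wildfly", "wildfly", "0123abc", "pending", 42)
```

The same call with a different state would serve for the final report once the remote run has finished.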

 Once this is done, in the original PR, we see the 'Cloud Tests Trigger'
 job has completed, and there is a new entry called 'Cloud Tests Remote
 Run', which is in the pending status:
 [image: Screenshot 2024-09-06 at 13.51.50.png]

 The 'Details' link for 'Cloud Tests Remote Run' takes you to the workflow
 run on the wildfly-cloud-tests side.

 Once all the tests are run, the cloud tests
 wildfly-pull-request-runner.yml reports the job status back to wildfly,
 with another repository dispatch

<https://github.com/wildfly-extras/wildfly-cloud-tests/blob/main/.github/w...>.
 Again, the *report-cloud-tests-pr-complete* event type is handled on the
 wildfly side by cloud-test-pr-reporter.yml

<https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>,
 which executes a call to update the status for the job on the PR. In this
 case the job passed 🥳:
 [image: Screenshot 2024-09-06 at 14.45.28.png]
 As before the 'Details' link takes you back to the job run on the cloud
 tests side.

 A small niggle is that the concurrency check on the WildFly cloud tests
 side will cancel all running jobs.

What's the concurrency check?

Ah, as I write my brain guesses that it's the thing that happens if the PR
branch is pushed again while jobs for a previous push are still running.

 This currently causes the status reported back to be 'failed' due to
 something I still need to figure out. Ideally that should be cancelled.
 However, this is a bit of a corner case, since once the job is cancelled,
 and the new job starts the status will correctly be reported as 'pending'
 again.
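
For reference, GitHub's commit status endpoint accepts only the states error, failure, pending and success, while check runs additionally support a cancelled conclusion. Which of the two the reporter workflow uses is not stated above. Assuming commit statuses, a cancelled run has to be mapped onto one of those four states. The Python sketch below shows one way to do that; the mapping and the description strings are hypothetical, not what the project does.

```python
# The four states accepted by GitHub's commit status endpoint.
COMMIT_STATUS_STATES = ("error", "failure", "pending", "success")

# Hypothetical mapping from the remote job's final status to a commit status.
# Reporting "cancelled" as "error" is an illustrative choice only.
_JOB_TO_STATUS = {
    "success": ("success", "Cloud tests passed"),
    "failure": ("failure", "Cloud tests failed"),
    "cancelled": ("error", "Cloud tests run was cancelled"),
}


def to_commit_status(job_status):
    """Translate a job status into a (state, description) pair."""
    state, description = _JOB_TO_STATUS.get(
        job_status, ("error", f"Unexpected job status: {job_status}")
    )
    assert state in COMMIT_STATUS_STATES
    return state, description


for status in ("success", "failure", "cancelled"):
    print(status, "->", to_commit_status(status))
```

With a mapping like this, a cancelled run would at least be distinguishable from a genuine test failure through its description until the next run resets the status to pending.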

Yeah, doesn't sound like a big deal. <knocks-on-wood/>

...
 Thanks,

 Kabir

 On Wed, 4 Sept 2024 at 18:00, Kabir Khan <kkhan(a)redhat.com> wrote:

> I've implemented the space saving part, and I now think the tests can
> remain where they are.
>
> I found that with the Kubernetes registry enabled I was able to push and
> pull images from it. If I disable it and enable it again, the images I
> pushed before restarting are no longer there. So it seems this cleans up
> the registry, and should give a big space saving.
>
> I added a kubernetes-ci profile used by the GitHub Actions workflow,
> which enables the registry before each test is run, and disables it after
> it is run [1]. Here I also clean the image for each test from the local
> docker registry, although here the space saving is less (I believe it is
> just a layer containing the test deployment on top of the pre-built server
> images).
>
> For now I am keeping the server images built early on via the -Pimages
> flag, since I think the space saving from pruning the Kubernetes repository
> should be good enough. If this turns out to be a problem once we have a
> lot more tests and server images, I think I can do something in the
> scripts called by the kubernetes-ci profile to build those on demand, and
> remove them after the tests have completed.
>
> The next step will be to look at the improved reporting back to the
> WildFly PR I mentioned.
>
> [1] - https://github.com/wildfly-extras/wildfly-cloud-tests/pull/192
>
> On Thu, 29 Aug 2024 at 15:30, Brian Stansberry <
> brian.stansberry(a)redhat.com> wrote:
>
>>
>>
>> On Thu, Aug 29, 2024 at 4:57 AM Kabir Khan <kkhan(a)redhat.com> wrote:
>>
>>> Ok, I vaguely thought about that too...
>>>
>>> I can keep them in wildfly-extras for now, and improve the reporting as
>>> mentioned, and then look into how to deal with the space issue. I guess on
>>> the wildfly-extras side it will be a trigger job calling out to the other
>>> ones, so the overall status report probably will not be as tricky as I
>>> imagined.
>>>
>>
>> Ok, good.
>>
>> An overall CI execution for a PR takes about 4.5 hours, due to the
>> Windows jobs on TeamCity, so even if GH-action-based jobs ended up queuing
>> sometimes it's unlikely to delay the entire PR cycle. These jobs take about
>> 20 minutes and other ones we run should be faster. So really we shouldn't
>> block moving things to wildfly. But optimizing any jobs that run in the
>> wildfly GH org is important.
>>
>>
>>> On Wed, 28 Aug 2024 at 16:53, Brian Stansberry <
>>> brian.stansberry(a)redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Aug 28, 2024 at 5:50 AM Kabir Khan <kkhan(a)redhat.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> These tests need some modernisation, and there are two things in my
>>>>> opinion that need addressing.
>>>>>
>>>>> *1 Space issues*
>>>>> Recently we were running out of space when running these tests. James
>>>>> fixed this by deleting the built WildFly, but when trying to resurrect an
>>>>> old PR I had forgotten all about, we ran out of space again.
>>>>>
>>>>> I believe the issue is the way the tests work at the moment,
>>>>> which is to:
>>>>> * Start minikube with the registry
>>>>> * Build all the test images
>>>>> * Run all the tests
>>>>>
>>>>> Essentially we end up building all the server images (different
>>>>> layers) before running the tests, which takes space, and then each test
>>>>> installs the image into minikube's registry. Some tests also install
>>>>> other images (e.g. postgres, strimzi) into the minikube instance.
>>>>>
>>>>> My initial thought was that it would be good to build the server
>>>>> images more on demand, rather than before the tests, and to be able to call
>>>>> 'docker system prune' now and again.
>>>>>
>>>>> However, this does not take into account the minikube registry, which
>>>>> will also accumulate a lot of images. It will at least become populated
>>>>> with the test images; I am unsure whether it also becomes populated with
>>>>> the images pulled from elsewhere (e.g. postgres, strimzi etc.).
>>>>>
>>>>> If `minikube addons disable registry` followed by `minikube addons
>>>>> enable registry` deletes the registry contents from the disk, having a hook
>>>>> to do that between each test could be something easy to look into. Does
>>>>> anyone know if this is the case?
>>>>>
>>>>> An alternative could be to have one job building wildfly, and
>>>>> uploading the maven repository as an artifact, and then have separate jobs
>>>>> to run each test (or perhaps set of tests requiring the same WildFly server
>>>>> image). However, as this test is quite fiddly since it runs remotely, I'm
>>>>> not sure how the reporting would look.
>>>>>
>>>>> *2 Pull request trigger*
>>>>> PRs in wildfly/wildfly execute a remote dispatch which results in the
>>>>> job getting run in the wildfly-extras/wildfly-cloud-tests repository.
>>>>>
>>>>> There is no reporting back from the
>>>>> wildfly-extras/wildfly-cloud-tests repository about the run id of the
>>>>> resulting run.
>>>>>
>>>>> What I did when I implemented this was to have the calling
>>>>> wildfly/wildfly job wait and poll a branch in
>>>>> wildfly-extras/wildfly-cloud-tests for the results of the job (IIRC I have
>>>>> a file with the triggering PR number). The job on the other side would then
>>>>> write to this branch once the job is done. Which is all quite ugly!
>>>>>
>>>>> However, playing in other repositories, I found
>>>>> https://www.kenmuse.com/blog/creating-github-checks/. Basically this
>>>>> would result in
>>>>> * the WildFly pull request trigger completing immediately once it has
>>>>> done the remote dispatch
>>>>> * When the wildfly-cloud-tests job starts it will do a remote
>>>>> dispatch to wildfly, which will get picked up by a workflow which can add a
>>>>> status check on the PR conversation page saying remote testing in
>>>>> wildfly-cloud-tests is in progress
>>>>> * Once the wildfly-cloud-tests job is done, it will do another remote
>>>>> dispatch to wildfly, which will update the status check with success/failure
>>>>>
>>>>> So we'd have two checks in the section rather than the current one.
>>>>>
>>>>>
>>>>> *Other ideas*
>>>>> While writing the above, the following occurred to me.
>>>>>
>>>>> The reason for the split is that the cloud test framework is quite
>>>>> involved, and IMO does not belong in WildFly. So the remote dispatch
>>>>> approach was used.
>>>>>
>>>>> However, I wonder now if a saner approach would be to update the
>>>>> wildfly-cloud-tests workflows to be reusable so they can be used from
>>>>> WildFly?
>>>>>
>>>>> That would allow the tests, test framework etc., and the workflow to
>>>>> continue to live in wildfly-cloud-tests, while running in wildfly itself.
>>>>> That should get rid of the remote dispatch issues, and make that side of
>>>>> things simpler.
>>>>>
>>>>> It does not address the space issue, but I think if this approach
>>>>> works, it will be easier to deal with the space issue.
>>>>>
>>>>
>>>> A downside is that this means the 3 actual test jobs (e.g.
>>>> https://github.com/wildfly-extras/wildfly-cloud-tests/actions/runs/105839...)
>>>> run using the wildfly GH org's set of runners.
>>>>
>>>> Relying on wildfly-extras to get around that is a hack though. But if
>>>> we're going to move these I think we need to optimize as much as possible,
>>>> e.g. not rebuild WildFly multiple times.
>>>>
>>>>>
>>>>>
>>>>> Any thoughts/insights are welcome.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kabir
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> wildfly-dev mailing list -- wildfly-dev(a)lists.jboss.org
>>>>> To unsubscribe send an email to wildfly-dev-leave(a)lists.jboss.org
>>>>> Privacy Statement: https://www.redhat.com/en/about/privacy-policy
>>>>> List Archives:
>>>>> https://lists.jboss.org/archives/list/wildfly-dev@lists.jboss.org/message...
>>>>>
>>>>
>>>>
>>>> --
>>>> Brian Stansberry
>>>> Principal Architect, Red Hat JBoss EAP
>>>> WildFly Project Lead
>>>> He/Him/His
>>>>
>>>
>>
>> --
>> Brian Stansberry
>> Principal Architect, Red Hat JBoss EAP
>> WildFly Project Lead
>> He/Him/His
>>
> 
-- 
Brian Stansberry
Principal Architect, Red Hat JBoss EAP
WildFly Project Lead
He/Him/His
