Thanks for the update, Kabir!
On Fri, Oct 18, 2024 at 10:23 AM Kabir Khan <kkhan(a)redhat.com> wrote:
If the wildfly-extras/wildfly-cloud-tests job fails, that is reported in
the 'Cloud Tests Report Run' job on the pull request.
However, in order to rerun it, go to the details of the 'Cloud Tests
Trigger' job in the pull request, and select the option to rerun the job
(hopefully non-admins see this option).
This will then run in wildfly/wildfly, re-calculate the required
information, and send that across to wildfly-extras/wildfly-cloud-tests via
a remote repository dispatch, so the remote job should have the proper
information.
On Tue, 1 Oct 2024 at 17:51, Kabir Khan <kkhan(a)redhat.com> wrote:
> Martin was working on something last week, which demonstrated that we can
> still run out of memory if we have a lot of HUGE images.
>
> So following up on the previous work where I cleared out the Kubernetes
> registry between tests by disabling and enabling it, I now do the same with
> the local Docker registry.
>
> https://github.com/wildfly-extras/wildfly-cloud-tests/pull/209
>
> Local development should work the same as it always did, where we:
> * Build all the images in the images/ folder
> * Run the tests using those images (which in turn creates an image per
> test)
>
> There is one small change: to get the new approach outlined below working, I
> had to remove the Maven dependencies from the tests on the image modules.
> However, looking at the "Reactor Sorting" section in
> https://maven.apache.org/guides/mini/guide-multiple-modules.html, if you
> want to both build images and run tests in one command, it looks like
>
> mvn -Pimages install
>
> will still work, since in the root pom images are listed before tests.
>
> On CI, I've changed things so that the 'Build all the images' step no
> longer happens.
>
> I've added some scripts to deal with the images.
>
> Before each test, we determine the Maven module from the images/ folder
> that provides the image needed for the test, and run a Maven build of that
> module, which makes the image available in the local Docker repository. After
> that, the test image (base image + deployment) can be built, exactly the way
> it has been working up to now.
>
> After the test is complete, we delete the test image and the base image,
> and move on to the next test.
>
> Now, we avoid both the Kubernetes and Docker registries getting bigger
> and bigger after each test.
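
The per-test flow described above could be sketched roughly like this. It is only an illustration, not the actual CI scripts: the function name `plan_test_run`, the module paths, the image names and the exact Maven arguments are all assumptions, and the sketch only builds the command lists rather than executing anything.

```python
from typing import List


def plan_test_run(image_module: str, base_image: str, test_image: str,
                  test_module: str) -> List[List[str]]:
    """Return the commands for one test, in order, without running them.

    All argument values are hypothetical; the real scripts may differ.
    """
    return [
        # Build only the image module that provides the base image.
        ["mvn", "-Pimages", "install", "-pl", f"images/{image_module}"],
        # Run the test, which builds the test image (base image + deployment).
        ["mvn", "verify", "-pl", test_module],
        # Remove both images so local Docker storage does not keep growing.
        ["docker", "rmi", test_image, base_image],
    ]


# Hypothetical names, purely for illustration.
for cmd in plan_test_run("example-image", "example/base:latest",
                         "example/test:latest", "tests/example"):
    print(" ".join(cmd))
```

Keeping the plan as plain data makes it easy to check the ordering (build, test, clean up) before wiring it to a real runner.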
>
> Thanks,
>
> Kabir
>
> On Mon, 9 Sept 2024 at 22:17, Brian Stansberry <
> brian.stansberry(a)redhat.com> wrote:
>
>>
>>
>> On Mon, Sep 9, 2024 at 3:36 PM Kabir Khan <kkhan(a)redhat.com> wrote:
>>
>>> Yeah we have three jobs running in parallel in the remote repository,
>>> although there could be more in the future, which the blocking job used to
>>> wait for.
>>>
>>> But IMO as long as the reporting stuff I added isn’t too fragile I
>>> actually prefer having them running in the cloud tests repo. My main
>>> concern was if we needed to split into more jobs to save space but I don’t
>>> think that is needed. And getting more familiar with this again, we’d just
>>> handle that the same way we do now with the cloud test reporter job waiting
>>> for the ones running the tests.
>>>
>>
>> +1
>>
>>
>>> I forgot to mention there is a PAT needed with the repository
>>> permission for the dispatches. That is stored as a secret in both repos,
>>> and used by the parts of the scripts doing the remote repository dispatch.
>>>
>>> On Fri, 6 Sept 2024 at 18:41, Brian Stansberry <
>>> brian.stansberry(a)redhat.com> wrote:
>>>
>>>> Nice!
>>>>
>>>> If we ever revisit moving these from wildfly-extras, it occurs to me
>>>> that while that would add 3 more relatively long-running jobs per PR, what
>>>> you've done here eliminates one: the current job that does the polling. So
>>>> it's a net add of only 2 jobs, not 3.
>>>>
>>>> On Fri, Sep 6, 2024 at 8:07 AM Kabir Khan <kkhan(a)redhat.com> wrote:
>>>>
>>>>> I have improved the error reporting mechanism now.
>>>>>
>>>>> I did this in two phases:
>>>>> * https://github.com/wildfly/wildfly/pull/18172 +
>>>>>   https://github.com/wildfly-extras/wildfly-cloud-tests/pull/194
>>>>>   introduced a new mechanism
>>>>> * https://github.com/wildfly/wildfly/pull/18174 +
>>>>>   https://github.com/wildfly-extras/wildfly-cloud-tests/pull/195
>>>>>   removed the old mechanism
>>>>>
>>>>> Previously the Cloud Tests Trigger workflow triggered by WildFly PRs
>>>>> would wait until the test on the wildfly-cloud-tests side had completed.
>>>>> The remote job would communicate the status of the job via a push to a
>>>>> branch that the trigger was polling and monitoring for the commit with the
>>>>> status. I was never happy with this approach, and came across another way
>>>>> while adding CI somewhere else.
>>>>>
>>>>> What happens now is the Cloud Tests Trigger issues a repository
>>>>> dispatch against the WildFly Cloud Tests repository. This is the same as
>>>>> before, but now it returns immediately after the dispatch.
>>>>> The dispatch is done by the cloud-test-pr-trigger.yml workflow
>>>>> <https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>,
>>>>> and the *trigger-cloud-tests-pr* event is handled on the cloud tests
>>>>> side by the wildfly-pull-request-runner.yml workflow
>>>>> <https://github.com/wildfly-extras/wildfly-cloud-tests/blob/main/.github/w...>.
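
For readers unfamiliar with repository dispatch, the request behind such a trigger might look roughly like the sketch below. It uses the GitHub REST endpoint `POST /repos/{owner}/{repo}/dispatches`; the helper name `build_dispatch_request` and the `client_payload` contents are assumptions, since the real payload fields are not shown in this thread. The request is only built, not sent.

```python
import json
import urllib.request


def build_dispatch_request(owner: str, repo: str, event_type: str,
                           client_payload: dict,
                           token: str) -> urllib.request.Request:
    """Build (but do not send) a GitHub repository dispatch request."""
    body = json.dumps({"event_type": event_type,
                       "client_payload": client_payload}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # The token would be the PAT stored as a repository secret.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_dispatch_request(
    "wildfly-extras", "wildfly-cloud-tests", "trigger-cloud-tests-pr",
    {"pr": 12345},  # hypothetical payload, for illustration only
    "<token>",
)
# urllib.request.urlopen(req) would send it; deliberately left out here.
```

Because the call returns as soon as GitHub accepts the event, the sender has no run id to wait on, which is why the status has to be reported back by a second dispatch.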
>>>>>
>>>>> The first thing that the wildfly-pull-request-runner.yml workflow
>>>>> does is a repository dispatch back to the wildfly repository to set the
>>>>> status of the job as pending
>>>>> <https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>.
>>>>> The *report-cloud-tests-pr-pending* event type is handled on the
>>>>> wildfly side by cloud-test-pr-reporter.yml
>>>>> <https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>,
>>>>> which executes a call to add the status.
>>>>>
>>>>> Once this is done, in the original PR, we see the 'Cloud Tests
>>>>> Trigger' job has completed, and there is a new entry called 'Cloud Tests
>>>>> Remote Run', which is in the pending status:
>>>>> [image: Screenshot 2024-09-06 at 13.51.50.png]
>>>>>
>>>>> The 'Details' link for 'Cloud Tests Remote Run' takes you to the
>>>>> workflow run on the wildfly-cloud-tests side.
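
A status entry with a 'Details' link like this could be created along the following lines. This is a sketch that assumes the reporter uses the GitHub commit status endpoint (`POST /repos/{owner}/{repo}/statuses/{sha}`); the thread does not say which call the workflow actually makes, and the helper name, commit SHA and run URL are hypothetical.

```python
import json
import urllib.request

# States accepted by the commit status endpoint.
VALID_STATES = {"pending", "success", "error", "failure"}


def build_status_request(owner: str, repo: str, sha: str, state: str,
                         target_url: str, token: str,
                         context: str = "Cloud Tests Remote Run"
                         ) -> urllib.request.Request:
    """Build (but do not send) a request that sets a commit status."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported state: {state}")
    body = json.dumps({
        "state": state,
        # target_url is what the 'Details' link on the PR points at.
        "target_url": target_url,
        "context": context,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


status_req = build_status_request(
    "wildfly", "wildfly", "0123abc", "pending",  # hypothetical commit SHA
    "https://example.invalid/remote-run",        # hypothetical run URL
    "<token>",
)
```

Posting again with the same `context` and a different `state` updates the same entry on the PR rather than adding a new one.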
>>>>>
>>>>> Once all the tests are run, the cloud tests
>>>>> wildfly-pull-request-runner.yml reports the job status back to wildfly,
>>>>> with another repository dispatch
>>>>> <https://github.com/wildfly-extras/wildfly-cloud-tests/blob/main/.github/w...>.
>>>>> Again, the *report-cloud-tests-pr-complete* event type is handled on
>>>>> the wildfly side by cloud-test-pr-reporter.yml
>>>>> <https://github.com/wildfly/wildfly/blob/main/.github/workflows/cloud-test...>,
>>>>> which executes a call to update the status for the job on the PR. In this
>>>>> case the job passed 🥳:
>>>>> [image: Screenshot 2024-09-06 at 14.45.28.png]
>>>>> As before, the 'Details' link takes you back to the job run on the
>>>>> cloud tests side.
>>>>>
>>>>> A small niggle is that the concurrency check on the WildFly cloud
>>>>> tests side will cancel all running jobs.
>>>>>
>>>>
>>>> What's the concurrency check?
>>>>
>>>> Ah, as I write my brain guesses that it's the thing that happens if
>>>> the PR branch is pushed again while jobs for a previous push are still
>>>> running.
>>>>
>>>
>>> Yeah, it cancels any in-flight ones coming from the same repository
>>> (wildfly) and having the same PR number. This is by design, although maybe
>>> what I wrote sounded like it is a problem.
>>>
>>> The actual problem is that what I am doing in the reporter to check the
>>> status of each job falls over when the status of those jobs is cancelled
>>> rather than success/failure. I don’t think that will be hard to fix but
>>> decided to merge anyway since I don’t think it is a huge problem in
>>> practice :-)
>>>
>>>
>>>>
>>>>> This currently causes the status reported back to be 'failed' due to
>>>>> something I still need to figure out. Ideally that should be cancelled.
>>>>> However, this is a bit of a corner case, since once the job is cancelled,
>>>>> and the new job starts the status will correctly be reported as 'pending'
>>>>> again.
>>>>>
>>>>
>>>> Yeah, doesn't sound like a big deal. <knocks-on-wood/>
>>>>
>>> I've improved on this a little bit. Actually, 'cancelled' isn't a valid
>>> status anyway, just 'pending', 'success' and 'error' + 'failure'. I've
>>> modified the end reporter to either report 'success' or 'failure'.
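
One way to collapse the per-job results into those two states is sketched below. The rule that anything other than 'success' (including 'cancelled') counts as a failure is an assumption for illustration; the thread only says that the end reporter now reports either 'success' or 'failure'. The function name is hypothetical.

```python
from typing import Iterable


def final_state(job_conclusions: Iterable[str]) -> str:
    """Collapse per-job conclusions into 'success' or 'failure'.

    Anything other than 'success' (for example 'cancelled') counts as a
    failure, and an empty list is treated as a failure too.
    """
    conclusions = list(job_conclusions)
    if conclusions and all(c == "success" for c in conclusions):
        return "success"
    return "failure"


print(final_state(["success", "success", "success"]))
print(final_state(["success", "cancelled", "success"]))
```

Treating unknown conclusions as failures keeps the reporter from falling over when a job ends in a state it did not anticipate.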
>>>
>>
>> Sounds good.
>>
>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> Kabir
>>>>>
>>>>> On Wed, 4 Sept 2024 at 18:00, Kabir Khan <kkhan(a)redhat.com> wrote:
>>>>>
>>>>>> I've implemented the space saving part, and I now think the tests
>>>>>> can remain where they are.
>>>>>>
>>>>>> I found that with the Kubernetes registry enabled I was able to push
>>>>>> and pull images from it. If I disable it and enable it again, the images I
>>>>>> pushed before restarting are no longer there. So it seems this cleans up
>>>>>> the registry, and should give a big space saving.
>>>>>>
>>>>>> I added a kubernetes-ci profile used by the GitHub Actions workflow,
>>>>>> which enables the registry before each test is run, and disables it after
>>>>>> it is run [1]. Here I also clean the image for each test from the local
>>>>>> docker registry, although here the space saving is less (I believe it is
>>>>>> just a layer containing the test deployment on top of the pre-built server
>>>>>> images).
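
The enable-before / disable-after behaviour could be expressed as a small wrapper like the one below. This is a sketch, not the contents of the kubernetes-ci profile: the names `fresh_registry` and `run` are hypothetical, and the only commands used are the two `minikube addons` invocations mentioned in this thread.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


def run(cmd: List[str]) -> None:
    """Default runner: execute the command and fail on a non-zero exit."""
    subprocess.run(cmd, check=True)


@contextmanager
def fresh_registry(runner: Runner = run) -> Iterator[None]:
    """Enable the minikube registry for one test and disable it afterwards."""
    runner(["minikube", "addons", "enable", "registry"])
    try:
        yield
    finally:
        # Disabling the addon is what appeared to discard the pushed images.
        runner(["minikube", "addons", "disable", "registry"])


# Dry run: record the commands instead of executing them.
recorded: List[List[str]] = []
with fresh_registry(recorded.append):
    recorded.append(["<run one test here>"])
print(recorded)
```

Putting the disable step in `finally` means the registry is also cleared when a test fails, so a failing test does not leave images behind for the next one.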
>>>>>>
>>>>>> For now I am keeping the server images built early on via the
>>>>>> -Pimages flag, since I think the space saving from pruning the Kubernetes
>>>>>> repository should be good enough for now. If this turns out to be a problem
>>>>>> if we ever get a lot more tests and server images, I think I can do
>>>>>> something in the scripts called by the kubernetes-ci profile to build those
>>>>>> on demand, and remove them after the tests have completed.
>>>>>>
>>>>>> The next step will be to look at the improved reporting back to the
>>>>>> WildFly PR I mentioned.
>>>>>>
>>>>>> [1] - https://github.com/wildfly-extras/wildfly-cloud-tests/pull/192
>>>>>>
>>>>>> On Thu, 29 Aug 2024 at 15:30, Brian Stansberry <
>>>>>> brian.stansberry(a)redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 29, 2024 at 4:57 AM Kabir Khan <kkhan(a)redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, I vaguely thought about that too...
>>>>>>>>
>>>>>>>> I can keep them in wildfly-extras for now, and improve the
>>>>>>>> reporting as mentioned, and then look into how to deal with the space
>>>>>>>> issue. I guess on the wildfly-extras side it will be a trigger job calling
>>>>>>>> out to the other ones, so the overall status report probably will not be as
>>>>>>>> tricky as I imagined.
>>>>>>>>
>>>>>>>
>>>>>>> Ok, good.
>>>>>>>
>>>>>>> An overall CI execution for a PR takes about 4.5 hours, due to the
>>>>>>> Windows jobs on TeamCity, so even if GH-action-based jobs ended up queuing
>>>>>>> sometimes it's unlikely to delay the entire PR cycle. These jobs take about
>>>>>>> 20 minutes and other ones we run should be faster. So really we shouldn't
>>>>>>> block moving things to wildfly. But optimizing any jobs that run in the
>>>>>>> wildfly GH org is important.
>>>>>>>
>>>>>>>
>>>>>>>> On Wed, 28 Aug 2024 at 16:53, Brian Stansberry <
>>>>>>>> brian.stansberry(a)redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 28, 2024 at 5:50 AM Kabir Khan <kkhan(a)redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> These tests need some modernisation, and there are two things in
>>>>>>>>>> my opinion that need addressing.
>>>>>>>>>>
>>>>>>>>>> *1 Space issues*
>>>>>>>>>> Recently we were running out of space when running these tests.
>>>>>>>>>> James fixed this by deleting the built WildFly, but when trying to
>>>>>>>>>> resurrect an old PR I had forgotten all about, we ran out of space again.
>>>>>>>>>>
>>>>>>>>>> I believe the issue is the way the tests work at the
>>>>>>>>>> moment, which is to:
>>>>>>>>>> * Start minikube with the registry
>>>>>>>>>> * Build all the test images
>>>>>>>>>> * Run all the tests
>>>>>>>>>>
>>>>>>>>>> Essentially we end up building all the server images (different
>>>>>>>>>> layers) before running the tests, which takes space, and then each test
>>>>>>>>>> installs the image into minikube's registry. Some tests also install
>>>>>>>>>> other images (e.g. postgres, strimzi) into the minikube instance.
>>>>>>>>>>
>>>>>>>>>> My initial thought was that it would be good to build the server
>>>>>>>>>> images more on demand, rather than before the tests, and to be able to call
>>>>>>>>>> 'docker system prune' now and again.
>>>>>>>>>>
>>>>>>>>>> However, this does not take into account the minikube registry,
>>>>>>>>>> which will also accumulate a lot of images. It will at least become
>>>>>>>>>> populated with the test images; I am unsure if it also becomes populated
>>>>>>>>>> with the images pulled from elsewhere (i.e. postgres, strimzi etc.)?
>>>>>>>>>>
>>>>>>>>>> If 'minikube addons disable registry' followed by a 'minikube
>>>>>>>>>> addons enable registry' deletes the registry contents from the disk, having
>>>>>>>>>> a hook to do that between each test could be something easy to look into.
>>>>>>>>>> Does anyone know if this is the case?
>>>>>>>>>>
>>>>>>>>>> An alternative could be to have one job building wildfly, and
>>>>>>>>>> uploading the maven repository as an artifact, and then have separate jobs
>>>>>>>>>> to run each test (or perhaps set of tests requiring the same WildFly server
>>>>>>>>>> image). However, as this test is quite fiddly since it runs remotely, I'm
>>>>>>>>>> not sure how the reporting would look.
>>>>>>>>>>
>>>>>>>>>> *2 Pull request trigger*
>>>>>>>>>> PRs in wildfly/wildfly execute a remote dispatch which results
>>>>>>>>>> in the job getting run in the wildfly-extras/wildfly-cloud-tests repository.
>>>>>>>>>>
>>>>>>>>>> There is no reporting back from the
>>>>>>>>>> wildfly-extras/wildfly-cloud-tests repository about the run id of the
>>>>>>>>>> resulting run.
>>>>>>>>>>
>>>>>>>>>> What I did when I implemented this was to have the calling
>>>>>>>>>> wildfly/wildfly job wait and poll a branch in
>>>>>>>>>> wildfly-extras/wildfly-cloud-tests for the results of the job (IIRC I have
>>>>>>>>>> a file with the triggering PR number). The job on the other side would then
>>>>>>>>>> write to this branch once the job is done. Which is all quite ugly!
>>>>>>>>>>
>>>>>>>>>> However, playing in other repositories, I found
>>>>>>>>>> https://www.kenmuse.com/blog/creating-github-checks/. Basically
>>>>>>>>>> this would result in
>>>>>>>>>> * the WildFly pull request trigger completing immediately once
>>>>>>>>>> it has done the remote dispatch
>>>>>>>>>> * When the wildfly-cloud-tests job starts it will do a remote
>>>>>>>>>> dispatch to wildfly, which will get picked up by a workflow which can add a
>>>>>>>>>> status check on the PR conversation page saying remote testing in
>>>>>>>>>> wildfly-cloud-tests is in progress
>>>>>>>>>> * Once the wildfly-cloud-tests job is done, it will do another
>>>>>>>>>> remote dispatch to wildfly, which will update the status check with
>>>>>>>>>> success/failure
>>>>>>>>>>
>>>>>>>>>> So we'd have two checks in the section rather than the current
>>>>>>>>>> one.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Other ideas*
>>>>>>>>>> While writing the above, the following occurred to me.
>>>>>>>>>>
>>>>>>>>>> The reason for the split is that the cloud test framework is
>>>>>>>>>> quite involved, and IMO does not belong in WildFly. So the remote dispatch
>>>>>>>>>> approach was used.
>>>>>>>>>>
>>>>>>>>>> However, I wonder now if a saner approach would be to update the
>>>>>>>>>> wildfly-cloud-tests workflow to be reusable so they can be used from
>>>>>>>>>> WildFly?
>>>>>>>>>>
>>>>>>>>>> That would allow the tests, test framework etc., and the
>>>>>>>>>> workflow to continue to live in wildfly-cloud-tests, while running in
>>>>>>>>>> wildfly itself. That should get rid of the remote dispatch issues, and make
>>>>>>>>>> that side of things simpler.
>>>>>>>>>>
>>>>>>>>>> It does not address the space issue, but I think if this
>>>>>>>>>> approach works, it will be easier to deal with the space issue.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> A downside is that means the 3 actual test jobs (e.g.
>>>>>>>>> https://github.com/wildfly-extras/wildfly-cloud-tests/actions/runs/105839...)
>>>>>>>>> run using the wildfly GH org's set of runners.
>>>>>>>>>
>>>>>>>>> Relying on wildfly-extras to get around that is a hack though.
>>>>>>>>> But if we're going to move these I think we need to optimize as much as
>>>>>>>>> possible, e.g. not rebuild WildFly multiple times.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Any thoughts/insights are welcome.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Kabir
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> wildfly-dev mailing list -- wildfly-dev(a)lists.jboss.org
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Brian Stansberry
>>>>>>>>> Principal Architect, Red Hat JBoss EAP
>>>>>>>>> WildFly Project Lead
>>>>>>>>> He/Him/His
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>
--
Brian Stansberry
Principal Architect, Red Hat JBoss EAP
WildFly Project Lead
He/Him/His