Yes, we had some BZs/JIRAs opened about this before. We get cases where
a customer's application is failing with pages upon pages of dependency
errors, and the customer cannot easily determine the issue. Even
support has difficulty; we usually try searching for common things like
datasources or other JNDI references that might be missing, but I have
seen several cases where it was not a datasource and it took a while of
tearing the app apart to resolve. It looks like there was some improvement
in EAP 7.1 [2], but it sounds like Stuart's PR may be even better.
I found one example deployment on [1] that we could try and see what the
logging looks like with the new PR.
I figure the service dump would show all of the failed dependencies in
case there was a need to look at the others?
[1]
I have opened
https://github.com/wildfly/wildfly-core/pull/3114 to
allow for testing/further review.
Stuart
On Thu, Feb 15, 2018 at 11:32 PM, Stuart Douglas
<stuart.w.douglas(a)gmail.com> wrote:
On Thu, Feb 15, 2018 at 6:51 PM, Brian Stansberry
<brian.stansberry(a)redhat.com> wrote:
On Wed, Feb 14, 2018 at 9:37 PM, Stuart Douglas
<stuart.w.douglas(a)gmail.com> wrote:
On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry
<brian.stansberry(a)redhat.com> wrote:
On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas
<stuart.w.douglas(a)gmail.com> wrote:
Hi Everyone,
I have been thinking a bit about the way we report
errors in WildFly, and I think this is something
that we can improve on. At the moment I think we
are way too liberal with what we report, which
results in a ton of services being listed in the
error report that have nothing to do with the
actual failure.
As an example to work from I have created [1],
which is a simple EJB application. This consists
of 10 EJBs, one of which has a reference to a
non-existent data source; the rest are simply
empty no-op EJBs (just @Stateless on an empty class).
This app fails to deploy because the
java:global/NonExistant data source is missing,
which gives the failure description in [2]. This
is ~120 lines long and lists multiple services for
every single component in the application (part of
the reason this is so long is because the failures
are reported twice, once when the deployment fails
and once when the server starts).
I think we can improve on this. I think in every
failure case there will be some root causes that
are all the end user cares about, and we should
limit our reporting to just these cases, rather
than listing every internal service that can no
longer start due to missing transitive deps.
In particular these root causes are:
1) A service threw an exception in its start()
method and failed to start
2) A dependency is actually missing (i.e. not
installed, not just not started)
I think that one or both of these two cases will
be the root cause of any failure, and as such that
is all we should be reporting on.
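The two-bucket classification above could be sketched roughly like this
(a minimal illustration with hypothetical types, not the real JBoss MSC
API; the service names are made up to mirror the example app):

```java
import java.util.ArrayList;
import java.util.List;

class RootCauseClassifier {

    // The two root-cause kinds, plus the transitive noise we want to drop.
    enum ProblemKind { FAILED_START, MISSING_DEPENDENCY, TRANSITIVE }

    record Problem(String service, ProblemKind kind, String detail) {}

    // Keep only the two root-cause kinds; everything else is derived noise.
    static List<Problem> rootCauses(List<Problem> all) {
        List<Problem> roots = new ArrayList<>();
        for (Problem p : all) {
            if (p.kind() == ProblemKind.FAILED_START
                    || p.kind() == ProblemKind.MISSING_DEPENDENCY) {
                roots.add(p);
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        List<Problem> all = List.of(
            new Problem("jboss.naming.context.java.comp.app.BeanA",
                ProblemKind.MISSING_DEPENDENCY,
                "jboss.naming.context.java.global.NonExistant"),
            new Problem("ejb.BeanB", ProblemKind.TRANSITIVE, "down via BeanA"),
            new Problem("ejb.BeanC", ProblemKind.TRANSITIVE, "down via BeanB"));
        // Only the single MISSING_DEPENDENCY problem survives filtering.
        System.out.println(rootCauses(all).size());
    }
}
```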
We already do an OK job of handling case 1),
services that have failed, as they get their own
line item in the error report; however case 2)
results in a huge report that lists every service
that has not come up, no matter how far removed
they are from the actual problem.
If the 2) case can be correctly determined, then +1 to
reporting some new section and not reporting the
current "WFLYCTL0180: Services with
missing/unavailable dependencies" section. The
WFLYCTL0180 section could only be reported as a
fallback if for some reason the 1) and 2) stuff is empty.
I have adjusted this a bit so a service with mode NEVER is
treated the same as if it is missing. I am pretty sure
that with this change 1) and 2) will cover 100% of cases.
I think we could make a change to the way this is
reported so that only direct problems are reported
[3], so the error report would look something like
[4] (note that this commit only changes the
operation report, the container state reporting
after boot is still quite verbose).
I think the container state reporting is ok. IMHO the
proper fix to the container state reporting is to
rollback and fail boot if Stage.RUNTIME failures
occur. Configurable, but rollback by default. If we
did that there would be no container state reporting.
If you deploy your broken app post-boot you shouldn't
see the container state reporting because by the time
the report is written the op should have rolled back
and the services are no longer "missing". It's only
because we don't rollback on boot that this is reported.
I don't think it is necessary to report on services that
are only down because their dependencies are down. It
basically just adds noise, as they are not really related
to the underlying issue. I have expanded my branch to also
do this:
https://github.com/wildfly/wildfly-core/compare/master...stuartwdouglas:e...
This ends up with very concise reports that just detail
the services that are the root cause of the problem:
https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3
Does this approach seem reasonable? If a user really does
want a complete dump of all services that are down, that
information is still available directly from MSC anyway.
It seems reasonable.
I'm going to get all lawyerly now. This is because while we
don't treat our failure messages as "API" requiring
compatibility, for these particular ones I think we should be
as careful as possible.
1) "WFLYCTL0180: Services with missing/unavailable
dependencies" => ["jboss.naming.context.java.comp.\"error-reporting-1.0-SNAP...
is missing [jboss.naming.context.java.global.NonExistant]"]
Here you've somewhat repurposed an existing message. That can
be quite ok IMHO so long as what's gone is just noise and the
English meaning of the message is still correct. Basically,
what did "missing/unavailable dependencies" mean before, what
does it mean now, and is there a clear story behind the shift
from A to B. The "missing" part is pretty clear -- not
installed or NEVER is "missing". For "unavailable" now
we've
dropped the installed but unstarted ones. If we're including
the ones that directly depend on *failed* services then that's
a coherent definition of "unavailable". If we're not then
"unavailable" is misleading. Sorry, I'm juggling stuff so I
haven't checked the code. :(
Previously this section would display every service that was down
due to its dependencies being down. This would include services
that were many levels away from the actual problem (e.g. if A
depends on B which depends on C which depends on D which is down,
A, B and C would all be listed in this section). This change
displays the same information, but only for direct dependents, so
in the example above only C would be listed in this section.
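That filtering can be sketched as follows (the graph model here is
hypothetical, not MSC's real data structures; only services that
directly depend on a missing service are reported):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DirectDependents {

    // dependencies: service -> the services it depends on
    static Set<String> directlyAffected(Map<String, List<String>> dependencies,
                                        Set<String> missing) {
        Set<String> result = new TreeSet<>();
        for (var e : dependencies.entrySet()) {
            for (String dep : e.getValue()) {
                if (missing.contains(dep)) {
                    result.add(e.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A depends on B, B on C, C on D; D is missing.
        Map<String, List<String>> deps = Map.of(
            "A", List.of("B"),
            "B", List.of("C"),
            "C", List.of("D"));
        // Only C, the direct dependent of D, is reported -- not A or B.
        System.out.println(directlyAffected(deps, Set.of("D")));
    }
}
```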
The 'New missing/unsatisfied dependencies:' section in the
container state report is similar. Previously it would list every
service that had failed to come up; now it will only list services
that are directly affected by a problem.
2) I think "38 additional services are down due to their
dependencies being missing or failed" should have a message
code, not NONE. It's a separate message that may or may not
appear. Plus it's new. And I think we're better off in these
complex message structures to be precise vs trying to avoid
codes for cosmetic reasons.
Ok.
Stuart
Stuart
I am guessing that this is not as simple as it
sounds, otherwise it would have already been
addressed, but I think we can do better than the
current state of affairs, so I thought I would get
a discussion started.
It sounds pretty simple. Any "problem"
ServiceController exposes its ServiceContainer, and if
relying on that registry to check if a missing
dependency is installed is not correct for some
reason, the ModelControllerImpl exposes its
ServiceRegistry via a package protected getter. So
AbstractOperationContext can provide that to the SVH.
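The registry check described above might look something like this
(Registry and its method are illustrative stand-ins, not the actual
org.jboss.msc API; the point is distinguishing "not installed" from
"installed but not started"):

```java
import java.util.Map;

class MissingCheck {

    // Hypothetical stand-in for a service registry lookup.
    interface Registry {
        boolean isInstalled(String serviceName);
    }

    // "Missing" means the dependency is not installed at all;
    // "unavailable" means it is installed but has not started.
    static String classify(Registry registry, String dependency) {
        return registry.isInstalled(dependency)
            ? "unavailable (installed but not started)"
            : "missing (not installed)";
    }

    public static void main(String[] args) {
        Map<String, Boolean> installed = Map.of(
            "jboss.naming.context.java.global.NonExistant", false,
            "some.slow.starting.service", true);
        Registry registry = name -> installed.getOrDefault(name, false);
        System.out.println(classify(registry,
            "jboss.naming.context.java.global.NonExistant"));
    }
}
```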
Stuart
[1]
https://github.com/stuartwdouglas/errorreporting
[2]
https://gist.github.com/stuartwdouglas/b52a85813913f3304301eeb1f389fae8
[3]
https://github.com/stuartwdouglas/wildfly-core/commit/a1fbc831edf290971d5...
[4]
https://gist.github.com/stuartwdouglas/14040534da8d07f937d02f2f08099e8d
_______________________________________________
wildfly-dev mailing list
wildfly-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/wildfly-dev
--
Brian Stansberry
Manager, Senior Principal Software Engineer
Red Hat