[wildfly-dev] Error reporting on deployment failure
Stuart Douglas
stuart.w.douglas at gmail.com
Thu Feb 15 18:15:24 EST 2018
I have opened https://github.com/wildfly/wildfly-core/pull/3114 to allow
for testing/further review.
Stuart
On Thu, Feb 15, 2018 at 11:32 PM, Stuart Douglas <stuart.w.douglas at gmail.com
> wrote:
>
>
> On Thu, Feb 15, 2018 at 6:51 PM, Brian Stansberry <
> brian.stansberry at redhat.com> wrote:
>
>> On Wed, Feb 14, 2018 at 9:37 PM, Stuart Douglas <
>> stuart.w.douglas at gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, Feb 14, 2018 at 4:43 PM, Brian Stansberry <
>>> brian.stansberry at redhat.com> wrote:
>>>
>>>> On Tue, Feb 13, 2018 at 8:24 PM, Stuart Douglas <
>>>> stuart.w.douglas at gmail.com> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I have been thinking a bit about the way we report errors in WildFly,
>>>>> and I think this is something that we can improve on. At the moment I think
>>>>> we are way too liberal with what we report, which results in a ton of
>>>>> services being listed in the error report that have nothing to do with the
>>>>> actual failure.
>>>>>
>>>>> As an example to work from I have created [1], which is a simple EJB
>>>>> application. It consists of 10 EJBs, one of which has a reference to a
>>>>> non-existent data source; the rest are simply empty no-op EJBs (just
>>>>> @Stateless on an empty class).
>>>>>
>>>>> This app fails to deploy because the java:global/NonExistant data
>>>>> source is missing, which gives the failure description in [2]. This is ~120
>>>>> lines long and lists multiple services for every single component in the
>>>>> application (part of the reason this is so long is because the failures are
>>>>> reported twice, once when the deployment fails and once when the server
>>>>> starts).
>>>>>
>>>>> I think we can improve on this. In every failure case there will be a
>>>>> small set of root causes that are all the end user cares about, and we
>>>>> should limit our reporting to just those, rather than listing every
>>>>> internal service that can no longer start due to missing transitive deps.
>>>>>
>>>>> In particular these root causes are:
>>>>> 1) A service threw an exception in its start() method and failed to
>>>>> start
>>>>> 2) A dependency is actually missing (i.e. not installed, not just not
>>>>> started)
>>>>>
>>>>> I think that one or both of these two cases will be the root cause of
>>>>> any failure, and as such that is all we should be reporting on.
>>>>>
>>>>> We already do an OK job of handling case 1), services that have failed,
>>>>> as they get their own line item in the error report. However, case 2)
>>>>> results in a huge report that lists every service that has not come up, no
>>>>> matter how far removed it is from the actual problem.
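
[Editorial note: the distinction between the two root-cause categories can be sketched in plain Java. This is not WildFly/MSC code; the service names, the State enum, and the maps below are hypothetical stand-ins for MSC's service registry, and NEVER-mode services are treated as missing per the follow-up in this thread.]

```java
import java.util.*;

// Hypothetical sketch of the two root-cause categories: a service whose
// start() threw (FAILED), and a dependency that is genuinely missing
// (not installed at all, or installed with mode NEVER).
public class RootCauses {
    enum State { UP, DOWN, FAILED, NEVER }

    // installed: service name -> current state
    // deps: service name -> names of its dependencies
    static SortedSet<String> findRootCauses(Map<String, State> installed,
                                            Map<String, List<String>> deps) {
        SortedSet<String> roots = new TreeSet<>();
        for (Map.Entry<String, State> e : installed.entrySet()) {
            // Case 1: the service's start() method threw an exception.
            if (e.getValue() == State.FAILED) {
                roots.add(e.getKey() + " (failed)");
            }
            // Case 2: a dependency is actually missing -- not installed,
            // or installed with mode NEVER (treated the same as missing).
            for (String dep : deps.getOrDefault(e.getKey(), List.of())) {
                State depState = installed.get(dep);
                if (depState == null || depState == State.NEVER) {
                    roots.add(dep + " (missing)");
                }
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        Map<String, State> installed = new HashMap<>();
        installed.put("ejb.ErrorEjb", State.DOWN);
        installed.put("datasource.Broken", State.FAILED);
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("ejb.ErrorEjb", List.of("naming.java.global.NonExistant"));
        // Only the failed service and the uninstalled dependency are
        // reported; ejb.ErrorEjb itself (merely down) is not.
        System.out.println(findRootCauses(installed, deps));
    }
}
```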
>>>>>
>>>>
>>>> If the 2) case can be correctly determined, then +1 to reporting some
>>>> new section and not reporting the current "WFLYCTL0180: Services with
>>>> missing/unavailable dependencies" section. The WFLYCTL0180 section could
>>>> only be reported as a fallback if for some reason the 1) and 2) stuff is
>>>> empty.
>>>>
>>>
>>> I have adjusted this a bit so a service with mode NEVER is treated the
>>> same as if it is missing. I am pretty sure that with this change 1) and 2)
>>> will cover 100% of cases.
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> I think we could make a change to the way this is reported so that
>>>>> only direct problems are reported [3], so the error report would look
>>>>> something like [4] (note that this commit only changes the operation
>>>>> report, the container state reporting after boot is still quite verbose).
>>>>>
>>>>
>>>> I think the container state reporting is ok. IMHO the proper fix to the
>>>> container state reporting is to roll back and fail boot if Stage.RUNTIME
>>>> failures occur. Configurable, but roll back by default. If we did that there
>>>> would be no container state reporting. If you deploy your broken app
>>>> post-boot you shouldn't see the container state reporting because by the
>>>> time the report is written the op should have rolled back and the services
>>>> are no longer "missing". It's only because we don't roll back on boot that
>>>> this is reported.
>>>>
>>>
>>> I don't think it is necessary to report on services that are only down
>>> because their dependencies are down. It basically just adds noise, as they
>>> are not really related to the underlying issue. I have expanded my branch
>>> to also do this:
>>>
>>> https://github.com/wildfly/wildfly-core/compare/master...stuartwdouglas:error-reporting?expand=1
>>>
>>> This ends up with very concise reports that just detail the services
>>> that are the root cause of the problem:
>>> https://gist.github.com/stuartwdouglas/42a68aaaa130ceee38ca5f66d0040de3
>>>
>>> Does this approach seem reasonable? If a user really does want a
>>> complete dump of all services that are down, that information is still
>>> available directly from MSC anyway.
>>>
>>
>> It seems reasonable.
>>
>> I'm going to get all lawyerly now, because while we don't treat our
>> failure messages as "API" requiring compatibility, for these particular
>> ones I think we should be as careful as possible.
>>
>> 1) "WFLYCTL0180: Services with missing/unavailable dependencies" => ["jboss.naming.context.java.comp.\"error-reporting-1.0-SNAPSHOT\".\"error-reporting-1.0-SNAPSHOT\".ErrorEjb.env.\"com.stuartdouglas.ErrorEjb\".nonExistant is missing [jboss.naming.context.java.global.NonExistant]"]
>>
>> Here you've somewhat repurposed an existing message. That can be quite ok
>> IMHO so long as what's gone is just noise and the English meaning of the
>> message is still correct. Basically, what did "missing/unavailable
>> dependencies" mean before, what does it mean now, and is there a clear
>> story behind the shift from A to B. The "missing" part is pretty clear --
>> not installed or NEVER is "missing". For "unavailable" now we've dropped
>> the installed but unstarted ones. If we're including the ones that directly
>> depend on *failed* services then that's a coherent definition of
>> "unavailable". If we're not then "unavailable" is misleading. Sorry, I'm
>> juggling stuff so I haven't checked the code. :(
>>
>
> Previously this section would display every service that was down due to
> its dependencies being down. This would include services that were many
> levels away from the actual problem (e.g. if A depends on B, which depends
> on C, which depends on D, which is down, then A, B and C would all be listed
> in this section). This change displays the same information, but only for
> direct dependents, so in the example above only C would be listed in this
> section.
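
[Editorial note: the direct-dependent filtering described above can be sketched as follows. This is an illustrative stdlib sketch, not the actual WildFly change; the report wording and names are hypothetical, modeled on the "additional services are down" message quoted later in the thread.]

```java
import java.util.*;

// Sketch of reporting only direct dependents of a root problem.
// With A -> B -> C -> D and D missing, only C names D in the report;
// A and B are merely summarized in a trailing count.
public class DirectReport {
    // deps: service name -> its dependencies; missing: root-cause services
    static List<String> report(Map<String, List<String>> deps, Set<String> missing) {
        List<String> lines = new ArrayList<>();
        int indirect = 0;
        for (Map.Entry<String, List<String>> e : new TreeMap<>(deps).entrySet()) {
            List<String> direct = new ArrayList<>();
            for (String dep : e.getValue()) {
                if (missing.contains(dep)) direct.add(dep);
            }
            if (!direct.isEmpty()) {
                // Direct dependent of a root problem: name it and its missing deps.
                lines.add(e.getKey() + " is missing " + direct);
            } else {
                // Down only because a transitive dependency is down: count it.
                indirect++;
            }
        }
        if (indirect > 0) {
            lines.add(indirect + " additional services are down due to their dependencies being missing or failed");
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
            "A", List.of("B"),
            "B", List.of("C"),
            "C", List.of("D")); // D is not installed
        System.out.println(report(deps, Set.of("D")));
    }
}
```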
>
> The 'New missing/unsatisfied dependencies:' section in the container state
> report is similar. Previously it would list every service that had failed
> to come up; now it will only list services that are directly affected by a
> problem.
>
>
>>
>> 2) I think "38 additional services are down due to their dependencies
>> being missing or failed" should have a message code, not NONE. It's a
>> separate message that may or may not appear. Plus it's new. And I think
>> we're better off in these complex message structures to be precise vs
>> trying to avoid codes for cosmetic reasons.
>>
>
> Ok.
>
> Stuart
>
>
>>
>>
>>
>>> Stuart
>>>
>>>
>>>>
>>>>>
>>>>> I am guessing that this is not as simple as it sounds, otherwise it
>>>>> would have already been addressed, but I think we can do better than the
>>>>> current state of affairs, so I thought I would get a discussion started.
>>>>>
>>>>
>>>> It sounds pretty simple. Any "problem" ServiceController exposes its
>>>> ServiceContainer, and if relying on that registry to check if a missing
>>>> dependency is installed is not correct for some reason, the
>>>> ModelControllerImpl exposes its ServiceRegistry via a package protected
>>>> getter. So AbstractOperationContext can provide that to the SVH.
>>>>
>>>>
>>>>> Stuart
>>>>>
>>>>> [1] https://github.com/stuartwdouglas/errorreporting
>>>>> [2] https://gist.github.com/stuartwdouglas/b52a85813913f3304301eeb1f389fae8
>>>>> [3] https://github.com/stuartwdouglas/wildfly-core/commit/a1fbc831edf290971d54c13dd1c5d15719454f85
>>>>> [4] https://gist.github.com/stuartwdouglas/14040534da8d07f937d02f2f08099e8d
>>>>>
>>>>> _______________________________________________
>>>>> wildfly-dev mailing list
>>>>> wildfly-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Brian Stansberry
>>>> Manager, Senior Principal Software Engineer
>>>> Red Hat
>>>>
>>>
>>>
>>
>>
>> --
>> Brian Stansberry
>> Manager, Senior Principal Software Engineer
>> Red Hat
>>
>
>