I'm attempting to fix
http://jira.rhq-project.org/browse/RHQ-1930.
I need my latest checkin (rev 3634) peer reviewed and tested. I ran it over here and it
looks like it's working (I even stepped through it in a debugger to see it working).
I basically wrap calls to the discovery components in a proxy that times out the
invocations if the component takes too long (30 seconds, hardcoded - yes, it should be
configurable, that would be nice :).
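
For anyone who has not looked at the checkin yet, here is a rough sketch of the general
technique. This is NOT the code in rev 3634. The names TimeoutProxySketch, Discovery,
TimeoutHandler and wrap are made up for illustration, and it assumes the component is
called through an interface whose methods declare "throws Exception":

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutProxySketch {

    /** Hypothetical stand-in for the real discovery component interface. */
    interface Discovery {
        String discoverResources() throws Exception;
    }

    /** Runs each call on a pool thread and gives up waiting after timeoutMillis. */
    static class TimeoutHandler implements InvocationHandler {
        private final Object target;
        private final ExecutorService pool;
        private final long timeoutMillis;

        TimeoutHandler(Object target, ExecutorService pool, long timeoutMillis) {
            this.target = target;
            this.pool = pool;
            this.timeoutMillis = timeoutMillis;
        }

        @Override
        public Object invoke(Object proxy, final Method method, final Object[] args) throws Throwable {
            Future<Object> future = pool.submit(new Callable<Object>() {
                @Override
                public Object call() throws Exception {
                    return method.invoke(target, args);
                }
            });
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // cancel(true) only interrupts the worker; a thread that is deadlocked
                // or spinning in a loop ignores the interrupt and stays alive.
                future.cancel(true);
                throw e;
            } catch (ExecutionException e) {
                // Unwrap so the caller sees the component's own exception.
                Throwable cause = e.getCause();
                if (cause instanceof InvocationTargetException && cause.getCause() != null) {
                    cause = cause.getCause();
                }
                throw cause;
            }
        }
    }

    /** Wraps target in a dynamic proxy that enforces the timeout on every call. */
    static <T> T wrap(Class<T> iface, T target, ExecutorService pool, long timeoutMillis) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface },
                new TimeoutHandler(target, pool, timeoutMillis));
        return iface.cast(proxy);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        Discovery slow = () -> { Thread.sleep(5000); return "too late"; };
        Discovery guarded = wrap(Discovery.class, slow, pool, 200);
        try {
            guarded.discoverResources();
        } catch (TimeoutException e) {
            System.out.println("discovery timed out");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that the caller gets control back after the timeout, but the worker thread is only
interrupted, not killed. A component that ignores the interrupt keeps its thread.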
Do you see anything missing?
There is one thing I already know about: the thread pool can grow unbounded if a discovery
component misbehaves. In other words, if a discovery component's discoverResource
method consistently deadlocks or enters an infinite loop, our thread pool will keep growing
until we run out of memory. For example, if we run discovery every 15 minutes, then over a
span of one day, 96 deadlocked threads will have been created but not terminated; after 10
days that is 960 threads, and so on. This is why I think I need to add the ability for the agent
to disable a misbehaving discovery component. Something like, "if a discovery
component times out N times, it should be disabled and never invoked again".
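
To make the idea concrete, something along these lines is what I have in mind. This is
only a sketch and nothing like it exists in the checkin. The class name
TimeoutBlacklistSketch, its methods, and keying components by a String name are all
assumptions for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical tracker: disables a component once it has timed out N times. */
class TimeoutBlacklistSketch {
    private final int maxTimeouts; // the "N" above
    private final Map<String, AtomicInteger> timeouts = new ConcurrentHashMap<>();

    TimeoutBlacklistSketch(int maxTimeouts) {
        this.maxTimeouts = maxTimeouts;
    }

    /** Call this each time the proxy reports a timeout for the component. */
    void recordTimeout(String componentName) {
        timeouts.computeIfAbsent(componentName, k -> new AtomicInteger()).incrementAndGet();
    }

    /** Check this before invoking the component; true means skip it. */
    boolean isDisabled(String componentName) {
        AtomicInteger count = timeouts.get(componentName);
        return count != null && count.get() >= maxTimeouts;
    }

    public static void main(String[] args) {
        TimeoutBlacklistSketch blacklist = new TimeoutBlacklistSketch(3);
        for (int i = 0; i < 3; i++) {
            blacklist.recordTimeout("badPlugin");
        }
        System.out.println(blacklist.isDisabled("badPlugin"));
        System.out.println(blacklist.isDisabled("goodPlugin"));
    }
}
```

With N = 3 and a 15 minute discovery interval, a component that always hangs would leak at
most 3 threads instead of 96 per day. The count here never resets, which matches "never
invoked again"; whether it should reset after a successful run is open for discussion.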
Thoughts?