Availability metrics: aggregate stats series
by Joel Takvorian
I'm still aiming to add some features to the Grafana plugin. I've started
to integrate availabilities, but I'm now facing a problem when it comes to
showing aggregated availabilities; for example, think of an OpenShift pod
that is scaled up to several instances.
Since availability is basically "up" or "down" (or, to simplify with the
other states such as "unknown", say it's either "up" or "non-up"), I
propose to add this new feature: availability stats with aggregation. The
call would be parameterized with an aggregation method, which would be
either "all of" or "any of": with "all of" we consider that the aggregated
series is UP when all its parts are UP.
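As a rough illustration of the proposed aggregation (not the actual Hawkular
Metrics implementation), here is a minimal Python sketch. The function name,
the "all_of"/"any_of" spellings and the list-of-buckets layout are assumptions
made for the example, and "any of" is read as "UP when at least one part is
UP", which the proposal implies but does not spell out.

```python
from typing import List

UP = "up"
DOWN = "down"


def aggregate_availability(series: List[List[str]], method: str) -> List[str]:
    """Aggregate several availability series bucket by bucket.

    Each inner list holds one state per time bucket. Any state other than
    "up" (for example "down" or "unknown") is treated as non-up.
    The method names are hypothetical, chosen for this sketch only.
    """
    if method not in ("all_of", "any_of"):
        raise ValueError(f"unknown aggregation method: {method}")
    combine = all if method == "all_of" else any
    result = []
    # zip() walks the series in lockstep, one time bucket at a time.
    for bucket in zip(*series):
        is_up = combine(state == UP for state in bucket)
        result.append(UP if is_up else DOWN)
    return result


# Two instances of the same pod, four time buckets each.
pods = [
    ["up", "up", "down", "unknown"],
    ["up", "down", "down", "up"],
]
all_of = aggregate_availability(pods, "all_of")
any_of = aggregate_availability(pods, "any_of")
```

With "all_of" only the first bucket is up; with "any_of" every bucket except
the third one is up.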
It would require a new endpoint, since the AvailabilityHandler currently
only exposes stats queries with the metric id as a query parameter, which
is not suitable for multiple metrics.
Any objections or remarks on this feature?
6 years, 6 months
[Inventory] Performance of Tinkerpop3 backends
by Lukas Krejci
To move inventory forward, we need to port it to Tinkerpop3 - a new(ish) and
actively maintained version of the Tinkerpop graph API.
Apart from the huge improvement in the API expressiveness and capabilities,
the important thing is that it comes with a variety of backends, 2 of which
are of particular interest to us ATM. The Titan backend (with Titan in version
1.0) and SQL backend (using the sqlg library).
The SQL backend is a much improved (yet still unfinished in terms of
optimizations and some corner case features) version of the toy SQL backend
Back in March I ran performance comparisons for SQL/postgres and Titan (0.5.4)
on Tinkerpop2 and concluded that Titan was the best choice then.
After completing a simplistic port of inventory to Tinkerpop3 (not taking
advantage of any new features or opportunities to simplify inventory
codebase), I've run the performance tests again for the 2 new backends - Titan
1.0 and Sqlg (on postgres).
This time the results are not as clear as last time.
From the charts you can see that Postgres is actually quite a bit faster
on reads and can better handle concurrent read access while Titan shines in
writes (arguably thanks to Cassandra as its storage).
Of course, I can imagine that the read performance advantage of Postgres would
decrease with the growing amount of data stored (the tests ran with the
inventory size of ~10k entities) but I am quite positive we'd get competitive
read performance from both solutions up to the sizes of inventory we
anticipate (100k-1M entities).
Now the question is whether insert performance in Postgres is something we
should worry about much. IMHO, there should be some room for
improvement in Sqlg and also our move to /sync for agent synchronization would
make this less of a problem (because there would not be that many initial
imports that would create vast amounts of entities).
Nevertheless I currently cannot say who is the "winner" here. Each backend has
its pros and cons:
Titan
pros:
- high write throughput
- backed by cassandra
cons:
- slower reads
- project virtually dead
- complex codebase (self-made fixes unlikely)

Sqlg
pros:
- small codebase
- everybody knows SQL
- faster reads
- faster concurrent reads
cons:
- slow writes
- another backend needed (Postgres)
Therefore my intention here is to go forward with a "proper" port to
Tinkerpop3 with Titan still enabled but focus primarily on Sqlg to see if we
can do anything with the write performance.
IMHO, any choice we make is "workable" as it is even today but we need to
weigh in the productization requirements. For those Sqlg with its small dep
footprint and postgres backend seems preferable to the huge dependency mess of
6 years, 6 months
Hawkular APM version 0.10.0.Final released with Zipkin integration
by Gary Brown
We are pleased to announce the release of version 0.10.0 of the Hawkular APM project.
The release notes can be found here: https://github.com/hawkular/hawkular-apm/releases/tag/0.10.0.Final
The highlights for this release are:
* In addition to the existing aggregated view of service invocations, it is now possible to view the list of trace instances for that (potentially filtered) aggregated view, and then select an individual instance to display the end-to-end call trace for that instance. (Screenshots are available in the release notes.)
* Zipkin integration. It is now possible to point applications instrumented with Zipkin-compliant libraries at the Hawkular APM server and have their information processed and visually represented in the UI.
Feedback on the new features would be very welcome!
Hawkular APM Team
6 years, 7 months
metrics on the bus
by Jay Shaughnessy
Lucas and I were talking over jira  which has to do with
metrics/alerting scale. This was discussed a bit on IRC recently as
well. Today, metrics publishes all datapoints to the bus (metrics and
avail go to different topics). The only consumer of that data is
alerting, and it consumes a small fraction of the total data (actually
it consumes none of it OOB at the moment, but that will hopefully change
as Lucas's alerting work comes on line in MIQ).
Although in its purest form this publish-it-all is the essence of bus
publishing, we both feel it's an unnecessary waste of resources, as
metrics can reach very high volume. There are a few approaches to
reducing the publishing/filtering that we're currently doing. The
options we discussed boil down to:
* No Publishing
  o Just query metrics for the data needed for alerting (or whatever
    other external use we may have for the data)
  o This is essentially a polling approach with frequent polling
* Demand Publishing
  o The "just tell me what movie you want to see" approach
  o Let clients request the metric ids they want published to the bus
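To make the "Demand Publishing" option a little more concrete, here is a
minimal Python sketch. It is only an illustration under assumptions: the
DemandPublisher class, its method names and the (metric id, timestamp, value)
tuple are all hypothetical, and the bus is stood in for by a plain callback.

```python
from typing import Callable, Dict, Set, Tuple

DataPoint = Tuple[str, int, float]  # (metric id, timestamp, value)


class DemandPublisher:
    """Publishes a data point only if some client asked for its metric id."""

    def __init__(self, publish: Callable[[DataPoint], None]) -> None:
        self._publish = publish  # stand-in for the real bus producer
        self._requested: Dict[str, Set[str]] = {}  # metric id -> client ids

    def request(self, client_id: str, metric_id: str) -> None:
        """A client asks for a metric id to be published."""
        self._requested.setdefault(metric_id, set()).add(client_id)

    def release(self, client_id: str, metric_id: str) -> None:
        """A client no longer needs the metric id."""
        clients = self._requested.get(metric_id)
        if clients is None:
            return
        clients.discard(client_id)
        if not clients:
            # Nobody is interested any more, so stop publishing it.
            del self._requested[metric_id]

    def on_datapoint(self, point: DataPoint) -> bool:
        """Called for every stored data point; returns True if published."""
        if point[0] not in self._requested:
            return False
        self._publish(point)
        return True


bus = []  # collects what would have gone to the bus
publisher = DemandPublisher(bus.append)
publisher.request("alerting", "cpu.load")
publisher.on_datapoint(("cpu.load", 1, 0.7))     # requested, published
publisher.on_datapoint(("heap.used", 1, 512.0))  # never requested, dropped
publisher.release("alerting", "cpu.load")
publisher.on_datapoint(("cpu.load", 2, 0.9))     # released, dropped
```

Only the first data point reaches the bus here; everything nobody asked for
stays in metrics storage.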
I'm purposefully not going into much detail at this point. I'd rather
we talk out a preferred approach between these two, or something not
presented. But we'd like to move away from the current publish-it-all
6 years, 7 months
change notification: parameter metadata is in a different place
by John Mazzitelli
Recently, we added the ability for the Hawkular WildFly Agent to advertise what parameters can get passed to an operation by storing parameter metadata in the operation type's general properties.
However, Hawkular-Inventory provides an "official" place to store this data. Rather than have general properties host this metadata, H-Inventory wants parameters stored in a child data entity called "parameterTypes" under the operation type. 
The agent now does it this "official" way. However, to keep clients from breaking before they can be updated to read parameters from the new location, the agent retains the original parameter metadata in general properties as well.
But of course we do not want the agent to store copies of the same metadata in two different locations in Hawkular-Inventory. So a JIRA has been created to remove the parameters from general properties; a PR has been submitted and is ready to be merged.
So, clients that look up operation parameters inside H-Inventory need to look at the parameterTypes data entity child and NOT look for them in general properties. If you have a client (MiQ?, Ruby gem?, Hawkfx?) that obtains parameter metadata from operation types' general properties, it should be fixed because once this PR is merged, parameter information will no longer exist in general properties.
-- John Mazz
This is what the "official" parameter types entity looks like. Parameters are stored in a data entity child called "parameterTypes" under the operation type. This example shows the parameters for the WildFly Server's "Shutdown" operation (the parameters are "timeout" and "restart"):
"description": "Timeout in seconds to allow active connections to drain",
"description": "Should the server be restarted after shutdown?",
6 years, 7 months
services itest failure
by John Mazzitelli
Regarding the services itest failure with metric storage from the agent:
When running on my desktop, I don't even get that far - tests can't even start without timing out.
When running on my laptop (which is a faster machine with SSD), I confirmed the failure (ran it 2 times, failed the same way both times).
I changed this line:
and ran it several times; it passed every time.
So it looks like a timing issue: the test just didn't wait long enough for the data to come in (which would explain the 204 it was getting). Moving the max timeout from 5s to 50s solved it for me.
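For reference, the usual way to make such a test robust is to poll until the data shows up or a maximum timeout expires, rather than checking once. The Python sketch below is a generic illustration, not the actual itest code: wait_for, fake_fetch and the parameter names are made up for the example, and a None result stands in for the 204 response.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(fetch: Callable[[], Optional[T]],
             timeout_s: float,
             interval_s: float = 0.5) -> T:
    """Call fetch() repeatedly until it returns data or the timeout expires.

    fetch() returning None models "no content yet" (the 204 case).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no data after {timeout_s}s")
        time.sleep(interval_s)


# Fake endpoint for the example: empty on the first two calls, data on the third.
attempts = {"count": 0}


def fake_fetch() -> Optional[list]:
    attempts["count"] += 1
    return [42.0] if attempts["count"] >= 3 else None


data = wait_for(fake_fetch, timeout_s=2.0, interval_s=0.01)
```

A generous maximum timeout costs nothing on a fast machine, because the loop returns as soon as the data arrives.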
6 years, 7 months
Integration tests - Improvements
by Lucas Ponce
Today we have a situation with the integration tests that is starting to be a concern.
Basically, a lack of integration testing lets some potential bugs escape preliminary checks, and they only show up in ManageIQ.
I guess we could increase the integration test coverage overall and define some additional processes to improve this.
These are some ideas discussed with the team:
- Today, we have some coverage per component, but not at the hawkular-services level. It would be good to add more end-to-end tests at the hawkular-services level.
- Before a component is released, it would be good if the component could run the hawkular-services integration suite to validate that nothing is broken or, if something is, to start a discussion and reach a consensus/tradeoff. Everything may look fine from the isolated component's view yet be a potential problem in the hawkular-services context. (Not sure whether the CI can help us here by running the new component against the hawkular-services itests.)
- The same applies to the clients. In hawkular-client-ruby, we have a recorded set of HTTP calls (VCR cassettes). This works fine and it is great, but today we have a mix of recorded versions (at least for the hawkular-services part), so it would be great if that could be normalized. For example, for each new release of hawkular-services, the CI could run the tests to validate whether the recordings are still valid.
- The same situation exists on the ManageIQ side: there are VCR cassettes recorded for different versions of hawkular-services (we have started to annotate the version). The tests can pass, but that doesn't mean they will pass against the latest hawkular-services version. So an action item could be that when the hawkular-client-ruby version changes, the VCR tests are re-recorded against the validated hawkular-services version.
These are just some ideas to start discussing.
I guess that the CI (Travis, or torii internally) can help to some degree, but I lack experience with them.
In any case, the goal is not to slow down development, but to have an early indicator: if a change in a component affects the overall system, we are notified and can address it better than if it surfaces during a demo or the final stage of a QE task, for example.
Any thoughts about these ideas are welcome.
6 years, 7 months