RxJava2 preliminary testing
by Michael Burman
Hi,
Yesterday evening and today I did some testing on how using RxJava2
would benefit us (I'm actually expecting more from RxJava 2.1, since it
has some enhanced parallelism features which we might benefit from).
Short notes from the RxJava2 migration: it's more painful than I assumed.
The code changes can be small in terms of lines of code changed, but
almost every method has had its signature or behavior changed. So I've
had to keep reading the documentation while doing things and try to
unlearn what I learned in RxJava1.
And all this comes with backwards-compatibility pressure for Java 6
(so you can't benefit from many Java 8 advantages). Reactive-Commons /
Reactor started from Java 8 to provide a cleaner implementation. Grr.
I wrote a simple write-path modification in PR #762 (metrics) that
writes Gauges using the micro-batching feature ported to RxJava2. There's
still some RxJavaInterOp use in it, so that might slow down the
performance a little bit.
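To give an idea of what the micro-batching looks like in RxJava2 terms, here's a minimal, self-contained sketch (not the actual PR code; writeBatch and the batch limits are made-up placeholders for the real Cassandra write):

import io.reactivex.Completable;
import io.reactivex.Flowable;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class MicroBatchSketch {
    // Stand-in for the real asynchronous write (a Cassandra batch in our case).
    static Completable writeBatch(List<Double> batch) {
        return Completable.fromAction(() ->
                System.out.println("writing batch of " + batch.size() + " points"));
    }

    public static void main(String[] args) throws InterruptedException {
        Flowable.interval(1, TimeUnit.MILLISECONDS)
                .map(i -> Math.random())                  // fake gauge values
                .buffer(50, TimeUnit.MILLISECONDS, 1000)  // flush every 50 ms or 1000 points
                .filter(batch -> !batch.isEmpty())
                .flatMapCompletable(MicroBatchSketch::writeBatch)
                .subscribe();
        Thread.sleep(500); // let a few batches flow through
    }
}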
However, it is possible to merge these two code paths, and there are
also some other optimizations I think could be worth it. I'd advise
against that, though, as reading the result gets quite complex. I would
almost suggest we do the MetricsServiceImpl/DataAccessImpl merging by
rewriting small parts at a time in a new class with RxJava2 and having
that call the old code through RxJavaInterOp. That way we could move
slowly to the newer codebase.
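As a rough illustration of that incremental approach (class and method names here are illustrative, not our actual code, and it assumes the akarnokd rxjava2-interop bridge on the classpath):

import hu.akarnokd.rxjava.interop.RxJavaInterop;
import io.reactivex.Flowable;

public class InteropSketch {
    // Pretend this is an existing RxJava1 method we don't want to rewrite yet.
    static rx.Observable<String> legacyFindGaugeNames() {
        return rx.Observable.just("gauge1", "gauge2");
    }

    // New RxJava2 code calls the old implementation through the interop bridge.
    public static void main(String[] args) {
        Flowable<String> names = RxJavaInterop.toV2Flowable(legacyFindGaugeNames());
        names.map(String::toUpperCase)
             .subscribe(System.out::println);
    }
}

Once a whole call path has been rewritten, the toV2Flowable/toV1Observable conversions can simply be dropped.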
I fixed the JMH benchmarks (they're not compiled in our CI and had
actually been broken by some other PRs) and ran some tests. These are the
tests that measure only the metrics-core-service performance and do not
touch the REST interface (or Wildfly) at all, which gives a better
comparison of how our internal changes behave.
What I'm seeing is around a 20-30% difference in performance when writing
gauges this way. So this should offset some of the issues we saw when we
improved error handling (which caused a performance degradation). I did
run into HWKMETRICS-542 (BusyPoolException), so the tests were run
with 1024 connections.
I'll continue with some more testing next week, but so far this shows
that the micro-batching features do improve performance in the internal
processing, especially when there's a small number of writers to a single
node. Testing those features could probably benefit from more
benchmark tests without WildFly (which takes so much processing power
that most performance improvements can't be measured correctly anymore).
- Micke
HOSA and conversion from prometheus to hawkular metrics
by John Mazzitelli
The past several days I've been working on an enhancement to HOSA that came in from the community (in fact, I would consider it a bug). I'm about ready to merge the PR [1] for this and do a HOSA 1.1.0.Final release. I wanted to post this to announce it and see if there is any feedback, too.
Today, HOSA collects metrics from any Prometheus endpoint which you declare - example:
metrics:
- name: go_memstats_sys_bytes
- name: process_max_fds
- name: process_open_fds
But if a Prometheus metric has labels, Prometheus itself considers each unique combination of labels to be an individual time series. This is different from how Hawkular Metrics works: each Hawkular Metrics metric ID (even if its metric definition or its datapoints have tags) is a single time series. We need to account for this difference. For example, if our agent is configured with:
metrics:
- name: jvm_memory_pool_bytes_committed
And the Prometheus endpoint emits that metric with a label called "pool" like this:
jvm_memory_pool_bytes_committed{pool="Code Cache",} 2.7787264E7
jvm_memory_pool_bytes_committed{pool="PS Eden Space",} 2.3068672E7
then to Prometheus this is actually 2 time series (the number of bytes committed per pool type), not 1. Even though the metric name is the same (what Prometheus calls a "metric family name"), there are two unique combinations of labels, one with "Code Cache" and one with "PS Eden Space", so they are 2 distinct time series.
Today, the agent creates only a single Hawkular Metrics metric in this case, with each datapoint tagged with the appropriate Prometheus labels. But we don't want to aggregate them like that, since we lose the granularity that the Prometheus endpoint gives us (that is, the number of bytes committed in each pool type). I will say I think we might be able to get that granularity back through datapoint tag queries in Hawkular Metrics, but I don't know how well (if at all) that is supported, how efficient such queries would be even if supported, or how efficient storage would be if we tag every datapoint with these labels (I'm not sure that is the general purpose of tags in H-Metrics). Regardless, the fact that these really are different time series should (IMO) be represented as different time series (via metric definitions/metric IDs) in Hawkular Metrics.
To support labeled Prometheus endpoint data like this, the agent needs to split this one named metric into N Hawkular Metrics metrics (where N is the number of unique label combinations for that named metric). So even though the agent is configured with the one metric "jvm_memory_pool_bytes_committed", we actually need to create two Hawkular Metrics metric definitions (with two different, unique metric IDs, obviously).
The PR [1] that is ready to go does this. By default it will create multiple metric definitions/metric IDs of the form "metric-family-name{labelName1=labelValue1,labelName2=labelValue2,...}". If you want a different form, you can define an "id" and put "${labelName}" placeholders in the ID you declare (such as "${oneLabelName}_my_own_metric_name_${theOtherLabelName}" or whatever). But I suspect the default format will be what most people want, so usually nothing needs to be done. In the above example, two metric definitions with the following IDs are created:
1. jvm_memory_pool_bytes_committed{pool=Code Cache}
2. jvm_memory_pool_bytes_committed{pool=PS Eden Space}
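And if you prefer your own IDs, a configuration along these lines (a hypothetical example of the "${labelName}" placeholder form described above) would do it:
metrics:
- name: jvm_memory_pool_bytes_committed
  id: jvm_committed_bytes_${pool}
which would produce IDs like "jvm_committed_bytes_Code Cache" and "jvm_committed_bytes_PS Eden Space" instead of the defaults.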
--John Mazz
[1] https://github.com/hawkular/hawkular-openshift-agent/pull/117
Collecting PV usage?
by Thomas Heute
Mazz,
in your metric collection adventure for HOSA, have you come across a way
to see the usage of PVs attached to a pod?
Users should be able to see (visualize) how much of a PV is used and
then be alerted if it reaches a certain percentage.
Thomas
HOSA now limits amount of metrics per pod; new agent metrics added
by John Mazzitelli
FYI: New enhancement to Hawkular OpenShift Agent (HOSA).
To keep a misconfigured or malicious pod from flooding HOSA and H-Metrics with large amounts of metric data, HOSA has now been enhanced to support a "max_metrics_per_pod" setting in the agent's global configuration. Its default is 50. Any pod that asks the agent to collect more than that (sum total across all of its endpoints) will be throttled down, and only the maximum number of metrics will be stored for that pod. (Note: when I say "metrics" here I do not mean datapoints; this limits the number of unique metric IDs allowed to be stored per pod.)
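In the global configuration that looks something like this (I'm assuming here that the setting sits in the collector section alongside the other collection settings; check the PR for the exact placement):
collector:
  max_metrics_per_pod: 50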
If you enable the status endpoint, you'll see this in the yaml report when a max limit is reached for the endpoint in question:
openshift-infra|the-pod-name-73fgt|prometheus|http://172.19.0.5:8080/metrics: METRIC
LIMIT EXCEEDED. Last collection at [Sat, 11 Feb 2017 13:46:44 +0000] gathered
[54] metrics, [4] were discarded, in [1.697787ms]
A warning will also be logged in the log file:
"Reached max limit of metrics for [openshift-infra|the-pod-name-73fgt|prometheus|http://172.19.0.5:8080/metrics] - discarding [4] collected metrics"
(As part of this code change, the status endpoint was also enhanced to show the number of metrics collected from each endpoint under each pod. This is not the total number of datapoints; it is the number of unique metric IDs, which will always be <= the max metrics per pod.)
Finally, the agent now collects and emits 4 metrics of its own (in addition to all the other "go" related ones like memory used, etc). They are:
1 Counter:
hawkular_openshift_agent_metric_data_points_collected_total
The total number of individual metric data points collected from all endpoints.
3 Gauges:
hawkular_openshift_agent_monitored_pods
The number of pods currently being monitored.
hawkular_openshift_agent_monitored_endpoints
The number of endpoints currently being monitored.
hawkular_openshift_agent_monitored_metrics
The total number of metrics currently being monitored across all endpoints.
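On the agent's own Prometheus-format metrics endpoint these show up as ordinary exposition lines, for example (values made up):
hawkular_openshift_agent_metric_data_points_collected_total 12345
hawkular_openshift_agent_monitored_pods 7
hawkular_openshift_agent_monitored_endpoints 9
hawkular_openshift_agent_monitored_metrics 140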
All of this is in master and will be in the next HOSA release, which I hope to do this weekend.
Hawkular Metrics 0.24.0 - Release
by Stefan Negrea
Hello,
I am happy to announce release 0.24.0 of Hawkular Metrics. This release is
anchored by a new tag query language and general stability improvements.
Here is a list of major changes:
- *Tag Query Language*
- A query language was added to support complex constructs for tag
based queries for metrics
- The old tag query syntax is deprecated but can still be used; the
new syntax takes precedence
- The new syntax supports:
- logical operators: AND, OR
- equality operators: =, !=
- value in array operators: IN, NOT IN
- existential conditions:
- tag without any operator is equivalent to = '*'
- tag preceded by the NOT operator matches only instances
without the tag defined
- all values in between single quotes are treated as regular expressions
- simple text values do not need single quotes
- spaces before and after equality operators are not necessary
- For more details please see: Pull Request 725
<https://github.com/hawkular/hawkular-metrics/pull/725>,
HWKMETRICS-523 <https://issues.jboss.org/browse/HWKMETRICS-523>
- Sample queries:
a1 = 'bcd' OR a2 != 'efg'
a1='bcd' OR a2!='efg'
a1 = efg AND ( a2 = 'hijk' OR a2 = 'xyz' )
a1 = 'efg' AND ( a2 IN ['hijk', 'xyz'] )
a1 = 'efg' AND a2 NOT IN ['hijk']
a1 = 'd' OR ( a1 != 'ab' AND ( c1 = '*' ) )
a1 OR a2
NOT a1 AND a2
a1 = 'a' AND NOT b2
a1 = a AND NOT b2
- *Performance*
- Updated compaction strategies for data tables from size tiered
compaction (STCS) to time window compaction (TWCS) (HWKMETRICS-556
<https://issues.jboss.org/browse/HWKMETRICS-556>)
- Jobs now execute on RxJava's I/O scheduler thread pool (
HWKMETRICS-579 <https://issues.jboss.org/browse/HWKMETRICS-579>)
- *Administration*
- The admin tenant is now configurable via ADMIN_TENANT environment
variable (HWKMETRICS-572
<https://issues.jboss.org/browse/HWKMETRICS-572>)
- Internal metric collection is disabled by default (HWKMETRICS-578
<https://issues.jboss.org/browse/HWKMETRICS-578>)
- Resolved a null pointer exception in DropWizardReporter due to
admin tenant changes (HWKMETRICS-577
<https://issues.jboss.org/browse/HWKMETRICS-577>)
- *Job Scheduler*
- Resolved an issue where the compression job would stop running
after a few days (HWKMETRICS-564
<https://issues.jboss.org/browse/HWKMETRICS-564>)
- Updated the job scheduler to renew job locks during job execution (
HWKMETRICS-570 <https://issues.jboss.org/browse/HWKMETRICS-570>)
- Updated the job scheduler to reacquire job lock after server
restarts (HWKMETRICS-583
<https://issues.jboss.org/browse/HWKMETRICS-583>)
- *Hawkular Alerting - Major Updates*
- Resolved several issues where schema upgrades were not applied
after the initial schema install (HWKALERTS-220
<https://issues.jboss.org/browse/HWKALERTS-220>, HWKALERTS-222
<https://issues.jboss.org/browse/HWKALERTS-222>)
*Hawkular Alerting - Included*
- Version 1.5.1
<https://issues.jboss.org/projects/HWKALERTS/versions/12333065>
- Project details and repository: Github
<https://github.com/hawkular/hawkular-alerts>
- Documentation: REST API
<http://www.hawkular.org/docs/rest/rest-alerts.html>, Examples
<https://github.com/hawkular/hawkular-alerts/tree/master/examples>,
Developer Guide
<http://www.hawkular.org/community/docs/developer-guide/alerts.html>
*Hawkular Metrics Clients*
- Python: https://github.com/hawkular/hawkular-client-python
- Go: https://github.com/hawkular/hawkular-client-go
- Ruby: https://github.com/hawkular/hawkular-client-ruby
- Java: https://github.com/hawkular/hawkular-client-java
*Release Links*
Github Release:
https://github.com/hawkular/hawkular-metrics/releases/tag/0.24.0
JBoss Nexus Maven artifacts:
http://origin-repository.jboss.org/nexus/content/repositories/public/org/hawkular/metrics/
Jira release tracker:
https://issues.jboss.org/projects/HWKMETRICS/versions/12332966
A big "Thank you" goes to John Sanda, Matt Wringe, Michael Burman, Joel
Takvorian, Jay Shaughnessy, Lucas Ponce, and Heiko Rupp for their project
contributions.
Thank you,
Stefan Negrea
[metrics] configurable data retention
by John Sanda
Pretty much from the start of the project we have provided configurable data retention. There is a system-wide default retention that can be set at start-up, and you can also set the data retention per tenant as well as per individual metric. Do we need to provide this fine-grained level of configurability, or is it sufficient to have only a configurable system-wide data retention?
It is worth noting that in OpenShift *only* the system-wide data retention is set. Recently we have been dealing with a number of production issues including:
* Cassandra crashing with an OutOfMemoryError
* Stats queries failing in Hawkular Metrics due to high read latencies in Cassandra
* Expired data not getting purged in a timely fashion
These issues all involve compaction. In older versions of Hawkular Metrics we were using the default size-tiered compaction strategy (STCS). The time window compaction strategy (TWCS) is better suited for time series data such as ours, and we are already seeing good results with some early testing. Using the correct, properly configured compaction strategy can have a significant impact on several things, including:
* I/O usage
* cpu usage
* read performance
* disk usage
TWCS was developed for some very specific, though common, Cassandra use cases. It is recommended for time series that meet the following criteria:
* append-only writes
* no deletes
* global (i.e., table-wide) TTL
* few out of order writes (at least it is the exception and not the norm)
It is the third bullet, the global TTL, which has prompted this email. If we allow/support different TTLs per tenant and/or per metric, we will lose a lot of the benefits of TWCS and will likely continue to struggle with some of the issues we have been facing as of late. If you ask me exactly how well or poorly compaction will perform using mixed TTLs, I can only speculate. I simply do not have the bandwidth to test things that the C* docs and C* devs say *not* to do.
I am of the opinion, at least for OpenShift users, that disk usage is much more important than fine-grained data retentions. A big question I have is, what about outside of OpenShift? This may be a question for some people not on this list, so I want to make sure it does reach the right people.
I think we could potentially tie configurable data retention together with rollups. Let's say we add support for 15min, 1hr, 6hr, and 24hr rollups, where each rollup is stored in its own table and each larger rollup has a larger retention. Different levels of data retention could then be used to determine which rollups a tenant gets. If a tenant wants a data retention of a month, for example, that could translate into generating 15min and 1hr rollups for that tenant.
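To make that concrete, the mapping could be as simple as something like this (names and thresholds are purely illustrative, not a proposal for actual values):

import java.time.Duration;
import java.util.EnumSet;
import java.util.Set;

public class RollupPolicySketch {

    enum Rollup { MIN_15, HR_1, HR_6, HR_24 }

    // Pick the rollups to generate for a tenant from its retention alone,
    // so the raw data tables can keep a single table-wide TTL for TWCS.
    static Set<Rollup> rollupsFor(Duration retention) {
        if (retention.toDays() <= 7) {
            return EnumSet.of(Rollup.MIN_15);
        }
        if (retention.toDays() <= 31) {
            return EnumSet.of(Rollup.MIN_15, Rollup.HR_1);
        }
        return EnumSet.allOf(Rollup.class);
    }

    public static void main(String[] args) {
        System.out.println(rollupsFor(Duration.ofDays(30))); // prints [MIN_15, HR_1]
    }
}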
- John
Upgrade to Wildfly 1.1.0.Final
by Lucas Ponce
Hello,
Is there any objection or potential problem if we upgrade from 1.0.0.Final to 1.1.0.Final?
While investigating a clustering issue, I found some related fixes that seem to be packaged in 1.1.0.Final.
I am still working on this, but I would like to know whether upgrading the Wildfly version in the parent would have consensus.
Thanks,
Lucas
hosa - /metrics can be behind auth; 2 new metrics
by John Mazzitelli
[this is more for Matt W, but will post here]
Two new things in HOSA. These have been released under the 1.2.0.Final version and are available on Docker Hub - see https://hub.docker.com/r/hawkular/hawkular-openshift-agent/tags/
1) The Hawkular OpenShift Agent has its own metrics endpoint (so it can monitor itself). The endpoint is /metrics. This is nothing new.
But /metrics can now be configured to sit behind basic auth. If you configure this in the agent config, you must authenticate to see the metrics:
emitter:
  metrics_credentials:
    username: foo
    password: bar
You can pass these in via environment variables, and thus you can use OpenShift secrets for them.
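For example (hypothetical variable names; the exact expansion syntax depends on how the agent config references environment variables, so treat this as a sketch):
emitter:
  metrics_credentials:
    username: ${EMITTER_METRICS_USERNAME}
    password: ${EMITTER_METRICS_PASSWORD}
with the two variables populated from an OpenShift secret in the deployment.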
2) There are now two new metrics (both gauges) the agent itself emits:
hawkular_openshift_agent_monitored_pods (The number of pods currently being monitored)
hawkular_openshift_agent_monitored_endpoints (The number of endpoints currently being monitored)
That is all.