HOSA now limits the number of metrics per pod; new agent metrics added
by John Mazzitelli
FYI: New enhancement to Hawkular OpenShift Agent (HOSA).
To prevent a misconfigured or malicious pod from flooding HOSA and H-Metrics with large amounts of metric data, HOSA now supports a "max_metrics_per_pod" setting in the agent's global configuration (a sample config snippet is sketched below). Its default is 50. Any pod that asks the agent to collect more than that (summed across all of its endpoints) will be throttled, and only up to the maximum number of metrics will be stored for that pod. Note: when I say "metrics" here I do not mean datapoints - this limits the number of unique metric IDs allowed to be stored per pod.
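For illustration, here is a minimal sketch of what that might look like in the agent configuration. The post only says this is a global agent setting, so the exact section it lives under here is an assumption - check the agent docs for your version:

collector:
  # assumed location for the global setting; 50 is the stated default
  max_metrics_per_pod: 50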
If you enable the status endpoint, you'll see something like this in the YAML report when the max limit is reached for the endpoint in question:
openshift-infra|the-pod-name-73fgt|prometheus|http://172.19.0.5:8080/metrics: METRIC
LIMIT EXCEEDED. Last collection at [Sat, 11 Feb 2017 13:46:44 +0000] gathered
[54] metrics, [4] were discarded, in [1.697787ms]
A warning will also be logged in the log file:
"Reached max limit of metrics for [openshift-infra|the-pod-name-73fgt|prometheus|http://172.19.0.5:8080/metrics] - discarding [4] collected metrics"
(As part of this change, the status endpoint now also shows the number of metrics collected from each endpoint under each pod. This is not the total number of datapoints; it is the number of unique metric IDs, so it will always be <= the max metrics per pod.)
Finally, the agent now collects and emits 4 metrics of its own (in addition to all the other Go-related ones like memory used, etc.). They are (an example scrape is sketched after the list):
1 Counter:
hawkular_openshift_agent_metric_data_points_collected_total
The total number of individual metric data points collected from all endpoints.
3 Gauges:
hawkular_openshift_agent_monitored_pods
The number of pods currently being monitored.
hawkular_openshift_agent_monitored_endpoints
The number of endpoints currently being monitored.
hawkular_openshift_agent_monitored_metrics
The total number of metrics currently being monitored across all endpoints.
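To see these alongside the standard Go runtime metrics, you can scrape the agent's own Prometheus endpoint. A minimal sketch - the host and port are assumptions and depend on how your emitter is exposed:

curl http://<agent-pod-ip>:8080/metrics | grep hawkular_openshift_agent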
All of this is in master and will be in the next HOSA release, which I hope to do this weekend.
Hawkular Metrics 0.24.0 - Release
by Stefan Negrea
Hello,
I am happy to announce release 0.24.0 of Hawkular Metrics. This release is
anchored by a new tag query language and general stability improvements.
Here is a list of major changes:
- *Tag Query Language*
- A query language was added to support complex constructs in tag-based
queries for metrics
- The old tag query syntax is deprecated but can still be used; the
new syntax takes precedence
- The new syntax supports:
- logical operators: AND, OR
- equality operators: =, !=
- value-in-array operators: IN, NOT IN
- existential conditions:
- a tag without any operator is equivalent to = '*'
- a tag preceded by the NOT operator matches only instances
without that tag defined
- all values enclosed in single quotes are treated as regular
expressions
- simple text values do not need single quotes
- spaces before and after the equality operators are optional
- For more details please see: Pull Request 725
<https://github.com/hawkular/hawkular-metrics/pull/725>,
HWKMETRICS-523 <https://issues.jboss.org/browse/HWKMETRICS-523>
- Sample queries (an example REST call is sketched after this list):
a1 = 'bcd' OR a2 != 'efg'
a1='bcd' OR a2!='efg'
a1 = efg AND ( a2 = 'hijk' OR a2 = 'xyz' )
a1 = 'efg' AND ( a2 IN ['hijk', 'xyz'] )
a1 = 'efg' AND a2 NOT IN ['hijk']
a1 = 'd' OR ( a1 != 'ab' AND ( c1 = '*' ) )
a1 OR a2
NOT a1 AND a2
a1 = 'a' AND NOT b2
a1 = a AND NOT b2
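As an illustration, a query expression like the ones above could be passed to the REST API via the tags parameter. This is only a sketch - the endpoint, parameter name, tenant header value, and host are assumptions here; see the Hawkular Metrics REST documentation for the authoritative syntax:

curl -G -H "Hawkular-Tenant: my-tenant" \
     --data-urlencode "tags=a1 = 'bcd' OR a2 != 'efg'" \
     http://localhost:8080/hawkular/metrics/gauges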
- *Performance*
- Updated compaction strategies for data tables from size tiered
compaction (STCS) to time window compaction (TWCS) (HWKMETRICS-556
<https://issues.jboss.org/browse/HWKMETRICS-556>)
- Jobs now execute on RxJava's I/O scheduler thread pool (
HWKMETRICS-579 <https://issues.jboss.org/browse/HWKMETRICS-579>)
- *Administration*
- The admin tenant is now configurable via the ADMIN_TENANT environment
variable (HWKMETRICS-572
<https://issues.jboss.org/browse/HWKMETRICS-572>)
- Internal metric collection is disabled by default (HWKMETRICS-578
<https://issues.jboss.org/browse/HWKMETRICS-578>)
- Resolved a null pointer exception in DropWizardReporter due to
admin tenant changes (HWKMETRICS-577
<https://issues.jboss.org/browse/HWKMETRICS-577>)
- *Job Scheduler*
- Resolved an issue where the compression job would stop running
after a few days (HWKMETRICS-564
<https://issues.jboss.org/browse/HWKMETRICS-564>)
- Updated the job scheduler to renew job locks during job execution (
HWKMETRICS-570 <https://issues.jboss.org/browse/HWKMETRICS-570>)
- Updated the job scheduler to reacquire job lock after server
restarts (HWKMETRICS-583
<https://issues.jboss.org/browse/HWKMETRICS-583>)
- *Hawkular Alerting - Major Updates*
- Resolved several issues where schema upgrades were not applied
after the initial schema install (HWKALERTS-220
<https://issues.jboss.org/browse/HWKALERTS-220>, HWKALERTS-222
<https://issues.jboss.org/browse/HWKALERTS-222>)
*Hawkular Alerting - Included*
- Version 1.5.1
<https://issues.jboss.org/projects/HWKALERTS/versions/12333065>
- Project details and repository: Github
<https://github.com/hawkular/hawkular-alerts>
- Documentation: REST API
<http://www.hawkular.org/docs/rest/rest-alerts.html>, Examples
<https://github.com/hawkular/hawkular-alerts/tree/master/examples>,
Developer
Guide
<http://www.hawkular.org/community/docs/developer-guide/alerts.html>
*Hawkular Metrics Clients*
- Python: https://github.com/hawkular/hawkular-client-python
- Go: https://github.com/hawkular/hawkular-client-go
- Ruby: https://github.com/hawkular/hawkular-client-ruby
- Java: https://github.com/hawkular/hawkular-client-java
*Release Links*
Github Release:
https://github.com/hawkular/hawkular-metrics/releases/tag/0.24.0
JBoss Nexus Maven artifacts:
http://origin-repository.jboss.org/nexus/content/repositories/public/org/hawkular/metrics/
Jira release tracker:
https://issues.jboss.org/projects/HWKMETRICS/versions/12332966
A big "Thank you" goes to John Sanda, Matt Wringe, Michael Burman, Joel
Takvorian, Jay Shaughnessy, Lucas Ponce, and Heiko Rupp for their project
contributions.
Thank you,
Stefan Negrea
[metrics] configurable data retention
by John Sanda
Pretty much from the start of the project we have provided configurable data retention. There is a system-wide default retention that can be set at start up. You can also set the data retention per tenant as well as per individual metric. Do we need to provide this fine-grained level of configurability, or is it sufficient to only have a system-wide data retention which is configurable?
It is worth noting that in OpenShift *only* the system-wide data retention is set. Recently we have been dealing with a number of production issues including:
* Cassandra crashing with an OutOfMemoryError
* Stats queries failing in Hawkular Metrics due to high read latencies in Cassandra
* Expired data not getting purged in a timely fashion
These issues all involve compaction. In older versions of Hawkular Metrics we were using the default size-tiered compaction strategy (STCS). The time-window compaction strategy (TWCS) is better suited for time series data such as ours. We are already seeing good results in some early testing. Using the correct, properly configured compaction strategy can have a significant impact on several things, including:
* I/O usage
* cpu usage
* read performance
* disk usage
TWCS was developed for some very specific use cases, which happen to be common with Cassandra. TWCS is recommended for time series that meet the following criteria (a rough table-level sketch follows the list):
* append-only writes
* no deletes
* a global (i.e., table-wide) TTL
* few out-of-order writes (they should be the exception, not the norm)
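To make the table-wide TTL point concrete, here is roughly what pairing TWCS with a global TTL looks like in CQL, run through cqlsh. This is only a sketch - the keyspace/table names, window settings, and TTL value are illustrative assumptions, not the actual Hawkular Metrics schema:

cqlsh -e "ALTER TABLE hawkular_metrics.data
          WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                             'compaction_window_unit': 'DAYS',
                             'compaction_window_size': 1}
          AND default_time_to_live = 604800;"   # TTL of 7 days, applied table-wide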
It is the third bullet, the global table-wide TTL, that has prompted this email. If we allow/support different TTLs per tenant and/or per metric, we will lose a lot of the benefits of TWCS and will likely continue to struggle with some of the issues we have been facing as of late. If you ask me exactly how well or poorly compaction will perform with mixed TTLs, I can only speculate. I simply do not have the bandwidth to test things that the C* docs and C* devs say *not* to do.
I am of the opinion, at least for OpenShift users, that disk usage is much more important than fine-grained data retentions. A big question I have is, what about outside of OpenShift? This may be a question for some people not on this list, so I want to make sure it does reach the right people.
I think we could potentially tie configurable data retention to rollups. Let's say we add support for 15min, 1hr, 6hr, and 24hr rollups, where each rollup is stored in its own table and each larger rollup has a longer retention. Different data retention levels could then determine which rollups a tenant gets. If a tenant wants a data retention of a month, for example, that could translate into generating 15min and 1hr rollups for that tenant.
- John
Upgrade to Wildfly 1.1.0.Final
by Lucas Ponce
Hello,
Is there any objection or potential problem if we upgrade from 1.0.0.Final to 1.1.0.Final?
While investigating a clustering issue, I found some related fixes that seem to be packaged in 1.1.0.Final.
I am still working on this, but I would like to know whether there is consensus on upgrading the Wildfly version in the parent.
Thanks,
Lucas
hosa - /metrics can be behind auth; 2 new metrics
by John Mazzitelli
[this is more for Matt W, but will post here]
Two new things in HOSA - these have been released in version 1.2.0.Final and are available on Docker Hub - see https://hub.docker.com/r/hawkular/hawkular-openshift-agent/tags/
1) The agent has its own metrics endpoint (so it can monitor itself). The endpoint is /metrics. This is nothing new.
But /metrics can now be put behind basic auth. If you configure this in the agent config, clients must authenticate to see the metrics:
emitter:
  metrics_credentials:
    username: foo
    password: bar
You can pass these in via environment variables, which means you can use OpenShift secrets for them.
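With that configured, a scrape has to supply the credentials. A minimal sketch - the host and port are assumptions and depend on how the emitter is exposed:

curl -u foo:bar http://<agent-pod-ip>:8080/metrics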
2) There are now two new metrics (both gauges) the agent itself emits:
hawkular_openshift_agent_monitored_pods (The number of pods currently being monitored)
hawkular_openshift_agent_monitored_endpoints (The number of endpoints currently being monitored)
That is all.
Re: [Hawkular-dev] Hawkular APM and instrumenting clojure
by Neil Okamoto
Since this morning I've had a server running inside docker on a separate
machine with more installed memory. I haven't seen any problems since then.
In retrospect I wish I had thought of this sooner.
For now I'm moving on from this problem to do a more complete
instrumentation of the clojure app. I'll keep an eye open for further
problems and I'll report back if there's anything noteworthy.
thanks Gary,
Neil
On Fri, Feb 3, 2017 at 7:53 AM, Neil Okamoto <neil.okamoto(a)gmail.com> wrote:
> Thanks Gary. I'll try running the server outside docker, but before I do
> that I'm going to run the container on a machine with more memory.
>
> > On Feb 3, 2017, at 7:15 AM, Gary Brown <gbrown(a)redhat.com> wrote:
> >
> > Hi Neil
> >
> > Sounds strange. Would it be possible to try running the server outside
> docker to see if there may be issues there.
> >
> > If you create the jira with reproducer then we will investigate aswell.
> >
> > Thanks for the additional info.
> >
> > Regards
> > Gary
> >
> > ----- Original Message -----
> >> Thanks Gary.
> >>
> >> On Fri, Feb 3, 2017 at 1:54 AM, Gary Brown < gbrown(a)redhat.com > wrote:
> >>
> >>
> >>
> >>> (1) Is using a "sampling.priority" of 1 merely advisory? It would
> explain
> >>> everything if those traces are meant to be dropped.
> >>
> >> If using the default constructor for APMTracer, then the default
> behaviour
> >> should be to trace all - and setting the sampling.priority to 1 should
> not
> >> override that. Could you try not setting this tag to see if there is any
> >> difference?
> >>
> >> I see. Well, I am using the default constructor, and I have tried with
> and
> >> without sampling.priority=1 and it's the same situation either way.
> >>
> >>
> >>
> >>> (2) Is there any convenient way I can see, with increased logging or
> >>> something, which traces are actually being sent from the client, and
> which
> >>> are actually received by the server?
> >>
> >> You could initially check the traces stored in Elasticsearch using
> something
> >> like: curl http://localhost:9200/apm-hawkular/trace/_search | python -m
> >> json.tool
> >>
> >> Right now I have a repl launched with HAWKULAR_APM_LOG_LEVEL set to
> FINEST.
> >> I'm creating spans in the repl as described earlier. Each time I create
> a
> >> trace I see a log entry from the client like this:
> >>
> >> FINEST: [TracePublisherRESTClient] [Thread[pool-2-thread-1,5,main]]
> Status
> >> code is: 204
> >>
> >> and that 204 would suggest the trace info was successfully sent. But
> inside
> >> the docker container I can curl Elasticsearch and those new traces are
> not
> >> to be found.
> >>
> >> Incidentally, I started the repl last night, did a few successful
> tests, and
> >> then closed the lid of my laptop for the night with the Hawkular
> container
> >> still running and the repl still running. I've also had this issue occur
> >> immediately on launch of the repl, so I don't think it's specifically
> about
> >> long running repls and/or sleeping, but for completeness I thought I
> would
> >> clarify how I am running this.
> >>
> >>> Do you have a pure Java example that reproduces the same issue? Might
> be
> >>> worth creating a jira in https://issues.jboss.org/projects/HWKAPM to
> track
> >>> the issue.
> >>
> >> No, not yet...
> >>
Hawkular APM and instrumenting clojure
by Neil Okamoto
As an experiment I'm instrumenting a service written in clojure using
opentracing-java. Through the clojure/java interop I've mostly succeeded
in getting trace information reported through to the Hawkular APM server.
I say "mostly succeeded" because sooner or later in every one of my hacking
sessions I get to the point where the spans I am creating in the app are no
longer reported in the web ui.
For convenience I'm using the Hawkular dev docker image
<https://hub.docker.com/r/jboss/hawkular-apm-server-dev>. In my test app
I'm doing nothing more than initializing an APMTracer
<https://github.com/hawkular/hawkular-apm/blob/master/client/opentracing/s...>
with
the appropriate environment variables set, and then calling
buildSpan("foo"), withTag("sampling.priority", 1), start(), sleep for a
while, and then finish(). All of the previous was done in clojure; I'm
talking in pseudocode here just to make the intent clear.
So like I said, sometimes these traces are reported, other times they seem
to be silently dropped. I can't detect any consistent pattern how or why
this happens...
(1) Is using a "sampling.priority" of 1 merely advisory? It would explain
everything if those traces are meant to be dropped.
(2) Is there any convenient way I can see, with increased logging or
something, which traces are actually being sent from the client, and which
are actually received by the server?
Docker image size does matter
by Jiri Kremser
Hello,
I was looking into the google/cadvisor docker image, which is only 47 megs,
and was wondering how we can improve. To some extent it is so small because
of Go, but not only because of that.
Here are the results:
- base image with JRE 8 and Alpine Linux: 76.8 MB
- wildfly 10.1.0.Final image: 215 MB
- hawkular-services image: 320 MB
Just for the record, here is the status quo:
- base CentOS image with JDK 8: 149 MB
- wildfly image: 580 MB
- hawkular-services image: 672 MB
All the mini-images are based on Alpine (which itself is based on BusyBox),
so the price for that is less convenience when debugging the images.
I also removed
9.2M /opt/jboss/wildfly/docs
and wanted to remove
9.0M /opt/jboss/wildfly/modules/system/layers/base/org/hibernate
5.1M /opt/jboss/wildfly/modules/system/layers/base/org/apache/lucene
5.6M /opt/jboss/wildfly/modules/system/layers/base/org/apache/cxf
but for some reason h-services fails to start because it couldn't find
some class from that hibernate module, so I put it back.
What also helped was squashing all the image layers into one. This makes the
download faster and possibly the image smaller. Applying docker-squash
[1] to the current h-services image saves ~50 MB.
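For reference, this is roughly how the squashing can be done; the image names and tag are illustrative and the exact docker-squash flags may differ between versions:

pip install docker-squash
docker-squash -t hawkular/hawkular-services:squashed hawkular/hawkular-services:latest
docker images | grep hawkular-services   # compare squashed vs. original size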
I am aware that this probably won't fly with the RH policy that we should
base our software on Fedora/RHEL base OS images, but I am going to use them
for development, also because I often run out of space because of Docker.
Oh and I haven't published it on dockerhub yet, but the repo is here [2]
jk
[1]: https://github.com/goldmann/docker-squash
[2]: https://github.com/Jiri-Kremser/hawkular-services-mini-dockerfiles