HOSA now limits the number of metrics per pod; new agent metrics added
by John Mazzitelli
FYI: New enhancement to Hawkular OpenShift Agent (HOSA).
To prevent a misconfigured or malicious pod from flooding HOSA and H-Metrics with large amounts of metric data, HOSA now supports a "max_metrics_per_pod" setting in the agent's global configuration (a sample config snippet is sketched below). Its default is 50. Any pod that asks the agent to collect more than that (summed across all of its endpoints) will be throttled, and only up to the maximum number of metrics will be stored for that pod. Note: when I say "metrics" here I do not mean datapoints - this limits the number of unique metric IDs allowed to be stored per pod.
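For illustration, here is a minimal sketch of what that might look like in the agent configuration. The post only says this is a global agent setting, so the exact section it lives under here is an assumption - check the agent docs for your version:

collector:
  # assumed location for the global setting; 50 is the stated default
  max_metrics_per_pod: 50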
If you enable the status endpoint, you'll see something like this in the YAML report when the max limit is reached for the endpoint in question:
openshift-infra|the-pod-name-73fgt|prometheus|http://172.19.0.5:8080/metrics: METRIC
LIMIT EXCEEDED. Last collection at [Sat, 11 Feb 2017 13:46:44 +0000] gathered
[54] metrics, [4] were discarded, in [1.697787ms]
A warning will also be logged in the log file:
"Reached max limit of metrics for [openshift-infra|the-pod-name-73fgt|prometheus|http://172.19.0.5:8080/metrics] - discarding [4] collected metrics"
(As part of this change, the status endpoint now also shows the number of metrics collected from each endpoint under each pod. This is not the total number of datapoints; it is the number of unique metric IDs, so it will always be <= the max metrics per pod.)
Finally, the agent now collects and emits 4 metrics of its own (in addition to all the other Go-related ones like memory used, etc.). They are (an example scrape is sketched after the list):
1 Counter:
hawkular_openshift_agent_metric_data_points_collected_total
The total number of individual metric data points collected from all endpoints.
3 Gauges:
hawkular_openshift_agent_monitored_pods
The number of pods currently being monitored.
hawkular_openshift_agent_monitored_endpoints
The number of endpoints currently being monitored.
hawkular_openshift_agent_monitored_metrics
The total number of metrics currently being monitored across all endpoints.
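To see these alongside the standard Go runtime metrics, you can scrape the agent's own Prometheus endpoint. A minimal sketch - the host and port are assumptions and depend on how your emitter is exposed:

curl http://<agent-pod-ip>:8080/metrics | grep hawkular_openshift_agent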
All of this is in master and will be in the next HOSA release, which I hope to do this weekend.
Hawkular Metrics 0.24.0 - Release
by Stefan Negrea
Hello,
I am happy to announce release 0.24.0 of Hawkular Metrics. This release is
anchored by a new tag query language and general stability improvements.
Here is a list of major changes:
- *Tag Query Language*
- A query language was added to support complex constructs in tag-based
queries for metrics
- The old tag query syntax is deprecated but can still be used; the
new syntax takes precedence
- The new syntax supports:
- logical operators: AND, OR
- equality operators: =, !=
- value-in-array operators: IN, NOT IN
- existential conditions:
- a tag without any operator is equivalent to = '*'
- a tag preceded by the NOT operator matches only instances
without that tag defined
- all values enclosed in single quotes are treated as regular
expressions
- simple text values do not need single quotes
- spaces before and after the equality operators are optional
- For more details please see: Pull Request 725
<https://github.com/hawkular/hawkular-metrics/pull/725>,
HWKMETRICS-523 <https://issues.jboss.org/browse/HWKMETRICS-523>
- Sample queries (an example REST call is sketched after this list):
a1 = 'bcd' OR a2 != 'efg'
a1='bcd' OR a2!='efg'
a1 = efg AND ( a2 = 'hijk' OR a2 = 'xyz' )
a1 = 'efg' AND ( a2 IN ['hijk', 'xyz'] )
a1 = 'efg' AND a2 NOT IN ['hijk']
a1 = 'd' OR ( a1 != 'ab' AND ( c1 = '*' ) )
a1 OR a2
NOT a1 AND a2
a1 = 'a' AND NOT b2
a1 = a AND NOT b2
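As an illustration, a query expression like the ones above could be passed to the REST API via the tags parameter. This is only a sketch - the endpoint, parameter name, tenant header value, and host are assumptions here; see the Hawkular Metrics REST documentation for the authoritative syntax:

curl -G -H "Hawkular-Tenant: my-tenant" \
     --data-urlencode "tags=a1 = 'bcd' OR a2 != 'efg'" \
     http://localhost:8080/hawkular/metrics/gauges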
- *Performance*
- Updated compaction strategies for data tables from size tiered
compaction (STCS) to time window compaction (TWCS) (HWKMETRICS-556
<https://issues.jboss.org/browse/HWKMETRICS-556>)
- Jobs now execute on RxJava's I/O scheduler thread pool (
HWKMETRICS-579 <https://issues.jboss.org/browse/HWKMETRICS-579>)
- *Administration*
- The admin tenant is now configurable via the ADMIN_TENANT environment
variable (HWKMETRICS-572
<https://issues.jboss.org/browse/HWKMETRICS-572>)
- Internal metric collection is disabled by default (HWKMETRICS-578
<https://issues.jboss.org/browse/HWKMETRICS-578>)
- Resolved a null pointer exception in DropWizardReporter due to
admin tenant changes (HWKMETRICS-577
<https://issues.jboss.org/browse/HWKMETRICS-577>)
- *Job Scheduler*
- Resolved an issue where the compression job would stop running
after a few days (HWKMETRICS-564
<https://issues.jboss.org/browse/HWKMETRICS-564>)
- Updated the job scheduler to renew job locks during job execution (
HWKMETRICS-570 <https://issues.jboss.org/browse/HWKMETRICS-570>)
- Updated the job scheduler to reacquire job lock after server
restarts (HWKMETRICS-583
<https://issues.jboss.org/browse/HWKMETRICS-583>)
- *Hawkular Alerting - Major Updates*
- Resolved several issues where schema upgrades were not applied
after the initial schema install (HWKALERTS-220
<https://issues.jboss.org/browse/HWKALERTS-220>, HWKALERTS-222
<https://issues.jboss.org/browse/HWKALERTS-222>)
*Hawkular Alerting - Included*
- Version 1.5.1
<https://issues.jboss.org/projects/HWKALERTS/versions/12333065>
- Project details and repository: Github
<https://github.com/hawkular/hawkular-alerts>
- Documentation: REST API
<http://www.hawkular.org/docs/rest/rest-alerts.html>, Examples
<https://github.com/hawkular/hawkular-alerts/tree/master/examples>,
Developer
Guide
<http://www.hawkular.org/community/docs/developer-guide/alerts.html>
*Hawkular Metrics Clients*
- Python: https://github.com/hawkular/hawkular-client-python
- Go: https://github.com/hawkular/hawkular-client-go
- Ruby: https://github.com/hawkular/hawkular-client-ruby
- Java: https://github.com/hawkular/hawkular-client-java
*Release Links*
Github Release:
https://github.com/hawkular/hawkular-metrics/releases/tag/0.24.0
JBoss Nexus Maven artifacts:
http://origin-repository.jboss.org/nexus/content/repositories/public/org/hawkular/metrics/
Jira release tracker:
https://issues.jboss.org/projects/HWKMETRICS/versions/12332966
A big "Thank you" goes to John Sanda, Matt Wringe, Michael Burman, Joel
Takvorian, Jay Shaughnessy, Lucas Ponce, and Heiko Rupp for their project
contributions.
Thank you,
Stefan Negrea
[metrics] configurable data retention
by John Sanda
Pretty much from the start of the project we have provided configurable data retention. There is a system-wide default retention that can be set at start up. You can also set the data retention per tenant as well as per individual metric. Do we need to provide this fine-grained level of configurability, or is it sufficient to only have a system-wide data retention which is configurable?
It is worth noting that in OpenShift *only* the system-wide data retention is set. Recently we have been dealing with a number of production issues including:
* Cassandra crashing with an OutOfMemoryError
* Stats queries failing in Hawkular Metrics due to high read latencies in Cassandra
* Expired data not getting purged in a timely fashion
These issues all involve compaction. In older versions of Hawkular Metrics we were using the default size-tiered compaction strategy (STCS). The time-window compaction strategy (TWCS) is better suited for time series data such as ours. We are already seeing good results in some early testing. Using the correct, properly configured compaction strategy can have a significant impact on several things, including:
* I/O usage
* cpu usage
* read performance
* disk usage
TWCS was developed for some very specific use cases, which happen to be common with Cassandra. TWCS is recommended for time series that meet the following criteria (a rough table-level sketch follows the list):
* append-only writes
* no deletes
* a global (i.e., table-wide) TTL
* few out-of-order writes (they should be the exception, not the norm)
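To make the table-wide TTL point concrete, here is roughly what pairing TWCS with a global TTL looks like in CQL, run through cqlsh. This is only a sketch - the keyspace/table names, window settings, and TTL value are illustrative assumptions, not the actual Hawkular Metrics schema:

cqlsh -e "ALTER TABLE hawkular_metrics.data
          WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                             'compaction_window_unit': 'DAYS',
                             'compaction_window_size': 1}
          AND default_time_to_live = 604800;"   # TTL of 7 days, applied table-wide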
It is the third bullet, the global table-wide TTL, that has prompted this email. If we allow/support different TTLs per tenant and/or per metric, we will lose a lot of the benefits of TWCS and will likely continue to struggle with some of the issues we have been facing as of late. If you ask me exactly how well or poorly compaction will perform with mixed TTLs, I can only speculate. I simply do not have the bandwidth to test things that the C* docs and C* devs say *not* to do.
I am of the opinion, at least for OpenShift users, that disk usage is much more important than fine-grained data retentions. A big question I have is, what about outside of OpenShift? This may be a question for some people not on this list, so I want to make sure it does reach the right people.
I think we could potentially tie configurable data retention to rollups. Let's say we add support for 15min, 1hr, 6hr, and 24hr rollups, where each rollup is stored in its own table and each larger rollup has a longer retention. Different data retention levels could then determine which rollups a tenant gets. If a tenant wants a data retention of a month, for example, that could translate into generating 15min and 1hr rollups for that tenant.
- John
Upgrade to Wildfly 1.1.0.Final
by Lucas Ponce
Hello,
Is there any objection or potential problem if we upgrade from 1.0.0.Final to 1.1.0.Final?
While investigating a clustering issue, I found some related fixes that seem to be packaged in 1.1.0.Final.
I am still working on this, but I would like to know whether there is consensus on upgrading the Wildfly version in the parent.
Thanks,
Lucas
hosa - /metrics can be behind auth; 2 new metrics
by John Mazzitelli
[this is more for Matt W, but will post here]
Two new things in HOSA - these have been released in version 1.2.0.Final and are available on Docker Hub - see https://hub.docker.com/r/hawkular/hawkular-openshift-agent/tags/
1) The agent has its own metrics endpoint (so it can monitor itself). The endpoint is /metrics. This is nothing new.
But /metrics can now be put behind basic auth. If you configure this in the agent config, clients must authenticate to see the metrics:
emitter:
  metrics_credentials:
    username: foo
    password: bar
You can pass these in via environment variables, which means you can use OpenShift secrets for them.
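With that configured, a scrape has to supply the credentials. A minimal sketch - the host and port are assumptions and depend on how the emitter is exposed:

curl -u foo:bar http://<agent-pod-ip>:8080/metrics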
2) There are now two new metrics (both gauges) the agent itself emits:
hawkular_openshift_agent_monitored_pods (The number of pods currently being monitored)
hawkular_openshift_agent_monitored_endpoints (The number of endpoints currently being monitored)
That is all.
Re: [Hawkular-dev] Hawkular APM and instrumenting clojure
by Neil Okamoto
Since this morning I've had a server running inside docker on a separate
machine with more installed memory. I haven't seen any problems since then.
In retrospect I wish I had thought of this sooner.
For now I'm moving on from this problem to do a more complete
instrumentation of the clojure app. I'll keep an eye open for further
problems and I'll report back if there's anything noteworthy.
thanks Gary,
Neil
On Fri, Feb 3, 2017 at 7:53 AM, Neil Okamoto <neil.okamoto(a)gmail.com> wrote:
> Thanks Gary. I'll try running the server outside docker, but before I do
> that I'm going to run the container on a machine with more memory.
>
> > On Feb 3, 2017, at 7:15 AM, Gary Brown <gbrown(a)redhat.com> wrote:
> >
> > Hi Neil
> >
> > Sounds strange. Would it be possible to try running the server outside
> docker to see if there may be issues there.
> >
> > If you create the jira with reproducer then we will investigate aswell.
> >
> > Thanks for the additional info.
> >
> > Regards
> > Gary
> >
> > ----- Original Message -----
> >> Thanks Gary.
> >>
> >> On Fri, Feb 3, 2017 at 1:54 AM, Gary Brown < gbrown(a)redhat.com > wrote:
> >>
> >>
> >>
> >>> (1) Is using a "sampling.priority" of 1 merely advisory? It would
> explain
> >>> everything if those traces are meant to be dropped.
> >>
> >> If using the default constructor for APMTracer, then the default
> behaviour
> >> should be to trace all - and setting the sampling.priority to 1 should
> not
> >> override that. Could you try not setting this tag to see if there is any
> >> difference?
> >>
> >> I see. Well, I am using the default constructor, and I have tried with
> and
> >> without sampling.priority=1 and it's the same situation either way.
> >>
> >>
> >>
> >>> (2) Is there any convenient way I can see, with increased logging or
> >>> something, which traces are actually being sent from the client, and
> which
> >>> are actually received by the server?
> >>
> >> You could initially check the traces stored in Elasticsearch using
> something
> >> like: curl http://localhost:9200/apm-hawkular/trace/_search | python -m
> >> json.tool
> >>
> >> Right now I have a repl launched with HAWKULAR_APM_LOG_LEVEL set to
> FINEST.
> >> I'm creating spans in the repl as described earlier. Each time I create
> a
> >> trace I see a log entry from the client like this:
> >>
> >> FINEST: [TracePublisherRESTClient] [Thread[pool-2-thread-1,5,main]]
> Status
> >> code is: 204
> >>
> >> and that 204 would suggest the trace info was successfully sent. But
> inside
> >> the docker container I can curl Elasticsearch and those new traces are
> not
> >> to be found.
> >>
> >> Incidentally, I started the repl last night, did a few successful
> tests, and
> >> then closed the lid of my laptop for the night with the Hawkular
> container
> >> still running and the repl still running. I've also had this issue occur
> >> immediately on launch of the repl, so I don't think it's specifically
> about
> >> long running repls and/or sleeping, but for completeness I thought I
> would
> >> clarify how I am running this.
> >>
> >>> Do you have a pure Java example that reproduces the same issue? Might
> be
> >>> worth creating a jira in https://issues.jboss.org/projects/HWKAPM to
> track
> >>> the issue.
> >>
> >> No, not yet...
> >>
Hawkular APM and instrumenting clojure
by Neil Okamoto
As an experiment I'm instrumenting a service written in clojure using
opentracing-java. Through the clojure/java interop I've mostly succeeded
in getting trace information reported through to the Hawkular APM server.
I say "mostly succeeded" because sooner or later in every one of my hacking
sessions I get to the point where the spans I am creating in the app are no
longer reported in the web ui.
For convenience I'm using the Hawkular dev docker image
<https://hub.docker.com/r/jboss/hawkular-apm-server-dev>. In my test app
I'm doing nothing more than initializing an APMTracer
<https://github.com/hawkular/hawkular-apm/blob/master/client/opentracing/s...>
with
the appropriate environment variables set, and then calling
buildSpan("foo"), withTag("sampling.priority", 1), start(), sleep for a
while, and then finish(). All of the previous was done in clojure; I'm
talking in pseudocode here just to make the intent clear.
So like I said, sometimes these traces are reported, other times they seem
to be silently dropped. I can't detect any consistent pattern how or why
this happens...
(1) Is using a "sampling.priority" of 1 merely advisory? It would explain
everything if those traces are meant to be dropped.
(2) Is there any convenient way I can see, with increased logging or
something, which traces are actually being sent from the client, and which
are actually received by the server?
Docker image size does matter
by Jiri Kremser
Hello,
I was looking into the google/cadvisor docker image, which is only 47 megs,
and was wondering how we can improve. To some extent it is so small because
of Go, but not only because of that.
Here are the results:
- base image with JRE 8 and Alpine Linux: 76.8 MB
- wildfly 10.1.0.Final image: 215 MB
- hawkular-services image: 320 MB
Just for the record, here is the status quo:
- base CentOS image with JDK 8: 149 MB
- wildfly image: 580 MB
- hawkular-services image: 672 MB
All the mini-images are based on Alpine (which itself is based on BusyBox),
so the price for that is less convenience when debugging the images.
I also removed
9.2M /opt/jboss/wildfly/docs
and wanted to remove
9.0M /opt/jboss/wildfly/modules/system/layers/base/org/hibernate
5.1M /opt/jboss/wildfly/modules/system/layers/base/org/apache/lucene
5.6M /opt/jboss/wildfly/modules/system/layers/base/org/apache/cxf
but for some reason h-services fails to start because it couldn't find
some class from that hibernate module, so I put it back.
What also helped was squashing all the image layers into one. This makes the
download faster and possibly the image smaller. Applying docker-squash
[1] to the current h-services image saves ~50 MB.
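For reference, this is roughly how the squashing can be done; the image names and tag are illustrative and the exact docker-squash flags may differ between versions:

pip install docker-squash
docker-squash -t hawkular/hawkular-services:squashed hawkular/hawkular-services:latest
docker images | grep hawkular-services   # compare squashed vs. original size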
I am aware that this probably won't fly with the RH policy that we should
base our software on Fedora/RHEL base OS images, but I am going to use them
for development, also because I often run out of space because of Docker.
Oh and I haven't published it on dockerhub yet, but the repo is here [2]
jk
[1]: https://github.com/goldmann/docker-squash
[2]: https://github.com/Jiri-Kremser/hawkular-services-mini-dockerfiles