[metrics] configurable data retention
by John Sanda
Pretty much from the start of the project we have provided configurable data retention. There is a system-wide default retention that can be set at start up. You can also set the data retention per tenant as well as per individual metric. Do we need to provide this fine-grained level of configurability, or is it sufficient to only have a system-wide data retention which is configurable?
It is worth noting that in OpenShift *only* the system-wide data retention is set. Recently we have been dealing with a number of production issues including:
* Cassandra crashing with an OutOfMemoryError
* Stats queries failing in Hawkular Metrics due to high read latencies in Cassandra
* Expired data not getting purged in a timely fashion
These issues all involve compaction. In older versions of Hawkular Metrics we were using the default size-tiered compaction strategy (STCS). Time window compaction strategy (TWCS) is better suited for time series data such as ours. We are already seeing good results in some early testing. Using the correct, properly configured compaction strategy can have a significant impact on several things, including:
* I/O usage
* CPU usage
* read performance
* disk usage
TWCS was developed for some very specific use cases that are also common with Cassandra. It is recommended for time series data that meets the following criteria:
* append-only writes
* no deletes
* global (i.e., table-wide) TTL
* few out of order writes (at least it is the exception and not the norm)
It is the third bullet which has prompted this email. If we allow/support different TTLs per tenant and/or per metric, we will lose a lot of the benefits of TWCS and likely continue to struggle with some of the issues we have been facing as of late. If you ask me exactly how well or poorly compaction will perform using mixed TTLs, I can only speculate. I simply do not have the bandwidth to test things that C* docs and C* devs say *not* to do.
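To make the table-wide TTL point concrete: in Cassandra both the compaction strategy and the default TTL are table-level options. The sketch below only builds the CQL statement as a string. The table name `hawkular_metrics.data`, the one-day window, and the seven-day TTL are assumptions chosen for illustration, not the settings actually used in Hawkular Metrics.

```python
# Sketch only: builds a CQL statement; it does not connect to Cassandra.
# Table name, window size and TTL below are hypothetical examples.

ALLOWED_WINDOW_UNITS = {"MINUTES", "HOURS", "DAYS"}


def twcs_alter_statement(table, window_unit, window_size, ttl_seconds):
    """Return an ALTER TABLE statement enabling TWCS with one table-wide TTL."""
    if window_unit not in ALLOWED_WINDOW_UNITS:
        raise ValueError(f"unsupported window unit: {window_unit}")
    if window_size < 1 or ttl_seconds < 0:
        raise ValueError("window_size must be >= 1 and ttl_seconds >= 0")
    return (
        f"ALTER TABLE {table} WITH compaction = {{"
        f"'class': 'TimeWindowCompactionStrategy', "
        f"'compaction_window_unit': '{window_unit}', "
        f"'compaction_window_size': '{window_size}'}} "
        f"AND default_time_to_live = {ttl_seconds};"
    )


# Hypothetical example: one-day windows, data expires after 7 days.
statement = twcs_alter_statement("hawkular_metrics.data", "DAYS", 1, 7 * 24 * 3600)
print(statement)
```

With a single `default_time_to_live`, all data in a time window expires together, which is what lets Cassandra drop whole SSTables instead of compacting them.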
I am of the opinion, at least for OpenShift users, that disk usage is much more important than fine-grained data retentions. A big question I have is, what about outside of OpenShift? This may be a question for some people not on this list, so I want to make sure it does reach the right people.
I think we could potentially tie configurable data retention together with rollups. Let’s say we add support for 15min, 1hr, 6hr, and 24hr rollups, where each rollup is stored in its own table and each larger rollup has a longer retention. Different levels of data retention could be used to determine which rollups a tenant has. If a tenant wants a data retention of a month, for example, then that could translate into generating 15min and 1hr rollups for that tenant.
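A minimal sketch of how a requested retention might map to rollups. The threshold values are hypothetical; they were picked only so that a one-month retention yields the 15min and 1hr rollups mentioned above.

```python
from datetime import timedelta

# Hypothetical thresholds: the minimum requested retention at which each
# rollup would be generated for a tenant. These numbers are assumptions.
ROLLUP_THRESHOLDS = [
    ("15min", timedelta(days=7)),
    ("1hr", timedelta(days=30)),
    ("6hr", timedelta(days=90)),
    ("24hr", timedelta(days=365)),
]


def rollups_for_retention(retention):
    """Return the rollup names a tenant would get for the given retention."""
    return [name for name, minimum in ROLLUP_THRESHOLDS if retention >= minimum]


print(rollups_for_retention(timedelta(days=30)))
```

Each rollup table would then keep its own single, table-wide TTL, so the per-tenant choice becomes which tables are written to rather than which TTL is applied.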
- John
9 years, 2 months
Upgrade to Wildfly 1.1.0.Final
by Lucas Ponce
Hello,
Is there any objection or potential problem if we upgrade from 1.0.0.Final to 1.1.0.Final?
While investigating a clustering issue, I found that some related fixes seem to be packaged in 1.1.0.Final.
I am still working on this, but I would like to know whether there is consensus on upgrading the Wildfly version in parent.
Thanks,
Lucas
9 years, 2 months
hosa - /metrics can be behind auth; 2 new metrics
by John Mazzitelli
[this is more for Matt W, but will post here]
Two new things in HOSA - these have been released under the 1.2.0.Final version and are available on docker hub - see https://hub.docker.com/r/hawkular/hawkular-openshift-agent/tags/
1) The Hawkular OpenShift Agent has its own metrics endpoint (so it can monitor itself). The endpoint is /metrics. This is nothing new.
But the /metrics can now be configured behind basic auth. If you configure this in the agent config, you must authenticate to see the metrics:
emitter:
  metrics_credentials:
    username: foo
    password: bar
You can pass these in via env. vars and thus you can use OpenShift secrets for it.
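For anyone scraping the endpoint by hand, here is a small sketch of a basic-auth request against /metrics using only the Python standard library. The agent address `http://localhost:8080` is an assumption; substitute whatever host and port your agent's emitter actually listens on. The `foo`/`bar` credentials are the ones from the config above.

```python
import base64
import urllib.request


def build_metrics_request(base_url, username, password):
    """Build a GET request for /metrics carrying an HTTP basic auth header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics")
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_metrics(base_url, username, password, timeout=5.0):
    """Fetch the metrics text; raises urllib.error.HTTPError on 401 and similar."""
    request = build_metrics_request(base_url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical address; nothing is sent until fetch_metrics() is called.
example_request = build_metrics_request("http://localhost:8080", "foo", "bar")
print(example_request.full_url)
```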
2) There are now two new metrics (both gauges) the agent itself emits:
hawkular_openshift_agent_monitored_pods (The number of pods currently being monitored)
hawkular_openshift_agent_monitored_endpoints (The number of endpoints currently being monitored)
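A sketch of pulling these two gauges out of the endpoint's response. It assumes the endpoint returns the Prometheus text exposition format with one `name value` pair per line; the sample text and its values are made up for illustration.

```python
AGENT_GAUGES = {
    "hawkular_openshift_agent_monitored_pods",
    "hawkular_openshift_agent_monitored_endpoints",
}

# Made-up sample, assuming Prometheus text format; not real agent output.
SAMPLE = """\
# HELP hawkular_openshift_agent_monitored_pods The number of pods currently being monitored
# TYPE hawkular_openshift_agent_monitored_pods gauge
hawkular_openshift_agent_monitored_pods 3
# TYPE hawkular_openshift_agent_monitored_endpoints gauge
hawkular_openshift_agent_monitored_endpoints 5
"""


def parse_gauges(metrics_text, wanted):
    """Return {metric name: value} for the wanted metrics found in the text."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        name = parts[0].split("{", 1)[0]  # drop a label set if one is attached
        if name in wanted:
            values[name] = float(parts[1])
    return values


print(parse_gauges(SAMPLE, AGENT_GAUGES))
```

This simple split is enough for unlabelled gauges; label values containing spaces would need a proper parser.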
That is all.
9 years, 2 months
Re: [Hawkular-dev] Hawkular APM and instrumenting clojure
by Neil Okamoto
Since this morning I've had a server running inside docker on a separate
machine with more installed memory. I haven't seen any problems since then.
In retrospect I wish I had thought of this sooner.
For now I'm moving on from this problem to do a more complete
instrumentation of the clojure app. I'll keep an eye open for further
problems and I'll report back if there's anything noteworthy.
thanks Gary,
Neil
On Fri, Feb 3, 2017 at 7:53 AM, Neil Okamoto <neil.okamoto(a)gmail.com> wrote:
> Thanks Gary. I'll try running the server outside docker, but before I do
> that I'm going to run the container on a machine with more memory.
>
> > On Feb 3, 2017, at 7:15 AM, Gary Brown <gbrown(a)redhat.com> wrote:
> >
> > Hi Neil
> >
> > Sounds strange. Would it be possible to try running the server outside
> > docker to see if there may be issues there?
> >
> > If you create the jira with a reproducer then we will investigate as well.
> >
> > Thanks for the additional info.
> >
> > Regards
> > Gary
> >
> > ----- Original Message -----
> >> Thanks Gary.
> >>
> >> On Fri, Feb 3, 2017 at 1:54 AM, Gary Brown < gbrown(a)redhat.com > wrote:
> >>
> >>
> >>
> >>> (1) Is using a "sampling.priority" of 1 merely advisory? It would explain
> >>> everything if those traces are meant to be dropped.
> >>
> >> If using the default constructor for APMTracer, then the default behaviour
> >> should be to trace all - and setting the sampling.priority to 1 should not
> >> override that. Could you try not setting this tag to see if there is any
> >> difference?
> >>
> >> I see. Well, I am using the default constructor, and I have tried with and
> >> without sampling.priority=1 and it's the same situation either way.
> >>
> >>
> >>
> >>> (2) Is there any convenient way I can see, with increased logging or
> >>> something, which traces are actually being sent from the client, and which
> >>> are actually received by the server?
> >>
> >> You could initially check the traces stored in Elasticsearch using something
> >> like: curl http://localhost:9200/apm-hawkular/trace/_search | python -m json.tool
> >>
> >> Right now I have a repl launched with HAWKULAR_APM_LOG_LEVEL set to FINEST.
> >> I'm creating spans in the repl as described earlier. Each time I create a
> >> trace I see a log entry from the client like this:
> >>
> >> FINEST: [TracePublisherRESTClient] [Thread[pool-2-thread-1,5,main]] Status code is: 204
> >>
> >> and that 204 would suggest the trace info was successfully sent. But inside
> >> the docker container I can curl Elasticsearch and those new traces are not
> >> to be found.
> >>
> >> Incidentally, I started the repl last night, did a few successful tests, and
> >> then closed the lid of my laptop for the night with the Hawkular container
> >> still running and the repl still running. I've also had this issue occur
> >> immediately on launch of the repl, so I don't think it's specifically about
> >> long running repls and/or sleeping, but for completeness I thought I would
> >> clarify how I am running this.
> >>
> >>> Do you have a pure Java example that reproduces the same issue? Might be
> >>> worth creating a jira in https://issues.jboss.org/projects/HWKAPM to track
> >>> the issue.
> >>
> >> No, not yet...
> >>
> >> _______________________________________________
> >> hawkular-dev mailing list
> >> hawkular-dev(a)lists.jboss.org
> >> https://lists.jboss.org/mailman/listinfo/hawkular-dev
> >>
9 years, 2 months
Hawkular APM and instrumenting clojure
by Neil Okamoto
As an experiment I'm instrumenting a service written in clojure using
opentracing-java. Through the clojure/java interop I've mostly succeeded
in getting trace information reported through to the Hawkular APM server.
I say "mostly succeeded" because sooner or later in every one of my hacking
sessions I get to the point where the spans I am creating in the app are no
longer reported in the web ui.
For convenience I'm using the Hawkular dev docker image
<https://hub.docker.com/r/jboss/hawkular-apm-server-dev>. In my test app
I'm doing nothing more than initializing an APMTracer
<https://github.com/hawkular/hawkular-apm/blob/master/client/opentracing/s...>
with the appropriate environment variables set, and then calling
buildSpan("foo"), withTag("sampling.priority", 1), start(), sleeping for a
while, and then calling finish(). All of this is done in clojure; I'm
describing it in pseudocode here just to make the intent clear.
So like I said, sometimes these traces are reported, and other times they seem
to be silently dropped. I can't detect any consistent pattern in how or why
this happens...
(1) Is using a "sampling.priority" of 1 merely advisory? It would explain
everything if those traces are meant to be dropped.
(2) Is there any convenient way I can see, with increased logging or
something, which traces are actually being sent from the client, and which
are actually received by the server?
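One way to check the server side, sketched with the Python standard library: count the traces stored in Elasticsearch before and after creating a span. The search URL is the one suggested in the reply to this thread and assumes Elasticsearch is reachable on localhost:9200 (for example, from inside the container). The response handling is an assumption that covers both an integer `hits.total` and the object form used by newer Elasticsearch versions.

```python
import json
import urllib.request

# URL taken from the reply in this thread; adjust host/port for your setup.
SEARCH_URL = "http://localhost:9200/apm-hawkular/trace/_search"


def count_hits(search_response):
    """Extract the total hit count from a parsed _search response."""
    total = search_response["hits"]["total"]
    # Older Elasticsearch returns an int, newer versions return {"value": N, ...}.
    return total["value"] if isinstance(total, dict) else total


def stored_trace_count(url=SEARCH_URL, timeout=5.0):
    """Query Elasticsearch and return how many traces it currently stores."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_hits(json.load(response))
```

Calling `stored_trace_count()` before and after a span is finished shows whether the server actually persisted it, independently of the 204 the client logs.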
9 years, 2 months
Docker image size does matter
by Jiri Kremser
Hello,
I was looking into the google/cadvisor docker image, which is only 47 megs
large, and wondering how we can improve. To some extent it is so small
because it is written in Go, but that is not the only reason.
Here are the results:
base image with JRE 8 and Alpine Linux: 76.8 MB
wildfly 10.1.0.Final image: 215 MB
hawkular-services image: 320 MB
Just for the record, here is the status quo:
base CentOS image w/ JDK 8: 149 MB
wf image: 580 MB
hawkular-services image: 672 MB
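For a quick sense of scale, the relative savings can be computed from the sizes listed above. This is plain arithmetic on those numbers; the pairing of each status-quo image with its Alpine-based counterpart is my reading of the two lists.

```python
# (status quo size, mini image size) in MB, taken from the lists above.
SIZES_MB = {
    "base image": (149.0, 76.8),
    "wildfly image": (580.0, 215.0),
    "hawkular-services image": (672.0, 320.0),
}


def reduction_percent(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return (before - after) / before * 100.0


for name, (before, after) in SIZES_MB.items():
    saved = reduction_percent(before, after)
    print(f"{name}: {before:g} MB -> {after:g} MB ({saved:.1f}% smaller)")
```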
All the mini-images are based on Alpine (which itself is based on BusyBox),
so the price is less convenience when debugging the images.
I also removed
9.2M /opt/jboss/wildfly/docs
and wanted to remove
9.0M /opt/jboss/wildfly/modules/system/layers/base/org/hibernate
5.1M /opt/jboss/wildfly/modules/system/layers/base/org/apache/lucene
5.6M /opt/jboss/wildfly/modules/system/layers/base/org/apache/cxf
but for some reason h-services fails to start because it can't find
some class from the hibernate module, so I put it back.
What also helped was squashing all the image layers into one. This makes the
download faster and possibly the image smaller. When applying docker-squash
[1] to the current h-services image, it saves ~50 megs.
I am aware that this probably won't fly with some RH policy that we should
base our SW on Fedora/RHEL base OS images, but I am going to use them for
development, because I often run out of disk space because of Docker.
Oh and I haven't published it on dockerhub yet, but the repo is here [2]
jk
[1]: https://github.com/goldmann/docker-squash
[2]: https://github.com/Jiri-Kremser/hawkular-services-mini-dockerfiles
9 years, 2 months