[Hawkular-dev] Low-impact clients and never dropping any events

Thomas Segismont tsegismo at redhat.com
Mon Feb 16 05:03:49 EST 2015


Le 13/02/2015 18:43, Randall Hauch a écrit :
> Thanks for the response, Thomas. I have more questions inline. Thanks
> for humoring me.
>
>> On Feb 13, 2015, at 10:49 AM, Thomas Segismont <tsegismo at redhat.com
>> <mailto:tsegismo at redhat.com>> wrote:
>>
>> Hi Randall,
>>
>> Answers inline.
>>
>> Le 13/02/2015 16:12, Randall Hauch a écrit :
>>> Forgive my ignorance, but I’m new to the list and I didn’t see
>>> anything in the archives about $subject, detailed below. Lately I’ve
>>> been very interested in several topics ancillary to monitoring, so
>>> I’m quite intrigued by the planned direction and approach.
>>>
>>> How do clients/systems/services that are to be monitored actually
>>> send their monitorable information? What is the granularity of this
>>> information: is it already summarized or somewhat aggregated in the
>>> client, or is it very low-level and fine-grained events? What is the
>>> impact on the client of adding this extra overhead?
>>
>> There are different options for sending:
>>
>> 1. External collectors
>> A collector running as an independent process queries the monitored
>> system, which exposes, somehow, runtime information. Then the collector
>> sends the information to Hawkular.
>> Examples: rhq agent, collectd, jmxtrans
>>
>> 2. Embedded collectors
>> Same as above, except that the collector runs in the same process as the
>> monitored system.
>> Examples: Wildfly monitor, embedded-jmxtrans, codahale metrics (if
>> configured with a reporter other than JMX)
>>
>> 3. Custom
>> Any solution which sends information to Hawkular without resorting to a
>> collector.
>>
>> Granularity is not enforced: you could send the raw values of a counter
>> at different points in time, or send a locally computed derivative for
>> the last minute.
>
> What do the collectors submit? It seems like there are two options:
>
> a) periodically capture metrics; or
> b) capture every “event of interest” whenever it occurs

The collectors I've talked about earlier fall into category (a).
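
To make (a) concrete, here is a minimal sketch of such a collector: a
scheduled task samples a value and pushes it over HTTP. The URL and JSON
payload below are made up for the example, they are not the actual
Hawkular metrics API.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGaugeSampler {

    // Placeholder endpoint, for illustration only
    private static final String METRICS_URL =
            "http://hawkular.example.com/metrics/gauges/pool.size/data";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Category (a): sample the monitored resource every 15 seconds
        scheduler.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                long value = sampleConnectionPoolSize(); // JMX read, /proc, ...
                String json = "[{\"timestamp\":" + System.currentTimeMillis()
                        + ",\"value\":" + value + "}]";
                try {
                    HttpURLConnection conn =
                            (HttpURLConnection) new URL(METRICS_URL).openConnection();
                    conn.setRequestMethod("POST");
                    conn.setRequestProperty("Content-Type", "application/json");
                    conn.setDoOutput(true);
                    try (OutputStream out = conn.getOutputStream()) {
                        out.write(json.getBytes(StandardCharsets.UTF_8));
                    }
                    conn.getResponseCode(); // 2xx means the point was accepted
                    conn.disconnect();
                } catch (Exception e) {
                    // a real collector would buffer and retry here
                }
            }
        }, 0, 15, TimeUnit.SECONDS);
    }

    private static long sampleConnectionPoolSize() {
        return 42; // placeholder for a real JMX/MBean read
    }
}

Everything interesting (batching, buffering, retries) happens on the
client side here; Hawkular only sees plain HTTP requests.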

>
> IIUC, monitoring something like a JMX-enabled system would likely be (a),
> but (b) is really where the value is. Yes, (b) is more invasive but it
> lets you capture every possible activity and derive metrics accurately
> without losing transient spikes.
>

You need (a) *and* (b) to build a comprehensive view of what is/was 
going on in your system.


RHQ has always been (a) only. Hawkular will focus on (a) initially.
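
One cheap way to close part of the gap, while still reporting
periodically, is to aggregate locally in the client before sending: for
example, keep a high-water mark per sampling interval, so a spike that
comes and goes between two samples is still visible in the reported
value. A rough sketch, nothing Hawkular-specific in it:

import java.util.concurrent.atomic.AtomicLong;

/**
 * Tracks the current value of a resource (e.g. a connection pool) and the
 * maximum seen since the last report, so a transient spike between two
 * periodic samples is not lost.
 */
public class HighWaterMarkGauge {

    private final AtomicLong current = new AtomicLong();
    private final AtomicLong maxSinceLastReport = new AtomicLong();

    /** Call on every "connection acquired" (or similar) event. */
    public void increment() {
        long value = current.incrementAndGet();
        long max;
        do {
            max = maxSinceLastReport.get();
        } while (value > max && !maxSinceLastReport.compareAndSet(max, value));
    }

    /** Call on every "connection released" event. */
    public void decrement() {
        current.decrementAndGet();
    }

    /** Called by the periodic reporter; returns and resets the interval max. */
    public long maxAndReset() {
        return maxSinceLastReport.getAndSet(current.get());
    }
}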

> The problem with (a) is that you might miss important cases, since each
> captured metric represents a measurement at a single instant in time.
> Consider monitoring a system periodically (e.g., every 15 seconds) to
> obtain some metric (e.g., the size of a db connection pool, etc). If a
> spike occurs and is resolved *within a single interval*, then the
> captured metric will never reflect this, and any derived aggregate (max,
> min, average) will also not reflect the spike.
>
> Examples of (b) include low-level system information like "create
> database connection”, “closed database connection”, “start new thread”,
> “stop thread”, etc., but it also includes recording business-related
> events, such as “received request from user X from IP address A to read
> resource R”. The latter is insanely valuable, because customers can
> build downstream services that consume Hawkular output and get valuable
> business insight and perform business analytics. Those custom services
> can do all kinds of interesting things, such as automatically rate-limit
> requests from a specific user, bill based upon usage, audit access for
> compliance purposes, prevent DoS by rejecting requests from suspect IP
> addresses, etc.
>
> This is what is called “event sourcing”, and it records all interesting
> events and forms the basis for a lot of near-real-time analytics and
> monitoring solutions. And, an architecture designed for this purpose can
> do things like persist every event in a persistent transaction log, and
> services will consume this stream/log, do work, and output to other
> persistent logs/streams. These streams and services form a workflow, but
> it’s not quite the same as a traditional pub/sub set of services because
> each service consumes the events at its own pace. The net effect is
> that you can take a service down for maintenance (or it can crash), and
> when it comes back up it will simply continue where it left off.
> Upstream services are unaffected, and downstream services just see a
> lull in events but otherwise are unaffected. Plus, you never lose any
> events or any data, and all aggregates and windowing (done in downstream
> services) are entirely accurate. You can even set up separate streams
> for windowed metrics (e.g., minute, hour, day, week, etc.)
>
> Here’s an article that describes how LinkedIn is using stream-oriented
> processing to capture metrics and events for their systems, and how they
> derive benefit for the business:
> http://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza.
> They process 500 billion events per day (yes, billion). Here are a
> couple of other links:
> http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/,
> http://www.slideshare.net/InfoQ/samza-in-linkedin-how-linkedin-processes-billions-of-events-everyday-in-realtime,
> http://scale-out-blog.blogspot.co.uk/2014/02/why-arent-all-data-immutable.html,
> and the ever-essential
> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>

We have plans for "events" storage in Hawkular metrics. You can already 
add tags to your numeric data, but without "events" you can't build 
something like what you describe above.
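
To illustrate the "continue where it left off" property you describe:
each downstream service keeps its own offset into the durable log and
commits it only after it has processed a batch, so a crash just means
re-reading from the last committed offset. A deliberately generic
sketch; the EventLog and OffsetStore interfaces are made up and are not
an existing Hawkular API:

import java.util.List;

/** Hypothetical durable, append-only log of events (not a Hawkular API). */
interface EventLog {
    /** Returns the events stored at positions [offset, offset + max). */
    List<String> read(long offset, int max);
}

/** Hypothetical store for the consumer's own progress. */
interface OffsetStore {
    long load();
    void save(long offset);
}

public class LogConsumer {

    private final EventLog log;
    private final OffsetStore offsets;

    public LogConsumer(EventLog log, OffsetStore offsets) {
        this.log = log;
        this.offsets = offsets;
    }

    public void runOnce() {
        long offset = offsets.load();        // where we left off, even after a crash
        List<String> batch = log.read(offset, 100);
        for (String event : batch) {
            process(event);                  // derive metrics, bill, rate-limit, ...
        }
        offsets.save(offset + batch.size()); // commit progress only after processing
    }

    private void process(String event) {
        // downstream work goes here
    }
}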

>
>>
>> The impact when sending information with a collector or embedded
>> collector should be pretty low. Most existing solutions do some sort
>> of batching. With a custom sender, it all depends on how it's designed,
>> obviously.
>
> Sounds good.
>
>>
>>>
>>> Do you have an estimate or goal for how much volume of incoming data
>>> can be handled without impacting on clients? What, if anything, does
>>> a client submission wait for on the back-end?
>>
>> Hawkular metrics is designed to be horizontally scalable, so the volume
>> of data you can absorb should depend on the number of machines you can
>> throw at it.
>
> How will the work be split up across the cluster? Are the incoming
> messages stored before they’re processed?
>

A Hawkular metrics server is a web server (Wildfly) in front of a 
Cassandra cluster.

So you'd distribute the work with a load balancer.
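
If you'd rather not put a load balancer in front, a sender can also
spread its requests itself. A trivial client-side round-robin sketch
(the server URLs are whatever you deployed; nothing here is a real
Hawkular API):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicInteger;

/** Spreads write requests over several Hawkular metrics servers, client side. */
public class RoundRobinSender {

    private final String[] servers;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinSender(String... servers) {
        this.servers = servers;
    }

    public int post(String path, byte[] jsonBody) throws Exception {
        int index = (next.getAndIncrement() & Integer.MAX_VALUE) % servers.length;
        HttpURLConnection conn =
                (HttpURLConnection) new URL(servers[index] + path).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        conn.getOutputStream().write(jsonBody);
        return conn.getResponseCode(); // caller decides what to do on non-2xx
    }
}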

I'm not sure I understand your second question.

>>
>> Most collectors buffer data to be sent and operate in separate threads.
>> So if the ingestion rate on the Hawkular side slows down, they'll buffer
>> more and consume more memory.
>> Other than that, it should have limited impact on your service.
>
> Sounds good.
>
>>
>>>
>>> Also, how do you plan to ensure that, no matter what happens to the
>>> Hawkular system or anything it depends upon, no client information is
>>> ever lost or dropped?
>>
>> Usually collectors will drop data once buffers are full. If you want to
>> make sure no data is lost, then you need to build a custom sender.
>> Hawkular metrics has an HTTP interface so the response code should tell
>> you if a metric was successfully persisted.
>
> So I understand that the collectors will drop any buffered data, since
> that will be unsent if the monitored system (or external collectors)
> crash. But what happens if Hawkular suffers a partial or total crash?
> What data is lost? What happens to data that was in-process when the
> crash occurred? Stream-based architectures are interesting because they
> often can handle more load, partition it more effectively, and are more
> durable.

If you lose Wildfly servers or Cassandra nodes, the rest of the cluster 
should continue to work.

As a Hawkular metrics client, you can't tell what happened to your data 
if you lose the connection before you get a response with a status code. 
When a Wildfly server crashes, your data may still be being unmarshaled, 
may have reached the Cassandra driver, or may already have been written 
to the Cassandra logs.

We have no plans ATM to implement log-based event processors ourselves. 
In the beginning, we'll focus on existing and widespread collectors 
(generally buffer-based) and on in-house collectors like the wildfly-monitor.
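
If you really need the "never drop anything" behaviour with the pieces
available today, the sender has to do the bookkeeping itself: keep each
point buffered (or journaled locally) until Hawkular answers with a 2xx,
and re-send when in doubt. Re-sending the same timestamped point should
be harmless since it just overwrites the same value. A rough sketch; the
endpoint is again a placeholder:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Custom sender: a data point stays queued until the server confirms it
 * with a 2xx response, so an unanswered request is simply retried.
 */
public class ReliableSender implements Runnable {

    private final BlockingQueue<String> pending = new ArrayBlockingQueue<>(10000);
    private final String endpoint; // placeholder, not the real Hawkular metrics URL

    public ReliableSender(String endpoint) {
        this.endpoint = endpoint;
    }

    /** Blocks when the buffer is full instead of silently dropping. */
    public void submit(String jsonDataPoint) throws InterruptedException {
        pending.put(jsonDataPoint);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String point = pending.take();
                while (!send(point)) {
                    Thread.sleep(1000); // back off, then re-send the same point
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private boolean send(String json) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            int status = conn.getResponseCode();
            return status >= 200 && status < 300; // only then is the point known to be stored
        } catch (Exception e) {
            return false; // connection lost: outcome unknown, so retry
        }
    }
}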

>
>>
>>>
>>> Finally, is the plan to make Hawkular embeddable (excluding the stuff
>>> that has to be embedded in monitored clients/systems/services), or
>>> only a separate turn-key (i.e., install-and-run-and-use) system?
>>
>> Hawkular metrics comes in two forms:
>> * a Java library (metrics-core)
>> * a Java EE web application (built on top of the library)
>>
>> metrics-core can be embedded in any sort of JVM application but it
>> expects to find a Cassandra cluster somewhere.
>>
>>
>>
>> I hope it helps. Feel free to ask for details.
>>
>> And welcome to Hawkular!
>>
>> Thomas
>> _______________________________________________
>> hawkular-dev mailing list
>> hawkular-dev at lists.jboss.org <mailto:hawkular-dev at lists.jboss.org>
>> https://lists.jboss.org/mailman/listinfo/hawkular-dev
>


