[Hawkular-dev] Low-impact clients and never dropping any events

Randall Hauch rhauch at redhat.com
Fri Feb 13 12:43:42 EST 2015


Thanks for the response, Thomas. I have more questions inline. Thanks for humoring me.

> On Feb 13, 2015, at 10:49 AM, Thomas Segismont <tsegismo at redhat.com> wrote:
> 
> Hi Randall,
> 
> Answers inline.
> 
> Le 13/02/2015 16:12, Randall Hauch a écrit :
>> Forgive my ignorance, but I’m new to the list and I didn’t see anything in the archives about $subject, detailed below. Lately I’ve been very interested in several topics ancillary to monitoring, so I’m quite intrigued by the planned direction and approach.
>> 
>> How do clients/systems/services that are to be monitored actually send their monitorable information? What is the granularity of this information: is it already summarized or somewhat aggregated in the client, or is it very low-level and fine-grained events? What is the impact on the client of adding this extra overhead?
> 
> There are different options for sending:
> 
> 1. External collectors
> A collector running as an independent process queries the monitored 
> system, which exposes, somehow, runtime information. Then the collector 
> sends the information to Hawkular.
> Examples: rhq agent, collectd, jmxtrans
> 
> 2. Embedded collectors
> Same as above, except that the collector runs in the same process as the 
> monitored system.
> Examples: Wildfly monitor, embedded-jmxtrans, codahale metrics (if 
> configured with a reporter other than JMX)
> 
> 3. Custom
> Any solution which sends information to Hawkular without resorting to a 
> collector.
> 
> Granularity is not enforced: at different points in time, you could send 
> the values of a counter or send a locally computed derivative for the 
> last minute.

What do the collectors submit? It seems like there are two options:

a) periodically capture metrics; or
b) capture every “event of interest” whenever it occurs

IIUC, monitoring something like a JMX-enabled system would likely be (a), but (b) is really where the value is. Yes, (b) is more invasive, but it lets you capture every possible activity and derive metrics accurately without losing transient spikes.

The problem with (a) is that you might miss important cases, since each captured metric represents a measurement at a single instant in time. Consider monitoring a system periodically (e.g., every 15 seconds) to obtain some metric (e.g., the size of a db connection pool, etc.). If a spike occurs and is resolved *within a single interval*, then the captured metric will never reflect it, and any derived aggregate (max, min, average) will also miss the spike.
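Here’s a contrived sketch (plain Java, not Hawkular code) of what I mean: the pool spikes for a few seconds inside a single 15-second sampling interval, so the periodic samples never see it, while capturing every change does.

// Contrived example: a db connection pool spikes to 95 connections for 5
// seconds inside one 15-second sampling interval. The periodic samples never
// see the spike; capturing every change as an event does.
import java.util.ArrayList;
import java.util.List;

public class SamplingMissesSpikes {
    public static void main(String[] args) {
        // True pool size, second by second, over one minute: baseline of 10,
        // spiking to 95 between t=20s and t=25s.
        int[] poolSize = new int[60];
        for (int t = 0; t < 60; t++) {
            poolSize[t] = (t >= 20 && t < 25) ? 95 : 10;
        }

        // Option (a): sample every 15 seconds (t = 0, 15, 30, 45).
        List<Integer> samples = new ArrayList<>();
        for (int t = 0; t < 60; t += 15) {
            samples.add(poolSize[t]);
        }

        // Option (b): record every change of the pool size as an event.
        List<Integer> events = new ArrayList<>();
        int previous = -1;
        for (int t = 0; t < 60; t++) {
            if (poolSize[t] != previous) {
                events.add(poolSize[t]);
                previous = poolSize[t];
            }
        }

        // Max derived from the samples is 10 (the spike is invisible);
        // max derived from the events is 95 (the spike is preserved).
        System.out.println("max from 15s samples: "
                + samples.stream().mapToInt(Integer::intValue).max().getAsInt());
        System.out.println("max from change events: "
                + events.stream().mapToInt(Integer::intValue).max().getAsInt());
    }
}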

Examples of (b) include low-level system information like "create database connection”, “closed database connection”, “start new thread”, “stop thread”, etc., but it also includes recording business-related events, such as “received request from user X from IP address A to read resource R”. The latter is insanely valuable, because customers can build downstream services that consume Hawkular output and get valuable business insight and perform business analytics. Those custom services can do all kinds of interesting things, such as automatically rate-limit requests from a specific user, bill based upon usage, audit access for compliance purposes, prevent DoS by rejecting requests from suspect IP addresses, etc.

This is what is called “event sourcing”: it records all interesting events and forms the basis for a lot of near-real-time analytics and monitoring solutions. An architecture designed for this purpose persists every event in a durable transaction log, and services consume this stream/log, do work, and output to other persistent logs/streams. These streams and services form a workflow, but it’s not quite the same as a traditional pub/sub set of services, because each service consumes the events at its own pace. The net effect is that you can take a service down for maintenance (or it can crash), and when it comes back up it will simply continue where it left off. Upstream services are unaffected, and downstream services just see a lull in events but are otherwise unaffected. Plus, you never lose any events or any data, and all aggregates and windowing (done in downstream services) are entirely accurate. You can even set up separate streams for windowed metrics (e.g., minute, hour, day, week, etc.).
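A minimal in-memory sketch of that idea (the names are made up for illustration; a real deployment would use something like Kafka/Samza, with the log on disk and offsets committed durably):

import java.util.ArrayList;
import java.util.List;

public class LogAndConsumers {

    // An append-only event log: events are only ever appended, never mutated.
    static class EventLog {
        private final List<String> entries = new ArrayList<>();
        synchronized void append(String event) { entries.add(event); }
        synchronized List<String> readFrom(int offset) {
            return new ArrayList<>(entries.subList(offset, entries.size()));
        }
    }

    // Each consumer remembers only its own offset, so it can stop (crash or
    // maintenance) and later resume exactly where it left off. Producers and
    // other consumers are unaffected.
    static class Consumer {
        private final String name;
        private int offset = 0;
        Consumer(String name) { this.name = name; }
        void poll(EventLog log) {
            for (String event : log.readFrom(offset)) {
                System.out.println(name + " processed: " + event);
                offset++;   // a real system would commit this offset durably
            }
        }
    }

    public static void main(String[] args) {
        EventLog log = new EventLog();
        Consumer billing = new Consumer("billing");
        Consumer rateLimiter = new Consumer("rate-limiter");

        log.append("user X read resource R from IP A");
        billing.poll(log);   // billing is now up to date

        log.append("user X read resource R from IP A");
        log.append("user Y opened a db connection");

        // The rate limiter was "down" for the first event; when it finally
        // polls, it simply catches up from offset 0 and nothing is lost.
        rateLimiter.poll(log);
        billing.poll(log);
    }
}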

Here’s an article that describes how LinkedIn is using stream-oriented processing to capture metrics and events for their systems, and how they derive benefit for the business: http://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza. They process 500 billion events per day (yes, billion). Here are a couple of other links: http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/, http://www.slideshare.net/InfoQ/samza-in-linkedin-how-linkedin-processes-billions-of-events-everyday-in-realtime, http://scale-out-blog.blogspot.co.uk/2014/02/why-arent-all-data-immutable.html, and the ever-essential http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


> 
> The impact when sending information with a collector or embedded 
> collector should be pretty low. Most of existing solutions do some sort 
> of batching. With a custom sender, it all depends on how it's designed, 
> obviously.

Sounds good.

> 
>> 
>> Do you have an estimate or goal for how much volume of incoming data can be handled without impacting on clients? What, if anything, does a client submission wait for on the back-end?
> 
> Hawkular metrics is designed to be horizontally scalable so the volume 
> of data you can absorb should depend on the number of machines you can 
> throw in the game.

How will the work be split up across the cluster? Are the incoming messages stored before they’re processed? 

> 
> Most collectors buffer data to be sent and operate in separate threads. 
> So if the metrics ingestion rate decreases, they'll consume more memory. 
> Other than that, it should have limited impact on your service.

Sounds good.

> 
>> 
>> Also, how do you plan to ensure that, no matter what happens to the Hawkular system or anything it depends upon, no client information is ever lost or dropped?
> 
> Usually collectors will drop data once buffers are full. If you want to 
> make sure no data is lost, then you need to build a custom sender. 
> Hawkular metrics has an HTTP interface so the response code should tell 
> you if a metric was successfully persisted.

So I understand that any buffered data will be dropped if the monitored system (or an external collector) crashes before it is sent. But what happens if Hawkular suffers a partial or total crash? What data is lost? What happens to data that was in-process when the crash occurred? Stream-based architectures are interesting because they can often handle more load, partition it more effectively, and are more durable.
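To make the custom-sender idea concrete, here’s a rough sketch of a client that only drops a data point from its local buffer once the HTTP response code says it was persisted. The endpoint URL and JSON shape are placeholders; the real Hawkular Metrics REST paths and document format should come from its docs.

// Sketch of a custom sender: buffer data points locally and only discard them
// once Hawkular Metrics acknowledges them over HTTP. The endpoint and payload
// are assumptions for illustration.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

public class ReliableMetricSender {

    // Unacknowledged data points; a real sender would bound this buffer and
    // possibly spill it to disk so nothing is lost across restarts.
    private final Deque<String> pending = new ArrayDeque<>();
    private final String endpoint;   // assumed, e.g. "http://hawkular-host:8080/..."

    public ReliableMetricSender(String endpoint) { this.endpoint = endpoint; }

    public void enqueue(String metricJson) { pending.addLast(metricJson); }

    /** Try to drain the buffer; anything not acknowledged stays queued for retry. */
    public void flush() {
        while (!pending.isEmpty()) {
            String json = pending.peekFirst();
            if (post(json)) {
                pending.removeFirst();   // acknowledged -> safe to drop locally
            } else {
                return;                  // keep it buffered and retry later
            }
        }
    }

    private boolean post(String json) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            // A 2xx response is taken to mean the server persisted the data.
            int status = conn.getResponseCode();
            return status >= 200 && status < 300;
        } catch (Exception e) {
            return false;   // network error: keep the data point buffered
        }
    }
}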

> 
>> 
>> Finally, is the plan to make Hawkular embeddable (excluding the stuff that has to be embedded in monitored clients/systems/services), or only a separate turn-key (i.e., install-and-run-and-use) system?
> 
> Hawkular metrics comes in two forms:
> * a Java library (metrics-core)
> * a Java EE web application (built on top of the library)
> 
> metrics-core can be embedded in any sort of JVM application but it 
> expects to find a Cassandra cluster somewhere.
> 
> 
> 
> I hope it helps. Feel free to ask for details.
> 
> And welcome to Hawkular!
> 
> Thomas
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev
