[Hawkular-dev] [Metrics] Evaluating performance of #groupBy vs #window for stats computation

Thursday, 27 August 2015

Hi,

Recently, we ported the stats computation code to RxJava; that part was
not fully "reactive" yet. The intention was to be able to compute
statistics on raw data over long periods of time with low memory
requirements (the initial implementation needed all data loaded at once
in a list).

If you're interested in the $subject and don't know what stats
computation is, it's an API Metrics provides to give the user stats
(min/max/avg/median/95th percentile) on raw data. The stats are given in
buckets (portions of time). As an example, it's the API you use if you
want monthly stats on a year of data, or hourly stats for a day, etc.

The implementation makes use of the Observable#groupBy operator. This
lets us determine the bucket each data point belongs to, and then we
#collect the points in each bucket to compute the stats.
https://github.com/hawkular/hawkular-metrics/blob/master/core/metrics-cor...
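To make the bucketing idea concrete, here is a minimal plain-Java sketch of the same "group by bucket, then reduce" approach. It does not use RxJava and it is not the Hawkular Metrics code: the class, the Point type and the method names are all hypothetical, and it only computes min/max/avg (median and percentiles are left out to keep it short).

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical names, not the Hawkular Metrics code.
class BucketStatsSketch {

    // A raw data point: timestamp (ms) and value.
    static final class Point {
        final long timestamp;
        final double value;

        Point(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Index of the fixed-size bucket a timestamp falls into.
    // Assumes timestamp >= start and bucketSize > 0.
    static int bucketIndex(long timestamp, long start, long bucketSize) {
        return (int) ((timestamp - start) / bucketSize);
    }

    // Group points by bucket index and reduce each group to min/max/avg.
    static Map<Integer, DoubleSummaryStatistics> statsByBucket(
            List<Point> points, long start, long bucketSize) {
        Map<Integer, DoubleSummaryStatistics> buckets = new TreeMap<>();
        for (Point p : points) {
            int index = bucketIndex(p.timestamp, start, bucketSize);
            buckets.computeIfAbsent(index, k -> new DoubleSummaryStatistics())
                   .accept(p.value);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Point> points = Arrays.asList(
                new Point(0, 1.0), new Point(500, 3.0), new Point(1000, 10.0));
        // One-second buckets starting at t=0.
        statsByBucket(points, 0, 1000).forEach((index, stats) ->
                System.out.println(index + " -> " + stats));
    }
}
```

Note that each bucket only keeps a running summary, not the list of points, which is the property that keeps memory requirements low.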

As we were talking on a related topic (stats for sliding windows), John 
wondered if a solution based on Observable#window would perform equally.
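For comparison, a window-style computation can be sketched in plain Java as follows. This is only an analogy to Observable#window, not RxJava code: it assumes the points arrive sorted by ascending timestamp (an assumption of this sketch), so a bucket is complete as soon as a point crosses its boundary. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.DoubleSummaryStatistics;
import java.util.List;

// Illustrative only: hypothetical names, not the Hawkular Metrics code.
class WindowStatsSketch {

    // Returns one summary per non-empty bucket, in time order.
    // Assumes timestamps are sorted ascending, all >= start, and that
    // both arrays have the same length.
    static List<DoubleSummaryStatistics> statsPerWindow(
            long[] timestamps, double[] values, long start, long bucketSize) {
        List<DoubleSummaryStatistics> result = new ArrayList<>();
        DoubleSummaryStatistics current = null;
        long currentBucket = -1;
        for (int i = 0; i < timestamps.length; i++) {
            long bucket = (timestamps[i] - start) / bucketSize;
            if (current == null || bucket != currentBucket) {
                // Boundary crossed: the previous window is complete,
                // so open a new one.
                current = new DoubleSummaryStatistics();
                result.add(current);
                currentBucket = bucket;
            }
            current.accept(values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {0, 500, 1000, 2500};
        double[] values = {1.0, 3.0, 10.0, 4.0};
        // One-second windows starting at t=0.
        statsPerWindow(timestamps, values, 0, 1000)
                .forEach(System.out::println);
    }
}
```

Only one window is open at any time here, whereas a grouping approach keeps one accumulator per bucket until the input ends.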

So I've implemented this solution and instrumented the code to determine 
where the system was spending the execution time. Then I tried both 
solutions with different bucket sizes, metric resolutions, numbers of 
buckets.

Here are my observations.

#groupBy and #window behave about the same in our case (I tried with
anything from a few buckets up to a thousand buckets).

Response time increases linearly with the number of data points loaded
from C*. Response time here means the time elapsed between the moment
Wildfly invokes the JAX-RS handler method and the moment we resume the
AsyncResponse.

The bottleneck is the data loading and mapping: 95% of execution time. 
C* Row to object mapping is nearly 50% of it.

And my conclusions.

When working with non-overlapping buckets (not sliding windows),
readability should determine the use of #groupBy or #window, not
performance.

It seems possible to get a better response time. Currently we transform 
a ResultSet into an Observable<Row> with a simple call to 
Observable#from (ResultSet is an Iterable<Row>). By default, the C* 
driver fetches a page of data, and only when it's entirely consumed, 
fetches the following one. But ResultSet has a #fetchMoreResults method 
which we could use to fetch pages ahead of time (see 
http://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/Res...). 
Then, while Rx computation threads would spend time on mapping a Row to 
a DataPoint<T>, the C* driver could load more Rows.
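The fetch-ahead idea can be sketched with JDK classes only. This is a simulation, not driver code: fetchPage is a hypothetical stand-in for a paged query that returns an empty list once the data is exhausted. With the real driver, the page request would be the ResultSet#fetchMoreResults call mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.function.IntFunction;

// Illustrative only: hypothetical names, no C* driver involved.
class PrefetchSketch {

    // Maps every row of every page. Page n+1 is requested asynchronously
    // before page n is mapped, so fetching and mapping overlap.
    static <R, T> List<T> loadAll(IntFunction<List<R>> fetchPage,
                                  Function<R, T> mapper) {
        List<T> out = new ArrayList<>();
        int pageNo = 0;
        CompletableFuture<List<R>> next =
                CompletableFuture.supplyAsync(() -> fetchPage.apply(0));
        while (true) {
            List<R> page = next.join(); // wait for the page requested earlier
            if (page.isEmpty()) {
                return out; // an empty page marks the end of the data
            }
            final int following = ++pageNo;
            // Request the following page before mapping the current one.
            next = CompletableFuture.supplyAsync(
                    () -> fetchPage.apply(following));
            for (R row : page) {
                out.add(mapper.apply(row));
            }
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> pages = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3));
        List<String> mapped = PrefetchSketch.<Integer, String>loadAll(
                i -> i < pages.size()
                        ? pages.get(i)
                        : Collections.<Integer>emptyList(),
                row -> "point-" + row);
        System.out.println(mapped);
    }
}
```

The sketch fetches only one page ahead; how far ahead to fetch is a tuning decision that would need measuring.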

Attached are:
- a patch of the changes implemented for testing
- some of the results

Regards,
Thomas
