Hi,
Recently, we ported the stats computation code to RxJava; this part was
not fully "reactive" yet. The intention was to be able to compute statistics
on raw data over long periods of time, with low memory requirements
(the initial implementation needed all data loaded at once in a list).
If you're interested in the $subject and don't know what stats
computation is, it's an API Metrics provides to give the user stats
(min/max/avg/median/95th percentile) on raw data. The stats are given in
buckets (portions of time). For example, it's the API you use if you
want monthly stats on a year of data, or hourly stats for a day, and so on.
The implementation makes use of the Observable#groupBy operator. This
lets us determine the bucket each data point belongs to, and then we
#collect the points in each bucket to compute the stats.
https://github.com/hawkular/hawkular-metrics/blob/master/core/metrics-cor...
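To make the bucketing idea concrete, here is a minimal plain-Java sketch. It is
not the Hawkular code and does not use RxJava: the Point class, the bucketIndex
helper and the stats method are hypothetical names made up for illustration, and
it assumes fixed-size buckets with every timestamp at or after the start. It
only covers min/max/avg, which can be accumulated one point at a time; median
and percentiles need more state and are left out.

```java
import java.util.ArrayList;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BucketStats {

    /** Hypothetical minimal data point: a timestamp in milliseconds and a value. */
    static final class Point {
        final long timestamp;
        final double value;

        Point(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Index of the fixed-size bucket a timestamp falls into (assumes timestamp >= start). */
    static long bucketIndex(long timestamp, long start, long bucketSize) {
        return (timestamp - start) / bucketSize;
    }

    /** Accumulates min/max/avg per bucket, one point at a time, without keeping the points. */
    static Map<Long, DoubleSummaryStatistics> stats(List<Point> points, long start, long bucketSize) {
        Map<Long, DoubleSummaryStatistics> result = new TreeMap<>();
        for (Point p : points) {
            long bucket = bucketIndex(p.timestamp, start, bucketSize);
            result.computeIfAbsent(bucket, k -> new DoubleSummaryStatistics()).accept(p.value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Point> points = new ArrayList<>();
        points.add(new Point(0L, 1.0));
        points.add(new Point(500L, 3.0));
        points.add(new Point(1000L, 10.0));
        // One-second buckets starting at timestamp 0
        stats(points, 0L, 1000L).forEach((bucket, s) ->
                System.out.println(bucket + ": min=" + s.getMin()
                        + " max=" + s.getMax() + " avg=" + s.getAverage()));
    }
}
```

The grouping key plays the role of the key selector passed to #groupBy, and the
per-bucket accumulator plays the role of #collect.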
While we were discussing a related topic (stats for sliding windows), John
wondered if a solution based on Observable#window would perform equally well.
So I've implemented this solution and instrumented the code to determine
where the system was spending the execution time. Then I tried both
solutions with different bucket sizes, metric resolutions, numbers of
buckets.
Here are my observations.
- #groupBy and #window behave about the same in our case (I've tried with
  a few buckets up to a thousand buckets).
- Response time increases linearly with the number of data points loaded
  from C*. Response time here means the time elapsed between the moment
  Wildfly invokes the JAX-RS handler method and the moment we resume the
  AsyncResponse.
- The bottleneck is the data loading and mapping: it takes 95% of the
  execution time. C* Row to object mapping is nearly 50% of that.
And my conclusions.
- When working with non-overlapping buckets (not sliding windows),
  readability should determine the use of #groupBy or #window, not
  performance.
- It seems possible to get a better response time. Currently we transform
  a ResultSet into an Observable<Row> with a simple call to
  Observable#from (ResultSet is an Iterable<Row>). By default, the C*
  driver fetches a page of data and fetches the following one only when
  the current page is entirely consumed. But ResultSet has a
  #fetchMoreResults method which we could use to fetch pages ahead of
  time (see
  http://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/Res...).
  Then, while Rx computation threads spend time on mapping a Row to
  a DataPoint<T>, the C* driver could load more Rows.
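The prefetching idea can be sketched in plain Java as follows. This is only an
illustration of overlapping loading with mapping, not driver code: it does not
use the C* driver or RxJava, and the mapAll method and the fetchPage function
are hypothetical stand-ins (fetchPage is assumed to return the rows of page n,
or an empty list once the data is exhausted).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.function.IntFunction;

final class PrefetchSketch {

    /**
     * Maps every row of a paged source, requesting page n+1 in the background
     * before mapping the rows of page n. Row order is preserved.
     */
    static <R, T> List<T> mapAll(IntFunction<List<R>> fetchPage, Function<R, T> mapper) {
        List<T> mapped = new ArrayList<>();
        int page = 0;
        CompletableFuture<List<R>> pending = CompletableFuture.supplyAsync(() -> fetchPage.apply(0));
        while (true) {
            List<R> rows = pending.join(); // wait for the page requested earlier
            if (rows.isEmpty()) {
                break; // an empty page marks the end of the data
            }
            final int next = ++page;
            // Start loading the next page now, so that it overlaps with the mapping below
            pending = CompletableFuture.supplyAsync(() -> fetchPage.apply(next));
            for (R row : rows) {
                mapped.add(mapper.apply(row));
            }
        }
        return mapped;
    }
}
```

With the real driver, the background request would presumably be a call to
ResultSet#fetchMoreResults issued before the current page is fully consumed.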
Attached are:
- a patch of the changes implemented for testing
- some of the results
Regards,
Thomas