On Jul 20, 2015, at 5:13 AM, Heiko W. Rupp <hrupp(a)redhat.com>
wrote:
On 16 Jul 2015, at 3:45, John Sanda wrote:
> Are there any docs, notes, etc. on the feed-comm system? I am not
> familiar with this.
Mazz should have talked / will talk about it. But in the end this has
little to do with this topic, as any sort of availability should also
work with "normal" REST-based feeds.
>
> Why do we need to store the last seen availability in memory?
Because we can? :-)
Seriously: in RHQ we had huge issues with all the availability records
that were never cached, so everything working with "current
availability" had to go to the database, with its latencies and costs.
It is premature to say that we need to cache availability while we have only implemented
simple read/write operations without any extensive performance/stress testing. I am aware
that we had issues with availability in RHQ, but we stored availability in the RDBMS. I
think we should wait and see how a Cassandra based implementation looks, do some
performance testing, and then make a more informed decision about whether or not caching
is needed.
Availability is by nature something run-length encoded. Unlike
counter or gauge metrics.
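To illustrate the point about run-length encoding, here is a minimal sketch (the per-minute report stream is made up): a long stream of availability reports collapses to a handful of runs, which is not true for a constantly varying gauge.

```python
from itertools import groupby

# Hypothetical stream of per-minute availability reports.
reports = ["UP", "UP", "UP", "DOWN", "DOWN", "UP", "UP", "UP", "UP"]

# Run-length encode: each uninterrupted run collapses to (state, length).
runs = [(state, len(list(group))) for state, group in groupby(reports)]
print(runs)  # [('UP', 3), ('DOWN', 2), ('UP', 4)]
```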
We could of course store each incoming new availability record, so that
its (start) timestamp would reflect the last seen time, but querying for
"since when was it up" would result in a pretty heavy backtracking query
(with some luck we have a limit like "in the last hour/day", but what if
we want an absolute date, or to go back over the last year?).
The better we understand the queries we need to support, the better we can design the
schema to optimize for those queries. Suppose we store availability as follows:
CREATE TABLE availability (
    tenant text,
    id text,
    bucket timestamp,
    time timestamp,
    value text, // stored here as text for simplicity
    PRIMARY KEY ((tenant, id, bucket), time)
) WITH CLUSTERING ORDER BY (time DESC);
The bucket column is for date partitioning to ensure partitions do not grow too large. If
we do not collect availability more frequently than say every minute, we might be fine
with using bucket sizes of 6 hrs, 12 hrs, or even 24 hrs.
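The date partitioning above amounts to truncating each report's timestamp to the start of its bucket before writing. A minimal sketch of that, assuming a 12-hour bucket size (6 or 24 hours work the same way):

```python
from datetime import datetime, timezone

BUCKET_HOURS = 12  # assumed bucket size

def bucket_for(ts):
    """Truncate a timestamp to the start of its 12-hour bucket."""
    start_hour = (ts.hour // BUCKET_HOURS) * BUCKET_HOURS
    return ts.replace(hour=start_hour, minute=0, second=0, microsecond=0)

t = datetime(2015, 7, 20, 17, 13, tzinfo=timezone.utc)
print(bucket_for(t))  # 2015-07-20 12:00:00+00:00
```

All reports in the same 12-hour window then share one partition key, which bounds partition size by the collection rate.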
We will introduce an additional table to help us answer "since when has it been up":
CREATE TABLE avail_state_change (
    tenant text,
    avail_id text,
    value text,
    time timestamp,
    PRIMARY KEY ((tenant, avail_id), value)
);
The schema will actually allow us to answer the more general question, “the avail state
was X since when.” The avail_state_change table will store at most one row for each
recorded availability state for a particular resource. When availability is reported, we
write to both tables. To answer our query we can execute the following queries in
parallel,
SELECT value, time FROM availability WHERE tenant = ? AND id = ? AND bucket = ? LIMIT 1;

SELECT value, time FROM avail_state_change WHERE tenant = ? AND avail_id = ?;
If the value returned from the availability table query is UP, then we compare its
timestamp against the timestamp of the DOWN row returned from the avail_state_change
query.
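One way to combine the two result sets on the client side, assuming avail_state_change keeps the time of the last transition into each state (a sketch with integer timestamps; up_since is a hypothetical helper, not part of any existing API):

```python
def up_since(state_changes):
    """
    state_changes: {state: last_transition_time} read from the
    avail_state_change table (timestamps as plain ints for brevity).
    Returns the time the resource has been UP since, or None if it
    is not currently UP.
    """
    up_t = state_changes.get("UP")
    down_t = state_changes.get("DOWN")
    if up_t is None:
        return None                     # never reported UP
    if down_t is None or up_t > down_t:
        return up_t                     # UP since its last transition to UP
    return None                         # last transition was to DOWN

print(up_since({"UP": 100, "DOWN": 50}))   # 100
print(up_since({"UP": 100, "DOWN": 150}))  # None
```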
We need to play around with schema changes like this to see if it will satisfy query
performance requirements. And then if it doesn’t, we should look at tweaking the key
cache, and then look at the row cache, and then finally look at a separate caching tier.
We can implement background aggregation jobs to store run-length encoded values that will
allow us to efficiently answer similar queries for the more distant past.
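Such an aggregation job could boil down to something like the following sketch (function name and tuple shapes are made up), which collapses raw rows into start/stop/state intervals:

```python
def rle_aggregate(points):
    """
    points: list of (time, value) in ascending time order, e.g. raw rows
    read by a background job. Returns [(start, end, value)] intervals,
    one per uninterrupted run of the same value.
    """
    intervals = []
    for t, v in points:
        if intervals and intervals[-1][2] == v:
            intervals[-1] = (intervals[-1][0], t, v)  # extend current run
        else:
            intervals.append((t, t, v))               # start a new run
    return intervals

print(rle_aggregate([(1, "UP"), (2, "UP"), (3, "DOWN"), (4, "UP")]))
# [(1, 2, 'UP'), (3, 3, 'DOWN'), (4, 4, 'UP')]
```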
This is why I am thinking about keeping the RLE with start/stop/state,
but augmented by "last seen" for the in-memory version.
Keeping last-seen in memory prevents all the expensive backend hits
(either getting the same value over and over again, or doing in-place
updates) and still allows jobs to check whether the last-seen time is
e.g. within the last minute and to react accordingly (RHQ term:
"backfill").
Again, I think that the schema that I have described above allows us to handle this
efficiently.
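For reference, the in-memory last-seen check Heiko describes could be as small as this sketch (the map shape and function name are assumptions, not an existing API):

```python
def backfill_candidates(last_seen, now, max_age=60):
    """
    last_seen: {avail_id: last_report_time} held in memory.
    Returns ids whose last report is older than max_age seconds; a
    periodic job would mark these UNKNOWN ("backfill") without any
    backend hit for the common case where feeds report on time.
    """
    return [aid for aid, t in last_seen.items() if now - t > max_age]

print(backfill_candidates({"feed-a": 100.0, "feed-b": 10.0}, now=120.0))
# ['feed-b']
```

Whether this map is needed at all is exactly the open question: the avail_state_change schema may make the equivalent read cheap enough.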
> If you are talking about correlation, then I am +1. When I think about
> RHQ, the user could easily see an availability state change, but he would
> have to go hunting around to see what precipitated it.
This is certainly another aspect of this "root cause analysis".
Heiko
_______________________________________________
hawkular-dev mailing list
hawkular-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hawkular-dev