[Hawkular-dev] Availability revisited

Jay Shaughnessy jshaughn at redhat.com
Fri Jul 24 11:44:11 EDT 2015


_Part 1) Avail Storage_
There are issues no matter which way we go.  Keeping current-avail 
in-memory means having a global, distributed cache that in the worst case 
would have an entry replicated for every resource on every server.  But 
it would allow us to keep only an RLE avail in cassandra while current 
avail could be accessed in memory.

We can avoid a global, distributed cache by keeping avail in cassandra.  
Possibly rooted in the schema suggested by John below. That combines a 
verbose non-RLE table and another table that can provide the current 
avail as well as the most recent avail change. We can then have a 
server-side aggregator to migrate data to RLE. John and I discussed some 
of the challenges and John has some ideas around how the aggregation 
could work.

Either approach will come with some challenges.  With the ispn cache we 
give up some control and would potentially need to migrate to a 
standalone cache in the future.  On the other hand, getting started with 
an embedded cache may save effort over the additional coding needed for 
aggregation to RLE.
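As a rough illustration of what that aggregation to RLE amounts to, here is a Python sketch (not an implementation proposal; the names are made up):

```python
from dataclasses import dataclass

@dataclass
class Run:
    state: str   # e.g. "UP", "DOWN", "UNKNOWN"
    start: int   # epoch millis of the first sample in the run
    end: int     # epoch millis of the last sample seen in this state

def to_rle(samples):
    """Collapse time-ordered (timestamp, state) samples into RLE runs.

    A new run starts only on a state change, so consecutive identical
    reports just extend the current run's end time.
    """
    runs = []
    for ts, state in samples:
        if runs and runs[-1].state == state:
            runs[-1].end = ts
        else:
            runs.append(Run(state, ts, ts))
    return runs
```

One row per report in the verbose table collapses to one run per state change, which is what keeps the RLE table small.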

_Part 2) Avail Sub-State_
We currently have the basic Availability states of Up, Down, and 
Unknown.  We also have the notion of "Degraded" and possibly "Mixed" 
(for groups).   Additionally,  we have ideas around additional modifiers 
like Missing, Disabled, MaintenanceWindow and RestartRequired.  I say 
"modifiers" instead of "sub-states" because maybe multiple modifiers 
could be in effect at a given time.

Going back to Availability, I wonder if maybe we should consider a 
numeric (integer) value for Availability.  This could integrate 
Degraded/Mixed into a single Availability value.  Perhaps -1=Unknown, 
0=Down, [1..9]=Degraded/Mixed, 10=Up.   We don't want too many values 
or the 'avail_state_change' table could end up with too many rows per 
resource.  If we don't want to quantify the value then I think we could 
probably live with Up, Down, Unknown, Mixed, and then have everything 
else be a modifier.  If we want to limit to a single modifier it may 
make sense to have a large set of Avail values, e.g. Up, Up_Degraded, 
Down, Down_Missing, etc...
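To make the numeric encoding concrete, a sketch of the mapping suggested above (the function name is made up):

```python
UNKNOWN, DOWN, UP = -1, 0, 10  # [1..9] is the Degraded/Mixed band

def classify(avail: int) -> str:
    """Map the proposed integer avail value back to a coarse state."""
    if avail == UNKNOWN:
        return "UNKNOWN"
    if avail == DOWN:
        return "DOWN"
    if 1 <= avail <= 9:
        return "DEGRADED"  # for group avail this band could read as MIXED
    if avail == UP:
        return "UP"
    raise ValueError(f"avail value out of range: {avail}")
```

The appeal is that a single column captures Up/Down/Unknown and the degraded band; the cost is deciding how producers quantify values 1..9.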

Otherwise, I guess the Modifiers could be a verbose, comma-separated 
string value, or a compact bit mask.   It doesn't strike me as a field 
that needs to be searched/filtered on at the db level; more likely it 
would be used for client-side filtering.  The question that needs to be 
answered is whether it's important for a historical RLE view, or only 
for the current state.  If looking at the RLE for a resource's avail, 
and you see a Down period last week, do we need to be able to drill 
down and say it was Down and Missing?  Or Up with RestartRequired?  Or 
is that relevant only for the current avail state?  I honestly don't 
know (but if we have a verbose list of avail values it's easy).   My 
inclination is to say that historical RLE is not decorated with 
modifiers.  In that case I think only the 'avail_state_change' table 
would need an additional, non-PK field for the modifiers.
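If we went the compact bit-mask route, it could be as simple as this (Python sketch; the modifier names just mirror the ideas above):

```python
from enum import IntFlag

class AvailModifier(IntFlag):
    NONE = 0
    MISSING = 1
    DISABLED = 2
    MAINTENANCE_WINDOW = 4
    RESTART_REQUIRED = 8

# Multiple modifiers in effect at once combine into one small integer,
# which would fit in a single non-PK column on 'avail_state_change'.
mods = AvailModifier.MISSING | AvailModifier.RESTART_REQUIRED
```

Client-side filtering then becomes a cheap membership test rather than string parsing.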

I'll stop rambling now and see what people have to say...


On 7/22/2015 11:55 PM, John Sanda wrote:
>> On Jul 20, 2015, at 5:13 AM, Heiko W. Rupp <hrupp at redhat.com> wrote:
>>
>> On 16 Jul 2015, at 3:45, John Sanda wrote:
>>> Are there any docs, notes, etc. on the feed-comm system? I am not
>>> familiar with this.
>> Mazz should have /will talk about it. But at the end this has only
>> little
>> to do with this topic, as any sort of availability should work with
>> "normal" REST-based-feeds too.
>>
>>> Why do we need to store the last seen availability in memory?
>> Because we can? :-)
>> Seriously: in RHQ we had huge issues with all the availability records
>> that were never cached, so everything working with "current
>> availability)
>> had to go to the database with its latencies and costs.
> It is premature to say that we need to cache availability while we have only implemented simple read/write operations without any extensive performance/stress testing. I am aware that we had issues with availability in RHQ, but we stored availability in the RDBMS. I think we should wait and see how a Cassandra based implementation looks, do some performance testing, and then make a more informed decision about whether or not caching is needed.
>
>> Availability is by nature something run-length encoded. Unlike
>> counter or gauge metrics.
>> We could of course store each incoming new availability record, so that
>> its (start) timestamp would reflect the last seen time, but querying for
>> "since when was it up" would result in a pretty heavy backtracking query
>> (with some luck we have a limit like "in the last hour/day", but what if
>> we want an absolute date or over the last year.
> The better we understand the queries we need to support, then we can better design the schema to optimize for those queries. Suppose we store availability as follows
>
> CREATE TABLE availability (
>      tenant text,
>      id text,
>      bucket timestamp,
>      time timestamp,
>      value text,    // stored here as text for simplicity
>      PRIMARY KEY ((tenant, id, bucket), time)
> ) WITH CLUSTERING ORDER BY (time DESC);
>
> The bucket column is for date partitioning to ensure partitions do not grow too large. If we do not collect availability more frequently than say every minute, we might be fine with using bucket sizes of 6 hrs, 12 hrs, or even 24 hrs.
>
> We will introduce an additional table to help us answer “since when was it up”,
>
> CREATE TABLE avail_state_change (
>      tenant text,
>      avail_id text,
>      value text,
>      time timestamp,
>      PRIMARY KEY ((tenant, avail_id), value)
> );
>
> The schema will actually allow us to answer the more general question, “the avail state was X since when.” The avail_state_change table will store at most one row for each recorded availability state for a particular resource. When availability is reported, we write to both tables. To answer our query we can execute the following queries in parallel,
>
> SELECT value, time FROM availability WHERE tenant = ? AND id = ? AND bucket = ? LIMIT 1;
>
> SELECT value, time FROM avail_state_change WHERE tenant = ? AND avail_id = ?;
>
> If the value returned from the availability table query is UP, then we compare its timestamp against the timestamp of the DOWN row returned from the avail_state_change query.
>
> We need to play around with schema changes like this to see if it will satisfy query performance requirements. And then if it doesn’t, we should look at tweaking the key cache, and then look at the row cache, and then finally look at a separate caching tier.
>
> We can implement background aggregation jobs to store run-length encoded values that will allow us to efficiently answer similar queries for the more distant past.
>
>> This is why I am thinking about keeping the RLE with start/stop/state,
>> but augmented by "last seen" for the in-memory version.
>>
>> Keeping last seen in memory prevents all the expensive backend-hits
>> (either getting the same value over and over again, or doing in-place
>> updates)
>> and still allows jobs to check if the last-seen is e.g. within the last
>> minute
>> and react accordingly (RHQ-term: "backfill”).
> Again, I think that the schema that I have described above allows us to handle this efficiently.
>
>>> If you talking about correlation, then I am +1. When I think about
>>> RHQ, the user could easily see availability state change, but he would
>>> have to go hunting around to see what precipitated it.
>> This is certainly another aspect of this "root cause analysis".
>>
>>    Heiko
>> _______________________________________________
>> hawkular-dev mailing list
>> hawkular-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/hawkular-dev
