[Hawkular-dev] C* schema design

Tue Jun 30 16:42:30 EDT 2015

Before getting to Alerts, maybe we should touch a little on the 
persisted data in the model.  At the moment everything is pretty much 
partitioned on TenantId.   For example:

CREATE TABLE ${keyspace}.triggers (
     tenantId text,
     // other fields...
     id text,
     PRIMARY KEY (tenantId, id)
)

In general the number of Triggers will be relatively small, likely 
maxing out in the thousands, and it could feasibly be very small.  
Queries will usually be on a specific Trigger, but queries on all 
triggers or some cross-section (likely based on the results of a prior 
Tag search) are possible.  All searches will likely be within a tenant, 
so leaving it as is would give us single-partition searches.  Data 
would only be spread across partitions if multiple tenants are in play.  
I'd say this is likely OK as is, or we could try to spread things 
across partitions by manufacturing an additional field, like "idbucket", 
that is a modulo of the id hash.  That could ensure a few partitions 
even if there is only one tenant.

Whether we leave it as is or tweak it with a bucket, I would think the 
trigger-actions, conditions, dampening and tags that hang off a trigger 
should probably have the same partition key as the trigger.  They would 
get the same distribution and should only hit one partition on a query, 
which typically goes through the owning trigger.

Actions are queried by tenantId and actionPlugin.  Again, the low 
cardinality would probably make the current tenantId partition key OK, 
but we could also partition on (tenantId, actionPlugin) and get 
distribution plus single-partition querying.
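
As a sketch only, the composite partition key could look like the 
following.  The "actionId" column name is an assumption; I have not 
checked it against the current table:

CREATE TABLE ${keyspace}.actions (
     tenantId text,
     actionPlugin text,
     actionId text,   // hypothetical name for the action's own id
     // other fields...
     PRIMARY KEY ((tenantId, actionPlugin), actionId)
)

// Both partition key columns are supplied, so one partition is read.
SELECT * FROM ${keyspace}.actions
 WHERE tenantId = ? AND actionPlugin = ?;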

Tags are a little funny because the category field is optional but is 
also something that could be queried.  When searching for Triggers the 
queries are likely (tenantId,name) or (tenantId,name,category).  When 
displaying a Trigger it would be (tenantId,triggerId).  In this case 
it's probably best to store full tags in two ways, one with a clustering 
key of name and one with a clustering key of triggerId.  This is 
basically what we have already: clustering keys of 'name' and 
'triggerId,name'.  We may be able to drop the 'name' segment from that 
last one.  We have a secondary index on category for both; I think we 
may be able to drop the one on the tags table and keep it only on the 
tags_triggers table.
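
To make the "store it two ways" idea concrete, here is a sketch with 
deliberately hypothetical table names (tags_by_name, tags_by_trigger) 
so it is not mistaken for the existing tags / tags_triggers definitions.  
The column list is also an assumption:

CREATE TABLE ${keyspace}.tags_by_name (
     tenantId text,
     name text,
     triggerId text,
     category text,   // optional
     PRIMARY KEY (tenantId, name, triggerId)
)

CREATE TABLE ${keyspace}.tags_by_trigger (
     tenantId text,
     triggerId text,
     name text,
     category text,   // optional
     PRIMARY KEY (tenantId, triggerId, name)
)

// Searching for triggers by tag name
SELECT triggerId FROM ${keyspace}.tags_by_name
 WHERE tenantId = ? AND name = ?;

// Displaying the tags of one trigger
SELECT name, category FROM ${keyspace}.tags_by_trigger
 WHERE tenantId = ? AND triggerId = ?;

Each tag is written to both tables, and both queries stay inside the 
tenant's partition.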

Alerts are of course the big thing.  Although millions of alerts is 
probably unlikely (certainly unwanted), the number is hard to predict.  
But it will be the highest cardinality by far, as single triggers can 
generate many alerts.  The thing about alert querying, I think, is that 
the most recent, unresolved alerts are very likely the most 
interesting.  Once an alert is resolved, and certainly once it is 
resolved and old, its usefulness typically goes down considerably.  At 
that point queries would be more for aggregate reporting, etc.

I don't think in a well-behaved system that a trigger would generate 
thousands of alerts in a day.  It certainly could generate thousands, or 
more, in a lifetime.

Queries for alerts are often across triggers.  Report-style queries 
will likely be in a time range.

For pure distribution John's suggestion of PRIMARY KEY ((tenantid, 
triggerid, date), alertid) may be good, but it would suffer 
multi-partition hits for a time-based query, since a query across 
triggers has to visit one partition per trigger.  We may also want to 
consider PRIMARY KEY ((tenantid, date), triggerId, alertid), with 'date' 
being some length of time after which most alerts become less 
interesting, like maybe a week.  I don't know whether all the alerts 
generated in an entire week would be too much for one partition; maybe 
a day, or 3 days, or something would be safer.
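
Sketching that second option, based on John's table quoted below.  The 
assumption here is that 'date' holds the start of the bucket (for 
example midnight of the day the alert fired), truncated by the client:

CREATE TABLE alerts (
     tenantid text,
     date timestamp,   // assumption: truncated to the start of the bucket
     triggerid text,
     alertid text,
     payload text,
     PRIMARY KEY ((tenantid, date), triggerid, alertid)
)

// All of a tenant's alerts in one bucket, across triggers:
// a single partition.
SELECT * FROM alerts WHERE tenantid = ? AND date = ?;

// One trigger's alerts in that bucket: same partition, narrowed by
// the first clustering column.
SELECT * FROM alerts
 WHERE tenantid = ? AND date = ? AND triggerid = ?;

// A report over several buckets reads one partition per bucket.
SELECT * FROM alerts WHERE tenantid = ? AND date IN (?, ?, ?);

The bucket length then directly controls both the partition size and 
how many partitions a time-range report has to touch.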

We have several secondary tables for search.  These should likely be 
partitioned similarly.

Let's keep discussing, I'm new at this and may have missed the mark...

On 6/18/2015 12:41 PM, Lucas Ponce wrote:
>
>> CREATE TABLE alerts (
>> tenantid text,
>> triggerid text,
>> date timestamp,
>> alertid text,
>> payload text,
>> PRIMARY KEY ((tenantid, triggerid, date), alertid)
>> )
>>
> Hey John,
>
> Thanks for the tip, I had exactly this approach in my todo-list for adding triggerid in the partition key.
> Also the date is interesting.
>
> The main idea is to keep all info related to a specific trigger close together, to avoid stressing the full partition.