[Hawkular-dev] [Metrics] How to react on low disk?

John Sanda jsanda at redhat.com
Sat Apr 2 00:50:40 EDT 2016


For what it's worth, I just saw this for the first time earlier today. If additional capacity is needed, the primary solution should be to deploy additional Cassandra nodes in order to decrease the load per node. Running compactions manually, i.e., major compactions, should be considered a last resort. Cassandra runs compactions automatically; these are referred to as minor compactions. With a major compaction, Cassandra merges all SSTables together. With a minor compaction under the size-tiered compaction strategy, which we use for the data table, Cassandra merges four SSTables of similar size. The thresholds are configurable, but we can stick with the defaults for this discussion. The reason running major compactions is discouraged is that the single large SSTable it produces often prevents minor compactions from running again, due to the way size-tiered compaction groups SSTables by size. That means users would have to set up a job to run compactions manually. Not the end of the world, but something that definitely needs to be taken into consideration.
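To make the size-tiered behavior concrete, here is a minimal Python sketch of the bucketing idea. The thresholds mirror Cassandra's documented defaults (min_threshold=4, bucket_low=0.5, bucket_high=1.5), but the code is only an illustration of the concept, not Cassandra's actual implementation:

    # Illustration only: group SSTables by similar size and pick buckets
    # that are big enough to trigger a minor compaction.
    def bucket_sstables(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
        """Group SSTable sizes (e.g. in MB) into buckets of 'similar' size."""
        buckets = []
        for size in sorted(sstable_sizes):
            for bucket in buckets:
                avg = sum(bucket) / len(bucket)
                if bucket_low * avg <= size <= bucket_high * avg:
                    bucket.append(size)
                    break
            else:
                buckets.append([size])
        return buckets

    def compaction_candidates(sstable_sizes, min_threshold=4):
        """A minor compaction merges the SSTables in any bucket with at
        least min_threshold members."""
        return [b for b in bucket_sstables(sstable_sizes)
                if len(b) >= min_threshold]

    # Four ~100 MB SSTables form a bucket and get merged; the single huge
    # SSTable left behind by a major compaction rarely finds similarly
    # sized peers, so it tends to be left out of minor compactions.
    print(compaction_candidates([100, 110, 95, 105, 2000]))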

Another thing to consider is how Cassandra handles deletes. Deleted data is not immediately purged from disk. There is a grace period, configured per table, which defaults to 10 days; deleted data is only purged after that grace period has passed. So if the data files on disk consist only of non-deleted data and/or deleted data whose grace period has not yet passed, running a major compaction probably won't reclaim much space. With HWKMETRICS-367, we have reduced the grace period from the default of 10 days to one day.
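For reference, the grace period is a per-table setting that can be changed with an ALTER TABLE statement. A minimal sketch using the Python DataStax driver follows; the contact point, keyspace, and table names are assumptions and should be adjusted to the actual schema:

    # Hedged sketch: lower gc_grace_seconds to one day on the data table.
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])                 # contact point (assumption)
    session = cluster.connect('hawkular_metrics')    # keyspace name (assumption)

    # Deleted/expired rows in this table become eligible for purging during
    # compaction once the one-day grace period has passed.
    session.execute("ALTER TABLE data WITH gc_grace_seconds = 86400")

    cluster.shutdown()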

We need documentation to explain all of this, as well as docs that provide some sizing guidelines. I have also floated the idea of a trigger that could either reject writes or send an email notification when disk usage exceeds a threshold, which should fire well before the disk is full. If the disk is close to full and the data set is large enough, running a compaction could itself fail, since compaction requires additional free disk space.
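Something along these lines could back the threshold check (and the df-based script Heiko suggests below). This is only a sketch; the data directory, threshold, and mail settings are placeholders:

    # Hedged sketch of a disk-usage check that could run periodically on
    # each Cassandra node and mail an admin when a threshold is crossed.
    import shutil
    import smtplib
    from email.message import EmailMessage

    DATA_DIR = '/var/lib/cassandra/data'   # Cassandra data directory (assumption)
    THRESHOLD = 0.80                       # alert well before the disk is full

    def disk_usage_ratio(path):
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def notify_admin(ratio):
        msg = EmailMessage()
        msg['Subject'] = 'Cassandra disk usage at {:.0%}'.format(ratio)
        msg['From'] = 'metrics@example.com'      # placeholder address
        msg['To'] = 'admin@example.com'          # placeholder address
        msg.set_content('Disk usage on the Cassandra data volume exceeded '
                        'the configured threshold. Consider adding nodes.')
        with smtplib.SMTP('localhost') as smtp:  # placeholder SMTP host
            smtp.send_message(msg)

    if __name__ == '__main__':
        ratio = disk_usage_ratio(DATA_DIR)
        if ratio > THRESHOLD:
            notify_admin(ratio)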

> On Mar 11, 2016, at 11:55 AM, Heiko W.Rupp <hrupp at redhat.com> wrote:
> 
> Hey,
> 
> <captain_obvious>
> so for Hawkular-metrics (and Hawkular) we store the data in a Cassandra 
> database that puts files on a disk,
> which can get full earlier than expected (and usually on week-ends). And 
> when the disk is full, Metrics does not like it.
> </captain_obvious>
> 
> What can we do in this case?
> 
> I could imagine that on the C* nodes we run a script that uses df to 
> figure out the available space and tries
> to run some compaction if space gets tight.
> 
> Of course that does not solve the issue per se, but should give some air 
> to breathe.
> Right now I fear we are not able to reduce the ttl of a metric/tenant on 
> the fly and have metrics do the right thing - at least if I understood 
> yak correctly.
> That script should possibly also send an email to an admin.
> 
> In case that we run Hawkular-full, we can determine the disk space and 
> feed that into Hawkular for Alerts to pick it up and then have the 
> machinery trigger the compaction and send the email.
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev



