----- Original Message -----
From: "John Sanda" <jsanda(a)redhat.com>
To: "Discussions around Hawkular development"
<hawkular-dev(a)lists.jboss.org>
Sent: Friday, 2 September, 2016 11:34:07 AM
Subject: [Hawkular-dev] managing cassandra cluster
To date we haven’t really done anything by way of managing/monitoring the
Cassandra cluster. We need to monitor Cassandra in order to know things
like:
* When additional nodes are needed
* When disk space is low
* When I/O is too slow
* When more heap space is needed
Cassandra exposes a lot of metrics. I created HWKMETRICS-448, which briefly
discusses collecting metrics from Cassandra. In terms of managing the
cluster, I will provide a few concrete examples that have come up recently
in OpenShift.
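First, though, to give a taste of what is available, here is a rough sketch
that polls a couple of the numbers from the list above (disk load and heap
usage) from a single node over JMX. The host name is a placeholder, and the
Storage/Load metric name assumes the Cassandra 2.x metrics registry:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NodeStats {

    public static void main(String[] args) throws Exception {
        // Placeholder host; 7199 is Cassandra's default JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Live disk space used by the node, in bytes (Cassandra 2.x
            // metrics registry).
            ObjectName load = new ObjectName(
                    "org.apache.cassandra.metrics:type=Storage,name=Load");
            long diskBytes = (Long) mbsc.getAttribute(load, "Count");

            // Heap usage straight from the JVM's platform MXBean.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            long heapUsed = memory.getHeapMemoryUsage().getUsed();
            long heapMax = memory.getHeapMemoryUsage().getMax();

            System.out.printf("disk=%d bytes, heap=%d/%d bytes%n",
                    diskBytes, heapUsed, heapMax);
        }
    }
}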
Scenario 1: User deploys additional node(s) to reduce the load on cluster
After the new node has bootstrapped and is running, we need to run nodetool
cleanup on each node (or invoke the equivalent operation via JMX) in order to
remove keys/data that each node no longer owns; otherwise, disk space won’t be
freed up. The cleanup operation can potentially be resource intensive as it
triggers compactions. Given this, we probably want to run it one node at a
time. Right now the user is left to do this manually.
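A minimal sketch of what automating that could look like, walking the cluster
serially and invoking the same operation that nodetool cleanup calls. The host
names and keyspace are placeholders, and the forceKeyspaceCleanup signature
here matches Cassandra 2.x:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ClusterCleanup {

    // Hypothetical host list; in OpenShift this would come from the
    // Cassandra service/endpoints rather than being hard coded.
    private static final String[] NODES = {"cassandra-1", "cassandra-2", "cassandra-3"};

    public static void main(String[] args) throws Exception {
        for (String node : NODES) {
            // Run cleanup serially, one node at a time, since it triggers
            // compactions and can be resource intensive.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                ObjectName storageService =
                        new ObjectName("org.apache.cassandra.db:type=StorageService");
                // The same operation nodetool cleanup invokes. This call
                // blocks until cleanup completes on the node. The signature
                // varies across Cassandra versions; this matches 2.x
                // (keyspace name plus optional table names).
                mbsc.invoke(storageService, "forceKeyspaceCleanup",
                        new Object[] {"hawkular_metrics", new String[0]},
                        new String[] {String.class.getName(), String[].class.getName()});
                System.out.println("cleanup finished on " + node);
            }
        }
    }
}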
Scenario 2: User deploys additional node(s) to get replication and fault
tolerance
I connect to Cassandra directly via cqlsh and update replication_factor. I
then need to run repair on each node, which can be tricky because 1) it is
resource intensive, 2) it can take a long time, 3) it is prone to failure,
and 4) Cassandra does not give progress indicators.
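Here is a rough sketch of what automating both steps could look like, using
the DataStax Java driver for the schema change and JMX for the repair. The
contact point, node names, keyspace, and replication factor are all
placeholders, and repairAsync assumes Cassandra 2.2 or later:

import java.util.Collections;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class IncreaseReplication {

    public static void main(String[] args) throws Exception {
        // Step 1: bump replication_factor, the same statement we run by
        // hand in cqlsh today. Contact point and RF are placeholders.
        try (Cluster cluster = Cluster.builder().addContactPoint("cassandra-1").build();
             Session session = cluster.connect()) {
            session.execute("ALTER KEYSPACE hawkular_metrics WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': 2}");
        }

        // Step 2: kick off repair on each node, one at a time.
        for (String node : new String[] {"cassandra-1", "cassandra-2", "cassandra-3"}) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
                // repairAsync returns a command id immediately; tracking
                // completion/progress requires subscribing to JMX
                // notifications, which is exactly where this gets tricky.
                int cmd = (Integer) mbsc.invoke(ss, "repairAsync",
                        new Object[] {"hawkular_metrics", Collections.<String, String>emptyMap()},
                        new String[] {String.class.getName(), Map.class.getName()});
                System.out.println("repair command " + cmd + " started on " + node);
            }
        }
    }
}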
Scenario 3: User sets up regularly scheduled repair to ensure data is
consistent across the cluster
Once replication_factor > 1, repair needs to be run on a regular basis. More
specifically, it should be run within gc_grace_seconds, which is configured
per table and defaults to 10 days. The data table in metrics already has
gc_grace_seconds reduced to 1 day, and we could probably reduce it to zero
since the table is append-only. The value for gc_grace_seconds might vary per
table based on access patterns, which means the frequency of repair should
vary as well.
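A bare-bones sketch of a scheduler that derives the repair frequency from each
table's gc_grace_seconds. The table names and values are hard coded here for
illustration; a real version would read them from the schema tables and would
invoke repair over JMX as in the scenario 2 sketch:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RepairScheduler {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void scheduleRepairs(Map<String, Integer> gcGraceByTable) {
        for (Map.Entry<String, Integer> entry : gcGraceByTable.entrySet()) {
            String table = entry.getKey();
            int gcGraceSeconds = entry.getValue();
            // Repair must complete within gc_grace_seconds or deleted data
            // can come back to life. Running at half that interval leaves
            // headroom for repairs that run long or need to be retried.
            long periodSeconds = Math.max(gcGraceSeconds / 2, 1);
            scheduler.scheduleAtFixedRate(() -> repair(table), periodSeconds,
                    periodSeconds, TimeUnit.SECONDS);
        }
    }

    private void repair(String table) {
        // Placeholder: would invoke repair on each node via JMX
        // (StorageService repairAsync) as in the scenario 2 sketch.
        System.out.println("running repair for table " + table);
    }

    public static void main(String[] args) {
        Map<String, Integer> gcGrace = new HashMap<>();
        gcGrace.put("data", 86400);         // 1 day, per the data table above
        gcGrace.put("metrics_idx", 864000); // default 10 days (hypothetical table)
        new RepairScheduler().scheduleRepairs(gcGrace);
    }
}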
There has already been some discussion of these things for Hawkular Metrics
in the context of OpenShift. It applies to all of Hawkular Services as well.
Initially I was thinking about building some management components directly
in metrics, but it probably makes more sense as a separate, shared component
(or components) that can be reused both in stand-alone metrics in OpenShift
and in a full Hawkular Services deployment in MiQ, for example.
On OpenShift, the ideal situation here would be to have the Cassandra instances themselves
expose a metric that we can use to determine when the Cassandra cluster is under too much
load and needs to scale up. The HPA (horizontal pod autoscaler) would then read this metric
and automatically scale the cluster up if needed.
If we determine that cannot be done for whatever reason and that Hawkular Metrics needs to
decide when to scale, there are ways we can do this, but it gets a little trickier. Given
the right permissions, we can go out to the cluster and do things like scale up components,
perform operations on the Cassandra containers directly, etc. Ideally the HPA should be
handling this, but we could work around it if absolutely needed.
We are already running into these scenarios in OpenShift and probably need to
start putting something in place sooner rather than later.
_______________________________________________
hawkular-dev mailing list
hawkular-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hawkular-dev