[Hawkular-dev] managing cassandra cluster

John Sanda jsanda at redhat.com
Mon Sep 12 14:26:12 EDT 2016


I’d like to expose something in the REST API.

> On Sep 6, 2016, at 3:43 AM, Thomas Heute <theute at redhat.com> wrote:
> 
> Agreed.
> 
> What user interface do you have in mind here ? CLI ? JMX ? WebUI ?
> 
> 
> 
> On Fri, Sep 2, 2016 at 5:34 PM, John Sanda <jsanda at redhat.com> wrote:
> To date we haven’t really done anything by way of managing/monitoring the Cassandra cluster. We need to monitor Cassandra in order to know things like:
> 
> * When additional nodes are needed
> * When disk space is low
> * When I/O is too slow
> * When more heap space is needed
> 
> Cassandra exposes a lot of metrics. I created HWKMETRICS-448, which briefly talks about collecting metrics from Cassandra. In terms of managing the cluster, I will provide a few concrete examples that have come up recently in OpenShift.
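> 
> To give a feel for what collection could look like, here is a minimal JMX sketch. The node address is a placeholder, and the MBean names are the ones Cassandra registers for its metrics as far as I know; a real collector would be driven by configuration and feed the samples into metrics:
> 
>     import javax.management.MBeanServerConnection;
>     import javax.management.ObjectName;
>     import javax.management.remote.JMXConnector;
>     import javax.management.remote.JMXConnectorFactory;
>     import javax.management.remote.JMXServiceURL;
> 
>     public class CassandraMetricsSample {
> 
>         public static void main(String[] args) throws Exception {
>             // Cassandra listens for JMX on 7199 by default; the host name is hypothetical.
>             JMXServiceURL url = new JMXServiceURL(
>                     "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
>             JMXConnector connector = JMXConnectorFactory.connect(url);
>             try {
>                 MBeanServerConnection mbsc = connector.getMBeanServerConnection();
> 
>                 // Total on-disk load reported by the node, in bytes.
>                 ObjectName load = new ObjectName(
>                         "org.apache.cassandra.metrics:type=Storage,name=Load");
>                 Object loadBytes = mbsc.getAttribute(load, "Count");
> 
>                 // 95th percentile read latency for the data table.
>                 ObjectName readLatency = new ObjectName(
>                         "org.apache.cassandra.metrics:type=Table,keyspace=hawkular_metrics,"
>                         + "scope=data,name=ReadLatency");
>                 Object readP95 = mbsc.getAttribute(readLatency, "95thPercentile");
> 
>                 System.out.println("load=" + loadBytes + " bytes, read p95=" + readP95);
>             } finally {
>                 connector.close();
>             }
>         }
>     }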
> 
> Scenario 1: User deploys additional node(s) to reduce the load on the cluster
> After the new node has bootstrapped and is running, we need to run nodetool cleanup on each node (or invoke it via JMX) in order to remove keys/data that each node no longer owns; otherwise, disk space won’t be freed up. The cleanup operation can potentially be resource intensive since it triggers compactions. Given this, we probably want to run it one node at a time. Right now the user is left to do this manually.
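> 
> As a rough sketch of what automating that could look like (node addresses and the keyspace are placeholders; the operation is the same forceKeyspaceCleanup that nodetool cleanup calls on the StorageService MBean):
> 
>     import java.util.Arrays;
>     import java.util.List;
>     import javax.management.MBeanServerConnection;
>     import javax.management.ObjectName;
>     import javax.management.remote.JMXConnector;
>     import javax.management.remote.JMXConnectorFactory;
>     import javax.management.remote.JMXServiceURL;
> 
>     public class ClusterCleanup {
> 
>         public static void main(String[] args) throws Exception {
>             // Placeholder addresses; in OpenShift we would discover the pods.
>             List<String> nodes = Arrays.asList(
>                     "cassandra-node-1", "cassandra-node-2", "cassandra-node-3");
>             for (String node : nodes) {
>                 // Run serially so only one node at a time pays the compaction cost.
>                 runCleanup(node, "hawkular_metrics");
>             }
>         }
> 
>         private static void runCleanup(String node, String keyspace) throws Exception {
>             JMXServiceURL url = new JMXServiceURL(
>                     "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
>             JMXConnector connector = JMXConnectorFactory.connect(url);
>             try {
>                 MBeanServerConnection mbsc = connector.getMBeanServerConnection();
>                 ObjectName storageService =
>                         new ObjectName("org.apache.cassandra.db:type=StorageService");
>                 // Blocks until the cleanup compactions for the keyspace finish on this node.
>                 mbsc.invoke(storageService, "forceKeyspaceCleanup",
>                         new Object[] {keyspace, new String[0]},
>                         new String[] {String.class.getName(), String[].class.getName()});
>             } finally {
>                 connector.close();
>             }
>         }
>     }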
> 
> Scenario 2: User deploys additional node(s) to get replication and fault tolerance
> I connect to Cassandra directly via cqlsh and update the replication_factor. I then need to run repair on each node, which can be tricky because 1) it is resource intensive, 2) it can take a long time, 3) it is prone to failure, and 4) Cassandra does not give progress indicators.
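> 
> For reference, the first step is just a schema change that could be driven through the Java driver instead of cqlsh (the contact point and replication settings below are placeholders):
> 
>     import com.datastax.driver.core.Cluster;
>     import com.datastax.driver.core.Session;
> 
>     public class IncreaseReplication {
> 
>         public static void main(String[] args) {
>             // Placeholder contact point and replication settings.
>             try (Cluster cluster = Cluster.builder()
>                     .addContactPoint("cassandra-node-1").build();
>                  Session session = cluster.connect()) {
>                 session.execute(
>                         "ALTER KEYSPACE hawkular_metrics WITH replication = " +
>                         "{'class': 'SimpleStrategy', 'replication_factor': 2}");
>                 // The ALTER does not copy existing data to the new replicas;
>                 // repair still has to run on every node afterwards, which is
>                 // the hard part described above.
>             }
>         }
>     }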
> 
> Scenario 3: User sets up regularly scheduled repair to ensure data is consistent across the cluster
> Once replication_factor > 1, repair needs to be run on a regular basis. More specifically, it should be run within gc_grace_seconds, which is configured per table and defaults to 10 days. The data table in metrics already has gc_grace_seconds reduced to 1 day, and we will probably reduce it to zero since the table is append-only. The value for gc_grace_seconds might vary per table based on access patterns, which means the frequency of repair should vary as well.
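> 
> A bare-bones sketch of what a scheduler for this might look like (table names, gc_grace values, and the repair trigger are illustrative only; a real component would need progress tracking and failure handling):
> 
>     import java.util.HashMap;
>     import java.util.Map;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.ScheduledExecutorService;
>     import java.util.concurrent.TimeUnit;
> 
>     public class RepairScheduler {
> 
>         public static void main(String[] args) {
>             // gc_grace_seconds per table (illustrative); repair for a table has to
>             // complete inside this window, so schedule it at half the window.
>             Map<String, Long> gcGraceSeconds = new HashMap<>();
>             gcGraceSeconds.put("data", 86400L);         // reduced to 1 day
>             gcGraceSeconds.put("metrics_idx", 864000L); // default of 10 days
> 
>             ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
>             gcGraceSeconds.forEach((table, gcGrace) -> {
>                 long interval = gcGrace / 2;
>                 scheduler.scheduleAtFixedRate(() -> repair("hawkular_metrics", table),
>                         interval, interval, TimeUnit.SECONDS);
>             });
>         }
> 
>         private static void repair(String keyspace, String table) {
>             // Placeholder: this is where we would kick off repair of the table on
>             // each node in turn (e.g. through the StorageService MBean) and track
>             // progress and failures.
>             System.out.printf("triggering repair of %s.%s%n", keyspace, table);
>         }
>     }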
> 
> 
> There has already been some discussion of these things for Hawkular Metrics in the context of OpenShift, but it applies to all of Hawkular Services as well. Initially I was thinking about building some management components directly in metrics, but it probably makes more sense as a separate, shared component (or components) that can be reused both in standalone metrics in OpenShift and in a full Hawkular Services deployment in MiQ, for example.
> 
> We are already running into these scenarios in OpenShift and probably need to start putting something in place sooner rather than later.
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev
