[Hawkular-dev] managing cassandra cluster

Thomas Heute theute at redhat.com
Tue Sep 6 03:43:58 EDT 2016


Agreed.

What user interface do you have in mind here? CLI? JMX? WebUI?



On Fri, Sep 2, 2016 at 5:34 PM, John Sanda <jsanda at redhat.com> wrote:

> To date we haven’t really done anything by way of managing/monitoring the
> Cassandra cluster. We need to monitor Cassandra in order to know things
> like:
>
> * When additional nodes are needed
> * When disk space is low
> * When I/O is too slow
> * When more heap space is needed
>
> Cassandra exposes a lot of metrics. I created HWKMETRICS-448. It briefly
> talks about collecting metrics from Cassandra. In terms of managing the
> cluster, I will provide a few concrete examples that have come up recently
> in OpenShift.
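>
> As a very rough sketch of the kind of checks involved (the data directory,
> threshold, host names, and use of nodetool here are illustrative
> assumptions, not an existing Hawkular component):
>
>     import shutil
>     import subprocess
>
>     DATA_DIR = "/var/lib/cassandra/data"  # hypothetical data volume mount
>     DISK_THRESHOLD = 0.80                 # hypothetical alert threshold
>
>     def disk_space_low():
>         # Flag the node when the data volume is more than 80% full.
>         usage = shutil.disk_usage(DATA_DIR)
>         return usage.used / usage.total > DISK_THRESHOLD
>
>     def node_info(host):
>         # nodetool info reports load, heap usage, uptime, etc. for one node.
>         out = subprocess.run(["nodetool", "-h", host, "info"],
>                              capture_output=True, text=True, check=True)
>         return out.stdout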
>
> Scenario 1: User deploys additional node(s) to reduce the load on the cluster
> After the new node has bootstrapped and is running, we need to run
> nodetool cleanup on each node (or run it via JMX) in order to remove
> keys/data that each node no longer owns; otherwise, disk space won’t
> be freed up. The cleanup operation can potentially be resource intensive as
> it triggers compactions. Given this, we probably want to run it one node at
> a time. Right now the user is left to do this manually.
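>
> A minimal sketch of what automating that serial cleanup could look like
> (the host names are hypothetical, and it assumes nodetool can reach each
> node):
>
>     import subprocess
>
>     NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # hypothetical hosts
>
>     for host in NODES:
>         # Run cleanup serially, one node at a time, since it triggers
>         # compactions and can be resource intensive.
>         subprocess.run(["nodetool", "-h", host, "cleanup"], check=True)
>         print("cleanup finished on %s" % host)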
>
> Scenario 2: User deploys additional node(s) to get replication and fault
> tolerance
> I connect to Cassandra directly via cqlsh and update replication_factor. I
> then need to run repair on each node, which can be tricky because 1) it is
> resource intensive, 2) it can take a long time, 3) it is prone to failure,
> and 4) Cassandra does not give progress indicators.
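>
> For illustration, the manual steps amount to something like the following
> (keyspace name, hosts, and replication settings are hypothetical; assumes
> cqlsh and nodetool are on the PATH):
>
>     import subprocess
>
>     NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # hypothetical hosts
>     KEYSPACE = "hawkular_metrics"                          # hypothetical keyspace
>
>     # Raise the replication factor via cqlsh.
>     alter = ("ALTER KEYSPACE %s WITH replication = "
>              "{'class': 'SimpleStrategy', 'replication_factor': 2};" % KEYSPACE)
>     subprocess.run(["cqlsh", NODES[0], "-e", alter], check=True)
>
>     # Repair each node serially; a real tool would add retries and some
>     # form of progress tracking, since repair is long-running and can fail.
>     for host in NODES:
>         subprocess.run(["nodetool", "-h", host, "repair", KEYSPACE], check=True)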
>
> Scenario 3: User sets up regularly scheduled repair to ensure data is
> consistent across the cluster
> Once replication_factor > 1, repair needs to be run on a regular basis.
> More specifically, it should be run within gc_grace_seconds, which is
> configured per table and defaults to 10 days. The data table in metrics has
> gc_grace_seconds reduced to 1 day, and we could probably reduce it to zero
> since the table is append-only. The value for gc_grace_seconds might vary per
> table based on access patterns, which means the frequency of repair should
> vary as well.
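>
> As a sketch of the scheduling idea, with hypothetical tables and repair
> windows derived from gc_grace_seconds (a real component would persist
> state and handle failures rather than keep an in-memory timestamp map):
>
>     import subprocess
>     import time
>
>     NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # hypothetical hosts
>     KEYSPACE = "hawkular_metrics"                          # hypothetical keyspace
>     # Hypothetical per-table repair windows, in seconds.
>     REPAIR_WINDOWS = {"data": 86400, "metrics_idx": 864000}
>
>     def repair_table(table):
>         # Repair one table on every node, serially.
>         for host in NODES:
>             subprocess.run(["nodetool", "-h", host, "repair", KEYSPACE, table],
>                            check=True)
>
>     # Naive scheduler loop: repair each table once per window.
>     last_run = {table: 0.0 for table in REPAIR_WINDOWS}
>     while True:
>         now = time.time()
>         for table, window in REPAIR_WINDOWS.items():
>             if now - last_run[table] >= window:
>                 repair_table(table)
>                 last_run[table] = now
>         time.sleep(3600)  # re-check hourly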
>
>
> There has already been some discussion of these things for Hawkular
> Metrics in the context of OpenShift. It applies to all of Hawkular Services
> as well. Initially I was thinking about building some management components
> directly in metrics, but it probably makes more sense as a separate, shared
> component (or components) that can be reused both in standalone metrics in
> OpenShift and in a full Hawkular Services deployment in MiQ, for example.
>
> We are already running into these scenarios in OpenShift and probably need
> to start putting something in place sooner rather than later.