<div dir="ltr">Agreed.<div><br></div><div>What user interface do you have in mind here ? CLI ? JMX ? WebUI ?</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Sep 2, 2016 at 5:34 PM, John Sanda <span dir="ltr">&lt;<a href="mailto:jsanda@redhat.com" target="_blank">jsanda@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">To date we haven’t really done anything by way of managing/monitoring the Cassandra cluster. We need to monitor Cassandra in order to know things like:<br>

<br>

* When additional nodes are needed<br>

* When disk space is low<br>

* When I/O is too slow<br>

* When more heap space is needed<br>

<br>

Cassandra exposes a lot of metrics. I created HWKMETRICS-448. It briefly talks about collecting metrics from Cassandra. In terms of managing the cluster, I will provide a few concrete examples that have come up recently in OpenShift.<br>

<br>

Scenario 1: User deploys additional node(s) to reduce the load on cluster<br>

After the new node has bootstrapped and is running, we need to run nodetool cleanup on each node (or run it via JMX) in order to remove keys/data that each each node no longer owns; otherwise, disk space won’t be freed up. The cleanup operation can potentially be resource intensive as it triggers compactions. Given this, we probably want to run it one node at a time. Right now the user is left to do this manually.<br>

<br>

Scenario 2: User deploys additional node(s) to get replication and fault tolerance<br>

I connect to Cassandra directly via cqlsh and update replication_factor. I then need to run repair on each node can be tricky because 1) it is resource intensive, 2) can take a long time, 3) prone to failure, and 4) Cassandra does not give progress indicators.<br>

<br>

Scenario 3: User sets up regularly, scheduled repair to ensure data is consistent across cluster<br>

Once replication_factor &gt; 1, repair needs to be run on a regular basis. More specifically it should be run within gc_grace_seconds which is configured per table and defaults to 10 days. The data table in metrics has reduced gc_grace_seconds to 1 day and probably reduce it to zero since it is append-only. The value for gc_grace_seconds might vary per table based on access patterns, which means the frequency of repair should vary as well.<br>

<br>

<br>

There has already been some discussion of these things for Hawkular Metrics in the context of OpenShift. It applies to all of Hawkular Services as well. Initially I was thinking about building some management components directly in metrics, but it probably makes more sense as a separate, shared component (or components) that can be reused in both stand alone metrics in OpenShift and a full Hawkular Services deployment in MiQ for example.<br>

<br>

We are already running into these scenarios in OpenShift and probably need to start putting something in place sooner rather than later.<br>

______________________________<wbr>_________________<br>

hawkular-dev mailing list<br>

<a href="mailto:hawkular-dev@lists.jboss.org">hawkular-dev@lists.jboss.org</a><br>

<a href="https://lists.jboss.org/mailman/listinfo/hawkular-dev" rel="noreferrer" target="_blank">https://lists.jboss.org/<wbr>mailman/listinfo/hawkular-dev</a><br>

<br>

<br>

</blockquote></div><br></div>