Agreed.
What user interface do you have in mind here? CLI? JMX? WebUI?
On Fri, Sep 2, 2016 at 5:34 PM, John Sanda <jsanda(a)redhat.com> wrote:
To date we haven’t really done anything by way of managing/monitoring
the
Cassandra cluster. We need to monitor Cassandra in order to know things
like:
* When additional nodes are needed
* When disk space is low
* When I/O is too slow
* When more heap space is needed
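For the record, most of these signals are already visible from nodetool. A rough sketch (the keyspace name and the dry-run wrapper are my assumptions, not anything metrics ships today; set RUN="" to actually execute on a node):

```shell
#!/bin/sh
# Sketch: nodetool commands that surface the signals listed above.
# RUN defaults to a dry run (echo); set RUN="" to run for real.
RUN="${RUN:-echo}"

check_node() {
    $RUN nodetool info       # heap usage, load, uptime
    $RUN nodetool tpstats    # thread-pool backlogs; slow I/O shows up as pending/blocked tasks
    # Per-table disk use and read/write latencies. "hawkular_metrics" is
    # an assumed keyspace name for illustration.
    $RUN nodetool tablestats hawkular_metrics
}

check_node
```

Disk space itself is easiest to watch at the OS level (df on the data directory), which is part of why a separate monitoring component makes sense.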
Cassandra exposes a lot of metrics. I created HWKMETRICS-448. It briefly
talks about collecting metrics from Cassandra. In terms of managing the
cluster, I will provide a few concrete examples that have come up recently
in OpenShift.
Scenario 1: User deploys additional node(s) to reduce the load on cluster
After the new node has bootstrapped and is running, we need to run
nodetool cleanup on each node (or run it via JMX) in order to remove
keys/data that each node no longer owns; otherwise, disk space won’t
be freed up. The cleanup operation can potentially be resource intensive as
it triggers compactions. Given this, we probably want to run it one node at
a time. Right now the user is left to do this manually.
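What the management component would automate is roughly this (node names and the remote-execution wrapper are assumptions; RUN defaults to a dry run, and you would set RUN=ssh, or an OpenShift pod-exec wrapper, to run it for real):

```shell
#!/bin/sh
# Sketch: run 'nodetool cleanup' across the cluster one node at a time,
# since cleanup triggers compactions and can be resource intensive.
NODES="${NODES:-cassandra-1 cassandra-2 cassandra-3}"
RUN="${RUN:-echo}"   # dry run by default; e.g. RUN=ssh to execute

cleanup_all() {
    for node in $NODES; do
        # Wait for each node to finish before starting the next, and
        # stop on the first failure.
        $RUN "$node" nodetool cleanup || return 1
    done
}

cleanup_all
```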
Scenario 2: User deploys additional node(s) to get replication and fault
tolerance
I connect to Cassandra directly via cqlsh and update replication_factor. I
then need to run repair on each node, which can be tricky because 1) it is
resource intensive, 2) it can take a long time, 3) it is prone to failure,
and 4) Cassandra does not give progress indicators.
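The manual steps look roughly like this (keyspace, node names, and SimpleStrategy are illustrative assumptions; RUN defaults to a dry run, and a real deployment would likely use NetworkTopologyStrategy):

```shell
#!/bin/sh
# Sketch: raise replication_factor, then repair node by node.
NODES="${NODES:-cassandra-1 cassandra-2 cassandra-3}"
RUN="${RUN:-echo}"   # dry run by default; e.g. RUN=ssh to execute

# Step 1: bump the replication factor via cqlsh.
$RUN cqlsh -e "ALTER KEYSPACE hawkular_metrics WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"

# Step 2: repair each node sequentially -- repair is resource intensive
# and slow, so don't run it everywhere at once, and stop on failure.
repair_all() {
    for node in $NODES; do
        $RUN "$node" nodetool repair hawkular_metrics || return 1
    done
}

repair_all
```

Automating this, with retries and some notion of progress, is exactly the part that is painful to do by hand today.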
Scenario 3: User sets up regular, scheduled repair to ensure data is
consistent across cluster
Once replication_factor > 1, repair needs to be run on a regular basis.
More specifically it should be run within gc_grace_seconds which is
configured per table and defaults to 10 days. The data table in metrics has
gc_grace_seconds reduced to 1 day, and we will probably reduce it to zero since it
is append-only. The value for gc_grace_seconds might vary per table based
on access patterns, which means the frequency of repair should vary as well.
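For illustration, the two knobs involved look something like the following (the table name, the 1-day value, and the weekly cron schedule are assumptions for the example, not what metrics actually ships):

```shell
# Per-table tombstone GC window; append-only tables tolerate a small value:
cqlsh -e "ALTER TABLE hawkular_metrics.data WITH gc_grace_seconds = 86400;"

# crontab entry: weekly primary-range repair, which stays comfortably
# inside the 10-day default gc_grace_seconds:
# 0 2 * * 0  nodetool repair -pr hawkular_metrics
```

Whatever we build would need to derive the repair schedule from each table's gc_grace_seconds rather than hard-coding one interval.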
There has already been some discussion of these things for Hawkular
Metrics in the context of OpenShift. It applies to all of Hawkular Services
as well. Initially I was thinking about building some management components
directly in metrics, but it probably makes more sense as a separate, shared
component (or components) that can be reused both in stand-alone metrics in
OpenShift and in a full Hawkular Services deployment, in MiQ for example.
We are already running into these scenarios in OpenShift and probably need
to start putting something in place sooner rather than later.
_______________________________________________
hawkular-dev mailing list
hawkular-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hawkular-dev