[Hawkular-dev] managing cassandra cluster

John Sanda jsanda at redhat.com
Mon Sep 12 14:35:50 EDT 2016


> On Sep 6, 2016, at 11:45 AM, Matt Wringe <mwringe at redhat.com> wrote:
> 
> ----- Original Message -----
>> From: "Michael Burman" <miburman at redhat.com <mailto:miburman at redhat.com>>
>> To: "Discussions around Hawkular development" <hawkular-dev at lists.jboss.org <mailto:hawkular-dev at lists.jboss.org>>
>> Sent: Tuesday, 6 September, 2016 11:09:45 AM
>> Subject: Re: [Hawkular-dev] managing cassandra cluster
>> 
>> Hi,
>> 
>> Well, actually I would say for OpenShift we should aim for a single-container
>> strategy, which has both HWKMETRICS & Cassandra deployed. This is
>> to ensure that some of the operations we're going to do in the future can
>> take advantage of data locality.
> 
> Hawkular Metrics and Cassandra are not cheap components to run in a cluster; they take up a lot of resources. If we can run multiple Cassandras for every 1 Hawkular (or vice versa), then we can use far fewer resources than if we always need to scale both up because one has become a bottleneck.
> 
> I would really not want to bundle these together if we don't have to.
> 
> What operations would require bundling them together exactly?
> 
>> So unless we run a separate service inside the Cassandra containers, there's
>> no easy single metric to get from Cassandra that provides "needs scaling".
> 
> This is what I was expecting. I was not expecting Cassandra itself to provide a nice default metric that tells us, based on a specific value, whether or not to scale. We would have some service running alongside Cassandra, monitoring itself and the cluster, to determine whether scaling is required.

Here’s what I had in mind, at least for an initial effort. We package our own metrics reporter [1] with Cassandra. It pushes data to Hawkular Metrics. We provide a management endpoint that will indicate whether or not scaling is necessary based on the metrics we collect.

[1] http://www.datastax.com/dev/blog/pluggable-metrics-reporting-in-cassandra-2-0-2
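
A rough sketch of what that endpoint might look like (purely illustrative; the resource path, metric names, and thresholds are made up, and MetricSource is a hypothetical wrapper around the metrics we collect):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/clusters/cassandra/scaling")
    public class ScalingResource {

        // Hypothetical interface over the Cassandra metrics pushed to
        // Hawkular Metrics by the reporter in [1].
        public interface MetricSource {
            double latest(String metricName);
        }

        private final MetricSource metrics;

        public ScalingResource(MetricSource metrics) {
            this.metrics = metrics;
        }

        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public String check() {
            // Example signals and thresholds only; the real heuristics are TBD.
            double diskUsedPercent = metrics.latest("cassandra.storage.disk-used-percent");
            double pendingCompactions = metrics.latest("cassandra.compaction.pending-tasks");
            boolean scaleUpNeeded = diskUsedPercent > 70.0 || pendingCompactions > 100.0;
            return "{\"scaleUpNeeded\": " + scaleUpNeeded + "}";
        }
    }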
> 
>> 
>>  - Micke
>> 
>> ----- Original Message -----
>> From: "Matt Wringe" <mwringe at redhat.com>
>> To: "Discussions around Hawkular development" <hawkular-dev at lists.jboss.org>
>> Sent: Tuesday, September 6, 2016 5:26:07 PM
>> Subject: Re: [Hawkular-dev] managing cassandra cluster
>> 
>> 
>> 
>> ----- Original Message -----
>>> From: "John Sanda" <jsanda at redhat.com>
>>> To: "Discussions around Hawkular development"
>>> <hawkular-dev at lists.jboss.org>
>>> Sent: Friday, 2 September, 2016 11:34:07 AM
>>> Subject: [Hawkular-dev] managing cassandra cluster
>>> 
>>> To date we haven’t really done anything by way of managing/monitoring the
>>> Cassandra cluster. We need to monitor Cassandra in order to know things
>>> like:
>>> 
>>> * When additional nodes are needed
>>> * When disk space is low
>>> * When I/O is too slow
>>> * When more heap space is needed
>>> 
>>> Cassandra exposes a lot of metrics. I created HWKMETRICS-448. It briefly
>>> talks about collecting metrics from Cassandra. In terms of managing the
>>> cluster, I will provide a few concrete examples that have come up recently
>>> in OpenShift.
>>> 
>>> Scenario 1: User deploys additional node(s) to reduce the load on cluster
>>> After the new node has bootstrapped and is running, we need to run nodetool
>>> cleanup on each node (or run it via JMX) in order to remove keys/data that
>>> each node no longer owns; otherwise, disk space won’t be freed up. The
>>> cleanup operation can potentially be resource intensive as it triggers
>>> compactions. Given this, we probably want to run it one node at a time.
>>> Right now the user is left to do this manually.
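>>> 
>>> A sketch of automating that over JMX (untested; the exact
>>> forceKeyspaceCleanup signature varies across Cassandra versions, and the
>>> hawkular_metrics keyspace name assumes the default):
>>> 
>>>     import javax.management.MBeanServerConnection;
>>>     import javax.management.ObjectName;
>>>     import javax.management.remote.JMXConnector;
>>>     import javax.management.remote.JMXConnectorFactory;
>>>     import javax.management.remote.JMXServiceURL;
>>> 
>>>     public class CleanupRunner {
>>>         public static void main(String[] args) throws Exception {
>>>             // Run cleanup one node at a time since it triggers compactions.
>>>             for (String host : args) {
>>>                 JMXServiceURL url = new JMXServiceURL(
>>>                     "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
>>>                 try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
>>>                     MBeanServerConnection mbs = connector.getMBeanServerConnection();
>>>                     ObjectName storageService =
>>>                         new ObjectName("org.apache.cassandra.db:type=StorageService");
>>>                     // Equivalent to `nodetool cleanup hawkular_metrics`.
>>>                     mbs.invoke(storageService, "forceKeyspaceCleanup",
>>>                         new Object[] { "hawkular_metrics", new String[0] },
>>>                         new String[] { String.class.getName(), String[].class.getName() });
>>>                 }
>>>             }
>>>         }
>>>     }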
>>> 
>>> Scenario 2: User deploys additional node(s) to get replication and fault
>>> tolerance
>>> I connect to Cassandra directly via cqlsh and update replication_factor. I
>>> then need to run repair on each node, which can be tricky because 1) it is
>>> resource intensive, 2) it can take a long time, 3) it is prone to failure,
>>> and 4) Cassandra does not give progress indicators.
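>>> 
>>> The replication change itself is just CQL; for example, with the DataStax
>>> Java driver (the keyspace, strategy, and replication factor shown are only
>>> an example):
>>> 
>>>     import com.datastax.driver.core.Cluster;
>>>     import com.datastax.driver.core.Session;
>>> 
>>>     public class IncreaseReplication {
>>>         public static void main(String[] args) {
>>>             try (Cluster cluster = Cluster.builder()
>>>                     .addContactPoint("127.0.0.1").build();
>>>                  Session session = cluster.connect()) {
>>>                 // The hard part is not this statement but the repair that
>>>                 // must follow it on every node.
>>>                 session.execute(
>>>                     "ALTER KEYSPACE hawkular_metrics WITH replication = " +
>>>                     "{'class': 'SimpleStrategy', 'replication_factor': 2}");
>>>             }
>>>         }
>>>     }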
>>> 
>>> Scenario 3: User sets up regularly scheduled repair to ensure data is
>>> consistent across the cluster
>>> Once replication_factor > 1, repair needs to be run on a regular basis. More
>>> specifically, it should be run within gc_grace_seconds, which is configured
>>> per table and defaults to 10 days. The data table in metrics has reduced
>>> gc_grace_seconds to 1 day, and we could probably reduce it to zero since the
>>> table is append-only. The value for gc_grace_seconds might vary per table
>>> based on access patterns, which means the frequency of repair should vary as
>>> well.
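>>> 
>>> A sketch of the kind of scheduling component I have in mind (hypothetical;
>>> runRepair is a placeholder for driving repair via JMX or nodetool, and the
>>> table names and intervals are examples only):
>>> 
>>>     import java.util.concurrent.Executors;
>>>     import java.util.concurrent.ScheduledExecutorService;
>>>     import java.util.concurrent.TimeUnit;
>>> 
>>>     public class RepairScheduler {
>>> 
>>>         public static void main(String[] args) {
>>>             ScheduledExecutorService scheduler =
>>>                 Executors.newSingleThreadScheduledExecutor();
>>>             // Schedule each table's repair well inside its gc_grace_seconds
>>>             // window; the intervals would really be derived per table.
>>>             scheduler.scheduleWithFixedDelay(
>>>                 () -> runRepair("hawkular_metrics", "metrics_idx"),
>>>                 0, 7, TimeUnit.DAYS);   // gc_grace_seconds = 10 days
>>>             scheduler.scheduleWithFixedDelay(
>>>                 () -> runRepair("hawkular_metrics", "data"),
>>>                 0, 12, TimeUnit.HOURS); // gc_grace_seconds = 1 day
>>>         }
>>> 
>>>         private static void runRepair(String keyspace, String table) {
>>>             // Placeholder: would run repair one node at a time, with
>>>             // retries, since repair is slow and prone to failure.
>>>             System.out.println("repairing " + keyspace + "." + table);
>>>         }
>>>     }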
>>> 
>>> 
>>> There has already been some discussion of these things for Hawkular Metrics
>>> in the context of OpenShift. It applies to all of Hawkular Services as
>>> well.
>>> Initially I was thinking about building some management components directly
>>> in metrics, but it probably makes more sense as a separate, shared
>>> component (or components) that can be reused both in standalone metrics in
>>> OpenShift and in a full Hawkular Services deployment in MiQ, for example.
>> 
>> On OpenShift, the ideal situation here would be to have the Cassandra
>> instances themselves expose a metric that we can use to determine when the
>> Cassandra cluster is under too much load and needs to scale up. The HPA
>> would then read this metric and automatically scale the cluster up if
>> needed.
>> 
>> If we determine that cannot be done for whatever reason and that Hawkular
>> Metrics needs to determine when to scale or not, there are ways we can do
>> this, but it gets a little trickier. If given the right permissions,
>> we can go out to the cluster and do things like scale up components, perform
>> operations on the Cassandra containers directly, etc. Ideally the HPA should
>> be handling this, but we could work around it if absolutely needed.
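>> 
>> For example, given a service account with the right permissions, something
>> like the fabric8 client could perform the scaling (rough sketch, untested;
>> the namespace and replication controller name are made up):
>> 
>>     import io.fabric8.openshift.client.DefaultOpenShiftClient;
>>     import io.fabric8.openshift.client.OpenShiftClient;
>> 
>>     public class CassandraScaler {
>>         public static void main(String[] args) {
>>             try (OpenShiftClient client = new DefaultOpenShiftClient()) {
>>                 // Scale the Cassandra replication controller to 3 replicas.
>>                 client.replicationControllers()
>>                       .inNamespace("hawkular")
>>                       .withName("hawkular-cassandra")
>>                       .scale(3);
>>             }
>>         }
>>     }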
>> 
>>> 
>>> We are already running into these scenarios in OpenShift and probably need
>>> to
>>> start putting something in place sooner rather than later.