[Hawkular-dev] managing cassandra cluster

Matt Wringe mwringe at redhat.com
Tue Sep 6 11:45:09 EDT 2016


----- Original Message -----
> From: "Michael Burman" <miburman at redhat.com>
> To: "Discussions around Hawkular development" <hawkular-dev at lists.jboss.org>
> Sent: Tuesday, 6 September, 2016 11:09:45 AM
> Subject: Re: [Hawkular-dev] managing cassandra cluster
> 
> Hi,
> 
> Well, actually I would say for OpenShift we should try to hit a single
> container strategy, which has both HWKMETRICS & Cassandra deployed. This is
> to ensure that some of the operations we're going to do in the future can
> take advantage of data locality.

Hawkular Metrics and Cassandra are not cheap components to run in a cluster; they take up a lot of resources. If we can run multiple Cassandras for every 1 Hawkular (or vice versa), then we can use far fewer resources than if we always need to scale both up because one has become a bottleneck.

I would really not want to bundle these together if we don't have to.

What operations would require bundling them together exactly?

> So unless we run a separate service inside the Cassandra containers, there's
> no easy single metric to get from Cassandra that provides "needs scaling".

This is what I was expecting: I was not expecting Cassandra itself to provide a nice default metric that tells us, based on a specific value, whether or not to scale. We would have some service running alongside Cassandra, monitoring the node and the cluster, to determine whether scaling is required.
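
To make that a bit more concrete, here is a very rough sketch of the kind of check such a sidecar could run (the threshold, and the idea of reducing everything to a single "needs scaling" answer, are made up for illustration):

    #!/usr/bin/env python
    # Hypothetical sidecar check: ask the local Cassandra node how many
    # compactions are backed up and turn that into a "needs scaling" signal.
    # The threshold is made up; a real check would look at several metrics.
    import subprocess

    PENDING_COMPACTIONS_LIMIT = 100   # assumed threshold

    def pending_compactions():
        # "nodetool compactionstats" prints a line like "pending tasks: 3"
        out = subprocess.check_output(["nodetool", "compactionstats"]).decode()
        for line in out.splitlines():
            if line.lower().startswith("pending tasks"):
                return int(line.split(":")[1].strip())
        return 0

    if __name__ == "__main__":
        needed = pending_compactions() > PENDING_COMPACTIONS_LIMIT
        print("needs scaling: %s" % needed)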

> 
>   - Micke
> 
> ----- Original Message -----
> From: "Matt Wringe" <mwringe at redhat.com>
> To: "Discussions around Hawkular development" <hawkular-dev at lists.jboss.org>
> Sent: Tuesday, September 6, 2016 5:26:07 PM
> Subject: Re: [Hawkular-dev] managing cassandra cluster
> 
> 
> 
> ----- Original Message -----
> > From: "John Sanda" <jsanda at redhat.com>
> > To: "Discussions around Hawkular development"
> > <hawkular-dev at lists.jboss.org>
> > Sent: Friday, 2 September, 2016 11:34:07 AM
> > Subject: [Hawkular-dev] managing cassandra cluster
> > 
> > To date we haven’t really done anything by way of managing/monitoring the
> > Cassandra cluster. We need to monitor Cassandra in order to know things
> > like:
> > 
> > * When additional nodes are needed
> > * When disk space is low
> > * When I/O is too slow
> > * When more heap space is needed
> > 
> > Cassandra exposes a lot of metrics. I created HWKMETRICS-448. It briefly
> > talks about collecting metrics from Cassandra. In terms of managing the
> > cluster, I will provide a few concrete examples that have come up recently
> > in OpenShift.
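> > 
> > As a rough illustration of collecting a couple of those metrics, assuming a
> > Jolokia agent were attached to the Cassandra JVM (the port and MBean names
> > below are examples, not a final list):
> > 
> >     #!/usr/bin/env python
> >     # Illustrative only: read a few Cassandra metrics over HTTP through a
> >     # Jolokia agent. Port, MBean names, and attributes are assumptions.
> >     import json
> >     from urllib.request import urlopen
> > 
> >     JOLOKIA = "http://localhost:8778/jolokia/read/"
> > 
> >     def read_metric(mbean, attribute):
> >         with urlopen(JOLOKIA + mbean + "/" + attribute) as resp:
> >             return json.load(resp)["value"]
> > 
> >     # bytes of data this node currently owns on disk
> >     load = read_metric(
> >         "org.apache.cassandra.metrics:type=Storage,name=Load", "Count")
> >     # compactions waiting to run; a rough signal of write/IO pressure
> >     pending = read_metric(
> >         "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks",
> >         "Value")
> > 
> >     print("load=%s bytes, pending compactions=%s" % (load, pending))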
> > 
> > Scenario 1: User deploys additional node(s) to reduce the load on cluster
> > After the new node has bootstrapped and is running, we need to run nodetool
> > cleanup on each node (or run it via JMX) in order to remove keys/data that
> > each node no longer owns; otherwise, disk space won’t be freed up. The
> > cleanup operation can potentially be resource intensive as it triggers
> > compactions. Given this, we probably want to run it one node at a time.
> > Right now the user is left to do this manually.
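> > 
> > A rough sketch of automating that, running cleanup serially (the node list
> > and keyspace name are hypothetical; in OpenShift we would discover the
> > Cassandra pods instead of hard-coding them):
> > 
> >     #!/usr/bin/env python
> >     # Sketch: run "nodetool cleanup" one node at a time after a new node
> >     # has bootstrapped. Host list and keyspace name are assumptions.
> >     import subprocess
> > 
> >     NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # hypothetical
> >     KEYSPACE = "hawkular_metrics"                          # assumed name
> > 
> >     for node in NODES:
> >         print("running cleanup on %s ..." % node)
> >         # -h points nodetool at that node's JMX endpoint
> >         subprocess.check_call(["nodetool", "-h", node, "cleanup", KEYSPACE])
> >         print("cleanup finished on %s" % node)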
> > 
> > Scenario 2: User deploys additional node(s) to get replication and fault
> > tolerance
> > I connect to Cassandra directly via cqlsh and update replication_factor. I
> > then need to run repair on each node, which can be tricky because 1) it is
> > resource intensive, 2) it can take a long time, 3) it is prone to failure,
> > and 4) Cassandra does not give progress indicators.
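> > 
> > Something like the following could capture the mechanical part of that
> > (keyspace name, replication strategy, and node list are all assumptions;
> > the retry/progress handling that makes repair tricky is left out):
> > 
> >     #!/usr/bin/env python
> >     # Sketch: bump replication_factor via cqlsh, then run repair one node
> >     # at a time. Names and strategy below are assumptions.
> >     import subprocess
> > 
> >     NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # hypothetical
> >     ALTER = ("ALTER KEYSPACE hawkular_metrics WITH replication = "
> >              "{'class': 'SimpleStrategy', 'replication_factor': 2};")
> > 
> >     # apply the new replication_factor once, against any node
> >     subprocess.check_call(["cqlsh", NODES[0], "-e", ALTER])
> > 
> >     # repair serially; it is resource intensive and can take a long time,
> >     # so a real version needs retries and progress/failure reporting
> >     for node in NODES:
> >         subprocess.check_call(
> >             ["nodetool", "-h", node, "repair", "hawkular_metrics"])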
> > 
> > Scenario 3: User sets up regularly scheduled repair to ensure data is
> > consistent across the cluster
> > Once replication_factor > 1, repair needs to be run on a regular basis.
> > More
> > specifically, it should be run within gc_grace_seconds, which is configured
> > per table and defaults to 10 days. The data table in metrics has
> > gc_grace_seconds reduced to 1 day, and we will probably reduce it to zero
> > since the table is append-only. The value for gc_grace_seconds might vary per
> > table based on
> > access patterns, which means the frequency of repair should vary as well.
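> > 
> > A minimal sketch of such a schedule, keyed off the smallest gc_grace_seconds
> > (interval, node list, and the simple loop are assumptions; a real version
> > would be a proper scheduled job with failure handling):
> > 
> >     #!/usr/bin/env python
> >     # Sketch: run repair on each node on a schedule that stays inside the
> >     # smallest gc_grace_seconds. Interval and node list are assumptions.
> >     import subprocess
> >     import time
> > 
> >     NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # hypothetical
> >     GC_GRACE_SECONDS = 86400               # 1 day, matching the data table
> >     REPAIR_INTERVAL = GC_GRACE_SECONDS // 2  # headroom for slow/failed runs
> > 
> >     while True:
> >         for node in NODES:
> >             # -pr repairs only the ranges this node is primary for, so
> >             # running it on every node covers the ring without extra work
> >             subprocess.check_call(["nodetool", "-h", node, "repair", "-pr"])
> >         time.sleep(REPAIR_INTERVAL)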
> > 
> > 
> > There has already been some discussion of these things for Hawkular Metrics
> > in the context of OpenShift. It applies to all of Hawkular Services as
> > well.
> > Initially I was thinking about building some management components directly
> > in metrics, but it probably makes more sense as a separate, shared
> > component
> > (or components) that can be reused by both standalone metrics in OpenShift
> > and a full Hawkular Services deployment in MiQ, for example.
> 
> On OpenShift, the ideal situation here would be to have the Cassandra
> instances themselves expose a metric that we can use to determine when the
> Cassandra cluster is under too much load and needs to scale up. The HPA
> would then read this metric and automatically scale the cluster up if
> needed.
> 
> If we determine that cannot be done for whatever reason and that Hawkular
> Metrics needs to determine when to scale or not, there are ways we can do
> this, but it gets a little more tricky. If given the right permissions, we can
> go out to the cluster and do things like scale up components, perform
> operations on the Cassandra containers directly, etc. Ideally the HPA should
> be handling this, but we could get around it if absolutely needed.
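> 
> For example (very hand-wavy, and assuming the pod's service account has been
> granted the needed permissions and that the replication controller is named
> hawkular-cassandra), scaling up could be as simple as:
> 
>     #!/usr/bin/env python
>     # Hand-wavy sketch: scale the Cassandra replication controller from
>     # inside the cluster by shelling out to oc. The RC name is assumed.
>     import subprocess
> 
>     def scale_cassandra(replicas):
>         subprocess.check_call(
>             ["oc", "scale", "rc/hawkular-cassandra",
>              "--replicas=%d" % replicas])
> 
>     scale_cassandra(2)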
> 
> > 
> > We are already running into these scenarios in OpenShift and probably need
> > to
> > start putting something in place sooner rather than later.
> > _______________________________________________
> > hawkular-dev mailing list
> > hawkular-dev at lists.jboss.org
> > https://lists.jboss.org/mailman/listinfo/hawkular-dev
> > 
> 
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev
> 


