batch vs async inserts

Tuesday, 29 September 2015

Many of you have probably seen warnings in the Hawkular server log like this:

--------------
WARN 15:55:59 Batch of prepared statements for [hawkular_metrics.data,
hawkular_metrics.metrics_idx] is of size 5665, exceeding specified threshold of 5120 by
545.
--------------

This warning is generated when a batch statement is larger than a threshold defined in
cassandra.yaml, which defaults to 5 KB. Note that the threshold is based on the actual size
of the payload, not the number of statements in the batch. We should stop seeing this
warning in the 0.7.0 release of Metrics. See HWKMETRICS-252[1] for details.
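
To make the numbers in the warning concrete, here is a minimal Java sketch of the
arithmetic. The class and method names are made up for illustration, and the constant
simply assumes the 5 KB default mentioned above (I believe the setting is
batch_size_warn_threshold_in_kb in Cassandra 2.x, but check your own cassandra.yaml).
This is not Cassandra's actual code.

```java
class BatchWarnThreshold {
    // Assumed default warn threshold: 5 KB expressed in bytes (5 * 1024 = 5120).
    static final int WARN_THRESHOLD_BYTES = 5 * 1024;

    // Returns how many bytes the batch payload exceeds the threshold by,
    // or 0 if the batch fits. The number of statements plays no part here.
    static int bytesOverThreshold(int batchSizeBytes) {
        return Math.max(0, batchSizeBytes - WARN_THRESHOLD_BYTES);
    }

    public static void main(String[] args) {
        // The batch from the log line above was 5665 bytes.
        System.out.println(bytesOverThreshold(5665)); // prints 545
    }
}
```

A batch with only a handful of statements can still trip the warning if the values are
large, since only the payload size in bytes is compared.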

The general advice in the Cassandra community is to favor parallel async writes over
batch inserts when you are trying to improve write performance. Unlogged batches across
multiple partitions are almost always a bad idea, since the coordinator node must then
relay mutations to the replicas of every partition involved. The one exception is an
unlogged batch in which all of the mutations are for the same partition. In that case,
Cassandra performs the writes atomically. This is how we use batch inserts in Metrics.
Interestingly, I have seen threads on the Cassandra mailing list that still discourage the
use of batch inserts even in this case. This thread[2] provides some really interesting
insights and analysis on unlogged batch inserts vs async inserts. The thread references a
document with some performance analysis that is worth a look.
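
As a rough illustration of the single-partition approach, the sketch below groups data
points by partition key and issues one asynchronous write per group. Everything here is
hypothetical: DataPoint, groupByPartition, writeAsync and writeAll are names I made up,
and writeAsync is only a stand-in for a real driver call. It is not how Hawkular Metrics
is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class PartitionedWrites {
    // Hypothetical data point; partitionKey stands in for something like a metric id.
    static final class DataPoint {
        final String partitionKey;
        final long timestamp;
        final double value;

        DataPoint(String partitionKey, long timestamp, double value) {
            this.partitionKey = partitionKey;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Group points so that each group holds mutations for exactly one partition.
    static Map<String, List<DataPoint>> groupByPartition(List<DataPoint> points) {
        Map<String, List<DataPoint>> groups = new LinkedHashMap<>();
        for (DataPoint p : points) {
            groups.computeIfAbsent(p.partitionKey, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }

    // Hypothetical stand-in for an async driver call. It only reports how many
    // rows it was handed; a real version would execute a single-partition batch.
    static CompletableFuture<Integer> writeAsync(List<DataPoint> singlePartitionBatch) {
        return CompletableFuture.supplyAsync(singlePartitionBatch::size);
    }

    // Start one async write per partition, then wait for all of them to finish.
    static int writeAll(List<DataPoint> points) {
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (List<DataPoint> batch : groupByPartition(points).values()) {
            futures.add(writeAsync(batch));
        }
        int total = 0;
        for (CompletableFuture<Integer> f : futures) {
            total += f.join();
        }
        return total;
    }

    public static void main(String[] args) {
        List<DataPoint> points = Arrays.asList(
                new DataPoint("cpu", 1L, 0.5),
                new DataPoint("mem", 1L, 0.7),
                new DataPoint("cpu", 2L, 0.6));
        System.out.println(groupByPartition(points).size()); // prints 2
        System.out.println(writeAll(points));                // prints 3
    }
}
```

Whether the per-partition group should go out as one unlogged batch or as individual
async inserts is exactly the question the mailing list thread debates, so treat the
grouping step as optional.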

[1] https://issues.jboss.org/browse/HWKMETRICS-252
[2] http://www.mail-archive.com/user%40cassandra.apache.org/msg43976.html
