Troubleshooting Cassandra performance issues

Steven Lacerda
5 min read · Sep 9, 2022

Is the problem with reads or writes? Table specific or in general? Node specific? These are the types of questions you should ask yourself.

Write Latency

To troubleshoot write performance, you have to understand what happens during a write and where the problems could manifest:

Walking the write path (a write goes to the commitlog and the memtable, and memtables are later flushed to sstables on disk), we can see that there are a few places where issues can arise:

  1. Could the memtable be too small, causing a lot of flushing?
  2. How big are the flushes? How often are the flushes? What threads are causing the flushing?
  3. Once on disk, we have compactions, so is there a problem with compactions? Disk performance?

To troubleshoot, what I do is grab the diags from my nodes using something like:
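
The exact command for pulling diags isn’t shown here and will vary by install (DataStax diagnostic bundles, in-house scripts, etc.), but as a minimal sketch, assuming SSH access, default /var/log/cassandra log locations, and a nodes.txt file listing your hosts (all assumptions on my part), you can gather the basics yourself:

# Rough diag collection sketch; adjust paths and the host list for your environment
for host in $(cat nodes.txt); do
  mkdir -p "diag/$host"
  ssh "$host" 'nodetool status; nodetool tpstats; nodetool compactionstats; nodetool tablestats' > "diag/$host/nodetool.out"
  scp "$host:/var/log/cassandra/system.log*" "diag/$host/"
done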

Then, once I have that info, I can cd into my diagnostic directory and run something like:

When using the above, you will not have access to sperf or Nibbler, so you won’t get all of the info, but you can run greps -g, which will produce a file with a lot of useful information. Let’s take a look.

Here’s one important grep:

FLUSHES BY THREAD
418 ScheduledTasks
456 COMMIT
2729 RMI TCP Connection
93397 SlabPoolCleaner

From the above, we can see that most of the flushes are from SlabPoolCleaner, which is normal. That’s the memtable filling up, thus causing the flush. Normal operation.

However, one thing of note in the above is the COMMIT flushes. That means the commitlog_total_space_in_mb is too small and it’s forcing a flush. You should never see COMMIT flushes.
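
If you do see COMMIT flushes, give the commitlog more room. As a sketch, in cassandra.yaml (the value below is illustrative, not a sizing recommendation):

commitlog_total_space_in_mb: 8192    # once this cap is hit, Cassandra forces memtable flushes to free old commitlog segments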

Here’s an incomplete list of the flush types, keyed by the thread that triggers them:

  SlabPoolCleaner: the memtable space filled past the cleanup threshold (normal operation).
  COMMIT: the commitlog hit commitlog_total_space_in_mb and forced a flush (you should never see these).
  RMI TCP Connection: a flush requested over JMX, for example nodetool flush.
  ScheduledTasks: periodic flushes, for example a table’s memtable_flush_period_in_ms.

How large are my flushes?

LARGEST 20 FLUSHES
802.935MiB (20%) on-heap, 0.000KiB (0%) off-heap
802.935MiB (20%) on-heap, 0.000KiB (0%) off-heap
799.164MiB (20%) on-heap, 0.000KiB (0%) off-heap

That’s not bad. If you’re seeing 200–300MiB flushes, that’s probably too small. Typically, you want flushes of about 1–2GiB, but that’s not a hard rule. If you’re seeing pending mutations, you’ll want the flush size to be smaller. If you’re seeing pending compactions, you’ll want to increase the flush size to force larger but less frequent compactions.

To increase the flush size, you need to know what’s causing the flushing, and we know from step 1 above that the main thread is SlabPoolCleaner, i.e., the memtable filling up. In that case, we can increase the memtable size or increase the memtable_cleanup_threshold.
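
As a sketch of those knobs in cassandra.yaml (values are illustrative; the defaults depend on your heap size and flush writer count):

memtable_heap_space_in_mb: 2048       # on-heap memtable space; defaults to 1/4 of the heap
memtable_offheap_space_in_mb: 2048    # off-heap memtable space
memtable_cleanup_threshold: 0.33      # higher values let memtables fill further before SlabPoolCleaner flushes, so flushes are larger and less frequent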

Are compactions backing up? Check for StatusLogger messages, which in later Cassandra versions print every 5 minutes (I think that’s the default) or whenever there’s slowness. StatusLogger reports whatever is pending, which is extremely important for troubleshooting.

Pending compactions: slow down the flushing activity by making the flushes larger and less frequent.

Pending writes: check disk io with iostat -xcdt 1 360 | tee "iostat-$(hostname -i)-$(date '+%Y-%m-%d--%T').output". What do the queue sizes (avgqu-sz) look like? Anything above 5 is not good, and sustained above 5 is really not good.
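
To pull those pending numbers quickly, you can grep the logs in the diag bundle or ask a live node directly, for example:

grep "StatusLogger" system.log | less    # StatusLogger dumps per-pool Active/Pending/Blocked counts
nodetool tpstats                         # live pending/blocked counts per thread pool
nodetool compactionstats                 # live pending compaction tasks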

Other possible write issues:

  1. Too many materialized views (MVs) or SASI indexes. You could be creating a lot of index maintenance and/or write amplification.
  2. Huge rows or partitions, which means you’re serializing and deserializing a lot of data (some quick checks follow this list).
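
A couple of quick checks for the huge-row case (keyspace/table names and log paths below are placeholders):

grep "Writing large partition" /var/log/cassandra/system.log    # compaction warns when a partition crosses the large-partition warning threshold
nodetool tablehistograms my_keyspace my_table                   # look at the partition size percentiles for the table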

Read Latency

Again, with reads, we need to understand the read path: a read checks the memtable and then potentially several sstables on disk, merging the results before returning them.

We have the following possible scenarios:

  1. Bad queries
  2. Disk io issues
  3. Write issues bleeding into reads

The key to reads is that they are by partition key. If you have non-partition-key reads, you should expect problems. The partition key hashes to a token that determines which nodes in the cluster own the partition in question. If you do not provide the full partition key, your query becomes a scatter-gather: all nodes will be hit, and that can be devastating for performance.
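
As a quick illustration with a hypothetical table (the schema and queries are mine, not from any real workload):

-- Hypothetical table: user_id is the partition key, event_time is the clustering key
CREATE TABLE app.events (
    user_id    uuid,
    event_time timeuuid,
    payload    text,
    PRIMARY KEY (user_id, event_time)
);

-- Good: full partition key, so the coordinator only touches the replicas that own that token
SELECT * FROM app.events WHERE user_id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;

-- Bad: no partition key, so this becomes a scatter-gather across every node
SELECT * FROM app.events WHERE payload = 'login' ALLOW FILTERING;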

Other bad queries would include batches that are not batched by partition key. If you have a batch that spans partition keys, then don’t batch.

Are you using prepared statements that are constantly being re-prepared? You’ll see something like “X prepared statements discarded” in the system.log. This typically happens when statements are prepared over and over (for example, re-prepared on every request) instead of prepared once and reused. That can hog memory on the cluster and cause unnecessary round trips over the network.

To check for slow queries you can do a few things:

  1. Use nodetool settraceprobability 0.1 to capture a sample of your queries live on the system (see the example after this list).
  2. Look for slow queries in the logs. By default, anything over 500ms is logged in the system.log file.
  3. Configure your driver/application to log anything over a specified threshold.
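
For example, to sample 10% of requests with tracing (item 1 above); tracing adds overhead, so keep the probability low and turn it off when you’re done:

nodetool settraceprobability 0.1    # traced sessions land in the system_traces keyspace

Then, in cqlsh, look at the captured sessions and drill into the slow ones:

SELECT session_id, duration, request, started_at FROM system_traces.sessions LIMIT 20;

nodetool settraceprobability 0      # turn tracing back off afterwards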

We discussed disk io issues under write latency, so I won’t go back into that again.

Write issues bleeding into reads would include general thread constraints, or things like compactions. Compactions are probably one of the most io intensive operations, so if compactions are backing up, then that could affect your reads.

Look at the number of sstables per read in nodetool tablehistograms. Do you see a spike when you’re seeing read latency? If so, you could be using the wrong compaction strategy for the type of io on that table. What could be happening is that your table has a lot of old data that hasn’t been compacted, so if there are a lot of UPDATE or DELETE operations, you could have a lot of disjointed data spread across sstable files. In that case, for temporary relief you can run nodetool compact -s <keyspace> <table>.

Tombstones are always a potential problem, so if you DELETE data a lot, you may need to run manual compactions as per the above command.
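
One way to gauge how bad tombstones are for a given table (paths below are placeholders; sstablemetadata ships in Cassandra’s tools):

for f in /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db; do
  echo "$f"; sstablemetadata "$f" | grep -i "droppable tombstone"
done

The logs also warn with “Read X live rows and Y tombstone cells” when a query crosses the tombstone warning threshold, so grepping system.log for “tombstone cells” is another quick check.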

Other read problems could simply be a lack of resources. Are your queries spiking CPU? Are you memory constrained?

Another thing: don’t always blame the cluster. You have to know when to look at the driver/application nodes or the network versus the cluster. In situations where my tablehistograms (node level) and proxyhistograms (coordinator level) metrics look good and I don’t see any issues in the logs (slow queries, timeouts, etc.), I look to the network and/or application.

Check the network: you can start with something basic like a traceroute to check the times, and then work up to something like mtr or another network diagnostic tool.
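
For example, from an application node to a Cassandra node (hostnames are placeholders):

traceroute cassandra-node-1                         # quick look at per-hop latency
mtr --report --report-cycles 60 cassandra-node-1    # 60 probes per hop, with loss and latency stats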

Lastly, the client:

  1. Is it constrained, lacking resources? High CPU or memory utilization?
  2. Are packets queueing? Are you using .block() or other sync code instead of async code?
  3. Do you have throttling set in the driver’s application.conf parameters?
  4. How many connections do you have for pool.local.size? Try moving to 8 as a start and see how that goes (see the sketch after this list).
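
As a minimal sketch for items 3 and 4, using the DataStax Java driver 4.x application.conf (the option names follow its reference.conf, but the values are illustrative, so verify against your driver version and load test before settling on them):

datastax-java-driver {
  advanced.connection.pool {
    local.size = 8      # connections per node in the local DC, per the suggestion above
    remote.size = 2
  }
  advanced.throttler {
    class = ConcurrencyLimitingRequestThrottler
    max-concurrent-requests = 1024
    max-queue-size = 10000
  }
}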

Troubleshooting read and write issues is an art that is learned over time, but I hope this helps you.
