vRLI cluster unresponsive as / partition is full on 1 node due to multiple .hints files
Recently we've seen a situation where the root partition was full on a vRLI appliance.
The appliance was part of a 3-node vRLI cluster.
When this issue occurs, the cassandra service on the affected node hangs, and the problem then starts impacting the other nodes in the cluster as well.
cassandra.log shows the service becoming unresponsive because of the space issue on the root partition:
INFO [HANDSHAKE-XXXXXXX] 2020-03-04 10:47:57,384 OutboundTcpConnection.java:560 - Handshaking version with XXXXXXX
INFO [RequestResponseStage-3] 2020-03-04 10:47:57,400 Gossiper.java:1019 - InetAddress /ZZZZZZZ is now UP
INFO [GossipStage:1] 2020-03-04 10:47:58,379 StorageService.java:2292 - Node /ZZZZZZZ state jump to NORMAL
ERROR [HintsWriteExecutor:1] 2020-03-04 10:48:24,194 CassandraDaemon.java:228 - Exception in thread Thread[HintsWriteExecutor:1,5,main]
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
    at org.apache.cassandra.hints.HintsWriteExecutor.flushInternal(HintsWriteExecutor.java:232) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.hints.HintsWriteExecutor.flush(HintsWriteExecutor.java:203) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.hints.HintsWriteExecutor.lambda$flush$1(HintsWriteExecutor.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
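To find out when the disk-full condition first hit a node, the log can be scanned for the same "No space left on device" message. The following is only a minimal sketch: the function name find_disk_full_errors and the header pattern are assumptions modelled on the excerpt above, and the location of cassandra.log is not given here, so the caller has to supply the lines.

```python
import re

# Header of a log entry as seen in the excerpt above, e.g.
# "ERROR [HintsWriteExecutor:1] 2020-03-04 10:48:24,194 CassandraDaemon.java:228 - ..."
# (pattern is an assumption based on that excerpt only)
HEADER = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
)
NEEDLE = "No space left on device"


def find_disk_full_errors(lines):
    """Return (timestamp, thread) of each log entry that mentions NEEDLE.

    The exception text normally sits on the lines following the entry
    header, so the most recent header is remembered while scanning.
    """
    hits = []
    last_header = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            last_header = (match.group("ts"), match.group("thread"))
        if NEEDLE in line and last_header is not None:
            # Report each entry once, even if its stack trace repeats the text.
            if not hits or hits[-1] != last_header:
                hits.append(last_header)
    return hits


if __name__ == "__main__":
    # Two lines taken from the excerpt above, used as a tiny demo input.
    demo = [
        "ERROR [HintsWriteExecutor:1] 2020-03-04 10:48:24,194 CassandraDaemon.java:228"
        " - Exception in thread Thread[HintsWriteExecutor:1,5,main]",
        "org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device",
    ]
    for timestamp, thread in find_disk_full_errors(demo):
        print(timestamp, thread)
```

On a real node the function can be fed an open file object for cassandra.log instead of the demo list.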
The root partition was occupied by a .hprof file, along with multiple .hints and .crc32 files that kept being created in the /usr/lib/loginsight/application/lib/apache-cassandra-*/data/hints directory.
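A quick way to confirm this on a node is to compare the root partition usage with the space taken by the hint files. The sketch below is illustrative only: the directory pattern is the one quoted above, while the function name summarize_hints and the output format are made up for this example.

```python
import glob
import os
import shutil

# Directory pattern quoted in this article; the Cassandra version part varies.
HINTS_GLOB = "/usr/lib/loginsight/application/lib/apache-cassandra-*/data/hints"


def summarize_hints(hints_dir):
    """Return (file_count, total_bytes) for .hints and .crc32 files in hints_dir."""
    count = 0
    total = 0
    for entry in os.scandir(hints_dir):
        if entry.is_file() and entry.name.endswith((".hints", ".crc32")):
            count += 1
            total += entry.stat().st_size
    return count, total


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"/ is {usage.used / usage.total:.0%} used, "
          f"{usage.free / 2**20:.0f} MiB free")
    # On a machine without this directory the loop simply prints nothing.
    for hints_dir in glob.glob(HINTS_GLOB):
        count, total = summarize_hints(hints_dir)
        print(f"{hints_dir}: {count} hint files, {total / 2**20:.1f} MiB")
```

A large file count here, together with a nearly full root partition, matches the situation described in this article.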
Background on hints
Hints are one of three ways to support consistency in the system. When a replica node is not available, the coordinator stores the mutations meant for it in temporary hint files and replays them once the replica is available again.
For details, see https://cassandra.apache.org/doc/latest/operating/hints.html
In all vRLI deployments, hints are configured to be deleted after the default 3 hours. In some environments, however, this does not happen and the hint files seem to stay on disk forever.
Repair, which runs automatically, is an additional way to support consistency in the system.
Manual deletion of the hint files is the solution in this situation.
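The manual cleanup could be scripted roughly as follows. This is a hedged sketch, not a vendor-approved procedure: the helper names stale_hint_files and delete_files are hypothetical, the 3-hour age cut-off simply reuses the expiry mentioned above, and this article does not say whether the cassandra service should be stopped first, so check that before deleting anything. Deleting hint files discards writes that were never delivered to the replica, which leaves repair to reconcile the data. The script only lists files unless dry_run is turned off.

```python
import os
import time

# Assumption: treat hint files older than the 3-hour expiry as stale.
MAX_AGE_SECONDS = 3 * 60 * 60


def stale_hint_files(hints_dir, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return sorted paths of .hints/.crc32 files older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(hints_dir):
        if not entry.is_file():
            continue
        if not entry.name.endswith((".hints", ".crc32")):
            continue
        if now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.path)
    return sorted(stale)


def delete_files(paths, dry_run=True):
    """Delete the given files and return how many were actually removed.

    With dry_run=True (the default) nothing is removed; the paths are
    only printed so they can be reviewed first.
    """
    removed = 0
    for path in paths:
        if dry_run:
            print("would delete", path)
        else:
            os.remove(path)
            removed += 1
            print("deleted", path)
    return removed
```

A cautious run calls delete_files(stale_hint_files(hints_dir)) first, reviews the printed list, and only then repeats the call with dry_run=False, where hints_dir is the hints directory on the affected node.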
This is a bug and will be addressed in an upcoming release of vRLI.