How To Test Whether Slow Disk Performance Is Causing Issues Within Automation Suite?

How can I determine whether slow disk IOPS is contributing to problems in my Automation Suite cluster by running tests with fio?

Issue Description:

Automation Suite relies on storage backends (such as Rook-Ceph) for pod storage that require a minimum disk performance of 1100 IOPS (Input/Output Operations Per Second). Performance below this threshold can manifest as slow response times or service unavailability. Identifying whether the underlying disk performance is the cause is crucial for troubleshooting. The fio tool can be used to simulate disk I/O and gather the relevant performance metrics.

An example of such an issue is demonstrated in the fio output provided below, where the disk's write performance is tested. The output includes metrics such as throughput, latency, and CPU usage, which are essential for assessing the health of the storage device.


Root Cause: The root cause of slow disk performance in Automation Suite clusters can vary and might be attributed to:

  • Hardware Limitations: The physical disk may not meet the performance requirements of the workload.
    • Minimum required IOPS: 1100
  • Configuration Issues: Misconfiguration in Rook-Ceph or Kubernetes storage settings can lead to suboptimal performance.
  • Overloaded Systems: High demand on the storage system may exceed its capabilities.
    • High CPU utilization is a common indicator.
  • Network Bottlenecks: In some cases, network issues can impact the performance of storage in distributed environments like Kubernetes.
    • This is particularly true for environments that utilize SAN storage or other network storage solutions.


Resolution: The following is a general outline for diagnosing such an issue:

  1. Create an RKE2 support bundle: This should always be the first step. The link to the script can be found here: Automation Suite Support Tools / Script/ General Tools/ rke2logCollector.sh
    1. The bundle includes iostat information and other logs that are vital for assessing a disk issue (see the iostat sketch after this list).
    2. Other tests will re-validate some of this information, but when a support request is raised, this is the information our team needs to validate whether there is a disk problem.
  2. Run fio Test: Use fio to simulate the disk I/O workload that your Automation Suite pods typically experience. Adjust parameters like block size, read/write pattern, and test duration to match your workload. Refer to the example in section 1a below for guidance; an IOPS-focused variation is sketched after section 1c.
    1. Analyze Output: Examine the fio output for key metrics:
      • Throughput: Compare this with your disk’s expected performance.
      • Latency: Check for higher than expected latency.
      • CPU Usage: Ensure CPU usage is not abnormally high during the test.
      • I/O Depth and Disk Stats: Look for signs of excessive queuing or sustained high disk utilization.
  3. Compare with Baselines: If possible, compare the results with baseline metrics or benchmarks for the specific hardware.
  4. Check storage component logs: If fio indicates poor disk performance, examine the logs of storage components such as the Rook-Ceph OSDs for any signs of overload or malfunction (a log-check sketch is provided after this list). Refer to the 'Error' field provided below for a sample depiction.
  5. Review Pod Logs: For pods experiencing issues, check their logs for errors or warnings related to disk I/O.
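
The iostat information referenced in step 1 can also be collected manually for a quick look at disk latency and utilization. Below is a minimal sketch, assuming the sysstat package on a yum-based distribution; the sample interval and count are illustrative only, and column names can vary slightly between sysstat versions.

# Install sysstat if iostat is not already present (assumes a yum-based distribution)
sudo yum install -y sysstat

# Extended (-x) per-device statistics in kilobytes (-k), sampled every 5 seconds, 6 samples
# Watch for high %util, long r_await/w_await latencies, and a persistently large queue size
iostat -xk 5 6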

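For steps 4 and 5, the logs of the storage components and the affected workload pods can be pulled with kubectl. The following is a minimal sketch; the rook-ceph namespace and the app=rook-ceph-osd label are assumptions based on a default Rook-Ceph deployment and may differ in your cluster.

# List the Rook-Ceph OSD pods (namespace and label selector are assumptions for a default install)
kubectl -n rook-ceph get pods -l app=rook-ceph-osd

# Review the recent log output of one OSD pod for slow-request, latency, or flapping warnings
kubectl -n rook-ceph logs <osd-pod-name> --tail=500

# Check an affected workload pod for errors or warnings related to disk I/O
kubectl -n <namespace> logs <pod-name> --tail=500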

1a: Example command

# Create a test directory to use
mkdir -p test-data

# Example fio command using the test directory created above and some standard parameters
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=100m --bs=2300 --name=mytest

1b: Example output
[Screenshot: example fio output]

1c: Example output breakdown

  1. Test Configuration: The example command above sets the test to perform sequential write operations (--rw=write) with a block size of 2300 bytes (--bs=2300). The test was run in the test-data directory with a total test size of 100MB (--size=100m).

  2. Performance Metrics: The output shows various metrics like throughput (BW=55.3KiB/s), latency (clat and lat), and CPU usage. These metrics give you an idea of how fast and efficiently the disk is handling write operations.

  3. Latency Distribution: The output includes detailed percentiles for latency (clat percentiles), which can help understand the consistency and distribution of response times during the test.

  4. Synchronization Times: There are details about fsync/fdatasync latencies, indicating how long it takes to flush written data to stable storage. Because the example command uses --fdatasync=1 (a sync after every write), these latencies are a key indicator of how quickly the disk can commit data.

  5. CPU and I/O Depth: The cpu section shows the CPU usage during the test, and the IO depths section provides insights into how deep the I/O requests were queued.

  6. Disk Stats: At the end, there's a summary of disk stats including the number of I/O operations (ios), time spent on I/O operations (ticks), and overall utilization percentage (util).
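
In addition to the synchronous write test in 1a, the disk can be measured directly against the 1100 IOPS minimum mentioned above with a random read/write test, as suggested in step 2 of the Resolution. The command below is a minimal sketch; the block size, read/write mix, queue depth, file size, and runtime are illustrative assumptions that should be adjusted to match your workload.

# Random read/write IOPS test in the same test directory (parameter values are illustrative only)
fio --name=iops-test --directory=test-data --rw=randrw --rwmixread=70 \
    --bs=4k --size=1g --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based --group_reporting

# In the summary output, compare the reported read and write IOPS against the 1100 IOPS minimum.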