Encountering the error 'Error Loading Table Data' while accessing the dataset/pipeline/skill list, or the data does not show up in the user interface.
Issue Description:
If datasets, pipelines, or skills are not displayed in the UI, check the OSD usage in the cluster.
Error: 'Error Loading Table Data'
Command: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
The output should be like the screenshot below:
Check whether the OSD usage has reached 95%. At that level the cluster switches to read-only mode and accepts no further data, which affects functionality.
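The check above can be made less manual with a small helper. The following is a minimal sketch, not part of the product: `classify_osd_usage` is a hypothetical function name, and the thresholds are Ceph's default nearfull (85%), backfillfull (90%), and full (95%) ratios, which may have been changed on a given cluster. It reads "name percent" pairs on standard input and labels each OSD.

```shell
#!/bin/sh
# Hypothetical helper: reads "<osd-name> <percent-used>" pairs on stdin and
# labels each OSD against Ceph's default ratios (assumed, not read from the
# cluster): nearfull 85%, backfillfull 90%, full 95%.
classify_osd_usage() {
    awk '{
        state = "ok"
        if ($2 >= 85) state = "nearfull"
        if ($2 >= 90) state = "backfillfull"
        if ($2 >= 95) state = "full"
        print $1, $2, state
    }'
}

# Possible usage (assumes jq is installed and that the JSON output of
# "ceph osd df" exposes .nodes[].name and .nodes[].utilization):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df -f json \
#     | jq -r '.nodes[] | "\(.name) \(.utilization)"' | classify_osd_usage
```

Any OSD reported as full corresponds to the read-only condition described above.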
Root Cause: Streaming logs from skills can fill up the storage. Log streaming is disabled by default as of AI Center v22.4.3.
Resolution:
Note: In all versions prior to v22.4.3, the GC (garbage collection) cleanup must be run manually and streaming logs must be disabled explicitly.
- To run the GC, run the commands below one after the other, without any pause in between:
- kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set-full-ratio 0.96
- Note: Do not set the OSD full ratio above 0.96. A higher value can make the cluster unrecoverable, so 0.96 is the maximum ratio to use.
- kubectl -n rook-ceph exec deploy/rook-ceph-tools -- timeout 60 radosgw-admin gc process --include-all
- After executing the commands above, allow some time before re-checking the OSD usage, which should decrease gradually. Re-check it every 10-15 minutes. Once it has decreased significantly, set the ratio back to its initial value:
- kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set-full-ratio 0.95
- While monitoring the OSD usage, perform the steps below:
- vi rook-ceph-gc.yaml (# This creates a file named rook-ceph-gc.yaml)
- Paste the content below without making changes:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [client]
    rgw gc max objs = 1024
    rgw gc obj min wait = 5
    rgw gc processor max time = 60
    rgw gc processor period = 60
    rgw gc max concurrent io = 50
    rgw gc max trim chunk = 1024
    [osd]
    osd recovery sleep hdd = 0.0
    osd max backfills = 8
    osd recovery max active hdd = 8
    osd recovery max single start = 8
- Save the file by pressing the 'Esc' key, then typing :wq! and pressing Enter.
- Validate the content of the created file: cat rook-ceph-gc.yaml (# it should match the content shared above)
- Execute: kubectl apply -f rook-ceph-gc.yaml (# kubectl apply applies a configuration file to the Kubernetes cluster)
- Delete the rook-ceph pods:
- kubectl -n rook-ceph get pods (# Note down the names of all pods. In the delete command below, replace <pod-name> with the respective pod name)
- kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=0
- kubectl -n rook-ceph delete pod <pod-name> (# Run this once for each pod noted above)
- kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=1
- After executing the commands above, wait for all the rook-ceph pods to come up.
- Then check the usage: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
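The GC part of the procedure (raise the full ratio, run the GC, later restore the ratio) could be wrapped in shell functions so that the first two commands run back to back. This is only a sketch under stated assumptions: the function names and the `RUN` dry-run variable are hypothetical, while the kubectl commands themselves are the ones listed in the steps above.

```shell
#!/bin/sh
# Hypothetical wrappers around the commands from this article.
# Set RUN=echo to print the commands instead of executing them (dry run).
TOOLS="kubectl -n rook-ceph exec deploy/rook-ceph-tools --"

# Raise the full ratio to 0.96 (never higher), then trigger the GC at once.
# The GC is skipped if raising the ratio fails.
raise_ratio_and_gc() {
    ${RUN:-} $TOOLS ceph osd set-full-ratio 0.96 &&
    ${RUN:-} $TOOLS timeout 60 radosgw-admin gc process --include-all
}

# Show the current OSD usage; re-run every 10-15 minutes while waiting.
show_osd_usage() {
    ${RUN:-} $TOOLS ceph osd df
}

# Restore the initial full ratio once usage has decreased significantly.
restore_full_ratio() {
    ${RUN:-} $TOOLS ceph osd set-full-ratio 0.95
}
```

Running with `RUN=echo` first, for example `RUN=echo; raise_ratio_and_gc`, prints the exact commands for review before anything touches the cluster.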
Note: To confirm that the changes have been applied correctly, access the AI Center UI, perform a few actions such as executing a pipeline or creating a skill, and continue monitoring the OSD usage.
The steps above should bring the cluster back up and make AI Center fully functional and accessible.
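The pod restart sequence from the steps above (scale the operator down, delete the pods, scale the operator back up) can be sketched as a single function. The function name and the `RUN` dry-run variable are hypothetical, and the pod names must be supplied by the caller from the output of `kubectl -n rook-ceph get pods`; none are assumed here.

```shell
#!/bin/sh
# Hypothetical wrapper; usage: restart_rook_pods <pod-name>...
# Set RUN=echo to print the commands instead of executing them (dry run).
restart_rook_pods() {
    if [ "$#" -eq 0 ]; then
        echo "usage: restart_rook_pods <pod-name>..." >&2
        return 1
    fi
    # Stop the operator first, as in the manual steps.
    ${RUN:-} kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=0 || return 1
    # Delete each pod passed on the command line.
    for pod in "$@"; do
        ${RUN:-} kubectl -n rook-ceph delete pod "$pod"
    done
    # Bring the operator back so it recreates the pods.
    ${RUN:-} kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=1
}
```

After a real run, wait for all rook-ceph pods to come up before re-checking the OSD usage.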
To disable the streaming logs, refer to the documentation: Disable Log Streaming On Existing Skills.