How to check the remaining storage space in Automation Suite

Minoh_Kang · March 19, 2024, 4:04am

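If you installed Automation Suite with the objectstore on a local disk rather than on an external S3 service, the objectstore PVC can fill up after a long period of use, and once it is full Automation Suite stops working properly. You therefore need to check periodically how much free space is left on the objectstore PVC and reclaim space by deleting objects you no longer need (after backing them up if you might need them later). The output below was captured on Automation Suite 2023.4.2.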

SSH into a server node and run the command below to find the name of the rook-ceph-tools pod. In this example, the pod name appears on the last line of the output.

[root@as-uipath-local ~]# kubectl -n rook-ceph get pod
NAME                                        READY   STATUS    RESTARTS         AGE
rook-ceph-mgr-a-5857cfd8b9-7cn4l            1/1     Running   14 (3h51m ago)   19d
rook-ceph-mon-a-5b77dfdb87-c8hc2            1/1     Running   0                3h52m
rook-ceph-operator-56f4fbb74c-2g9fb         1/1     Running   3 (36h ago)      19d
rook-ceph-osd-0-84c6f7fc66-74zns            1/1     Running   3 (36h ago)      19d
rook-ceph-rgw-rook-ceph-a-596c79d46-frdh8   1/1     Running   7 (3h52m ago)    19d
rook-ceph-tools-5f78f67db6-4hq8k            1/1     Running   3 (36h ago)      19d
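
If you would rather not copy the pod name by hand, the listing above can be filtered with awk. This is only a sketch: the function name tools_pod_name is hypothetical, and it assumes the default kubectl column layout (NAME, READY, STATUS) shown above.

```shell
# Hypothetical helper: reads `kubectl -n rook-ceph get pod` output on stdin
# and prints the name of the first rook-ceph-tools pod in Running state.
tools_pod_name() {
  awk '$1 ~ /^rook-ceph-tools-/ && $3 == "Running" { print $1; exit }'
}

# Usage on the server node (not executed here):
#   kubectl -n rook-ceph get pod | tools_pod_name
```

The function only parses text, so it can be tried against a saved copy of the listing before being used on a live cluster.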

Open an interactive bash session in that pod as follows.

[root@as-uipath-local ~]# kubectl -n rook-ceph exec -it rook-ceph-tools-5f78f67db6-4hq8k -- bash
bash-4.4$
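
For one-off checks an interactive session may not be necessary. The garbage-collection script at the end of this post addresses the toolbox as deploy/rook-ceph-tools, and the same form can wrap any single command. The sketch below assumes that deployment name; the ceph_tools function and the KUBECTL override variable are hypothetical names introduced for illustration.

```shell
# Hypothetical wrapper: runs one command in the rook-ceph-tools deployment
# without opening an interactive session. KUBECTL is an override hook of
# this sketch only (kubectl itself does not read it).
KUBECTL="${KUBECTL:-kubectl}"

ceph_tools() {
  "$KUBECTL" -n rook-ceph exec deploy/rook-ceph-tools -- "$@"
}

# Usage on the server node (not executed here):
#   ceph_tools ceph status
#   ceph_tools ceph df
```

Addressing the deployment rather than a pod means the command keeps working after the pod is recreated with a new name suffix.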

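Run ceph status as shown below to check the cluster state, including its health. In a single-node configuration the health is HEALTH_WARN even when there is plenty of free space, because there is no redundancy. In a multi-node configuration with enough free space, the health is HEALTH_OK.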

bash-4.4$ ceph status
  cluster:
    id:     59bf25cd-f1d1-45a6-9527-3c0a080d84ef
    health: HEALTH_WARN
            8 pool(s) have no replicas configured
            OSD count 1 < osd_pool_default_size 3
 
  services:
    mon: 1 daemons, quorum a (age 3h)
    mgr: a(active, since 3h)
    osd: 1 osds: 1 up (since 3h), 1 in (since 5w)
    rgw: 1 daemon active (rook.ceph.a)
 
  task status:
 
  data:
    pools:   8 pools, 81 pgs
    objects: 195.21k objects, 109 GiB
    usage:   111 GiB used, 516 GiB / 628 GiB avail
    pgs:     81 active+clean
 
  io:
    client:   4.2 KiB/s rd, 8.1 KiB/s wr, 4 op/s rd, 12 op/s wr
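
For a periodic check it can help to pull just the health value out of that output. The following is a minimal sketch that parses the plain-text format shown above; the function name ceph_health_from_status is hypothetical.

```shell
# Hypothetical helper: reads plain-text `ceph status` output on stdin and
# prints the value of the "health:" line (HEALTH_WARN in the example above).
ceph_health_from_status() {
  awk '$1 == "health:" { print $2; exit }'
}

# Usage inside the rook-ceph-tools pod (not executed here):
#   ceph status | ceph_health_from_status
```

On a single-node installation the value is HEALTH_WARN even with ample free space, as explained above, so the health value alone does not say whether the disk is filling up.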

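Run ceph df as shown below to check the remaining space. Among the pools, rook-ceph.rgw.log and rook-ceph.rgw.buckets.data deserve close attention. The longer the system is in use, the more space they occupy, and the %USED column approaches 100. Check regularly and free up space before that happens.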

bash-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
nvme   628 GiB  516 GiB  109 GiB   111 GiB      17.73
TOTAL  628 GiB  516 GiB  109 GiB   111 GiB      17.73
 
--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics          1    1   88 KiB        1   88 KiB      0    422 GiB
rook-ceph.rgw.control          2    8      0 B        8      0 B      0    422 GiB
rook-ceph.rgw.meta             3    8  7.4 KiB       37  128 KiB      0    422 GiB
rook-ceph.rgw.log              4    8  1.2 MiB    1.21k  5.1 MiB      0    422 GiB
rook-ceph.rgw.buckets.index    5    8   90 MiB       99   90 MiB   0.02    422 GiB
rook-ceph.rgw.buckets.non-ec   6    8      0 B        0      0 B      0    422 GiB
.rgw.root                      7    8  4.8 KiB       17   64 KiB      0    422 GiB
rook-ceph.rgw.buckets.data     8   32  109 GiB  193.87k  109 GiB  20.59    422 GiB
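
The %USED check can be automated by filtering the pool table against a threshold. This is a sketch under the assumption that ceph df prints the plain-text layout shown above, where %USED is the third field from the end of each pool row; the function name pools_over_threshold is hypothetical.

```shell
# Hypothetical helper: reads plain-text `ceph df` output on stdin and prints
# "<pool> <%USED>" for every pool whose %USED is at or above the threshold.
# $1 = threshold in percent (for example 80).
pools_over_threshold() {
  awk -v limit="$1" '
    /^--- POOLS ---/         { in_pools = 1; next }  # start of the pool table
    in_pools && $1 == "POOL" { next }                # skip the header row
    # %USED is the third field from the end; MAX AVAIL takes the last two.
    in_pools && NF >= 3 && ($(NF-2) + 0) >= (limit + 0) { print $1, $(NF-2) }
  '
}

# Usage inside the rook-ceph-tools pod (not executed here):
#   ceph df | pools_over_threshold 80
```

With the output above, a threshold of 80 prints nothing, while a threshold of 20 prints only rook-ceph.rgw.buckets.data at 20.59.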

To see the number and total size of the objects in each storage bucket, run the command below. In this output the train-data, ai-storage, and ml-model-files buckets take up the most space.

bash-4.4$ radosgw-admin bucket stats | jq -r '["BucketName","NoOfObjects","SizeInKB"], ["--------------------","------","------"], (.[] | [.bucket, .usage."rgw.main"."num_objects", .usage."rgw.main".size_kb_actual]) | @tsv' | column -ts $'\t'
BucketName                                                     NoOfObjects  SizeInKB
--------------------                                           ------       ------
rook-ceph-bucket-checker-df84b3bf-db2b-4406-8294-dd271cca148b  0            0
ml-model-files                                                 11           10487768
aifabric-staging                                                            
ai-storage                                                     53353        46265432
train-data                                                     119543       55070776
testbucket                                                                  
uipath                                                         8706         2351284
orchestrator-host                                              183          546988
sf-logs                                                        3141         81832
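
Sizes in KB are hard to compare at a glance, so it can be useful to convert them to GiB and sort the buckets by size. The sketch below assumes tab-separated input (the jq output above before it is piped through column) and that 1 KB here means 1024 bytes; the function name bucket_sizes_gib is hypothetical.

```shell
# Hypothetical helper: reads tab-separated "bucket<TAB>objects<TAB>size_kb"
# rows on stdin, skips rows without a numeric size (headers, empty buckets),
# converts the size to GiB (assuming 1 KB = 1024 bytes), and sorts the
# result with the largest bucket first.
bucket_sizes_gib() {
  awk -F '\t' '$3 ~ /^[0-9]+$/ { printf "%s\t%.2f\n", $1, $3 / 1048576 }' |
    sort -t "$(printf '\t')" -k2,2nr
}

# Usage inside the rook-ceph-tools pod (not executed here):
#   radosgw-admin bucket stats \
#     | jq -r '.[] | [.bucket, .usage."rgw.main".num_objects, .usage."rgw.main".size_kb_actual] | @tsv' \
#     | bucket_sizes_gib
```

Applied to the figures above, train-data comes to about 52.52 GiB, ai-storage to 44.12 GiB, and ml-model-files to 10.00 GiB, which together account for most of the 109 GiB reported by ceph df.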

To list the files in a specific bucket, run the command below. The example lists the files in the ml-model-files bucket.

bash-4.4$ radosgw-admin bucket list --bucket=ml-model-files | jq -r '.[] | {name: .name, size: .meta.size, mtime: .meta.mtime, owner: .meta.owner} | [.name, .size, .mtime, .owner] | @tsv' | sed 's/T/ /; s/Z//' | column -t
host/67eaa7bb-7089-46ae-8c6f-f5e0451ced61/a1f8621f-9e46-4f68-9419-699cc69ddfb7/10e6e386-53b2-4df0-90b5-9c60f21f69a1/f52e2309-7e4e-4c08-b6db-ccdffcbd5519/MSDS_regular_half_1.zip  2147774663  2024-02-10  18:07:38.683964  aicenter-service
host/67eaa7bb-7089-46ae-8c6f-f5e0451ced61/a1f8621f-9e46-4f68-9419-699cc69ddfb7/41276e74-4223-4720-b28a-7fa82ebbdb16/829f132a-6bf7-4fe1-9163-87d43ec0c709/MSDS_regular.zip         2147861709  2024-02-11  19:08:58.786164  aicenter-service
host/67eaa7bb-7089-46ae-8c6f-f5e0451ced61/a1f8621f-9e46-4f68-9419-699cc69ddfb7/96c155ea-b137-4b9c-8da2-291e94e21e98/d53c33a9-df4a-42ae-b48c-24efe1f67fed/MSDS_all.zip             2148001931  2024-02-12  23:49:16.839669  aicenter-service
host/67eaa7bb-7089-46ae-8c6f-f5e0451ced61/a1f8621f-9e46-4f68-9419-699cc69ddfb7/96ed2feb-70f3-4577-89f9-57526eba02f1/6d9d90de-4ea2-497b-a40e-4fce52af12ee/MSDS_regular_half_2.zip  2147792030  2024-02-11  06:15:50.971176  aicenter-service
host/67eaa7bb-7089-46ae-8c6f-f5e0451ced61/a1f8621f-9e46-4f68-9419-699cc69ddfb7/c96497ba-b808-4123-9fec-0dfd7bee9ff8/8eac784c-a809-499c-8f60-50fa8c48a07d/MSDS_columns.zip         2147770066  2024-02-11  12:36:16.448029  aicenter-service
wrapper/23.4.1-rc1/dataset_download_wrapper.py                                                                                                                                    1465        2024-02-08  12:59:36.315165  aicenter-service
wrapper/23.4.1-rc1/training_wrapper.py                                                                                                                                            2065        2024-02-08  12:59:36.156425  aicenter-service
wrapper/23.4.1-rc1/uipath_core_cv.tar.gz                                                                                                                                          73554       2024-02-08  12:59:15.330146  aicenter-service
wrapper/23.4.1-rc1/uipath_core_default.tar.gz                                                                                                                                     77971       2024-02-08  12:59:15.316216  aicenter-service
wrapper/23.4.1-rc1/uipath_core_du.tar.gz                                                                                                                                          91707       2024-02-08  12:59:15.234808  aicenter-service
wrapper/23.4.1-rc1/wrapper_template.py                                                                                                                                            2811        2024-02-08  12:59:15.345534  aicenter-service
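
In this listing, five zip files of roughly 2 GiB each account for nearly all of the bucket's 10 GiB. To pick out such objects in a longer listing, the output can be filtered by size. This is a sketch that assumes the column layout above (name, size in bytes, date, time, owner) and object names without whitespace; the function name large_objects is hypothetical.

```shell
# Hypothetical helper: reads the whitespace-aligned object listing on stdin
# (name, size in bytes, date, time, owner) and prints the objects whose size
# is at or above the threshold, with the size converted to GiB.
# $1 = minimum size in bytes. Assumes object names contain no whitespace.
large_objects() {
  awk -v min="$1" '($2 + 0) >= (min + 0) { printf "%s\t%.2f GiB\n", $1, $2 / 1073741824 }'
}

# Usage: append to the listing pipeline above (not executed here), e.g.
#   ... | column -t | large_objects 1073741824    # objects of 1 GiB or more
```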

Deleting an object does not return its space to the pool until the garbage collector has run. To reclaim the space immediately, run the garbage collector as shown below. Run this from the SSH session on the server, not from the bash session inside the pod.

if [[ "$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- radosgw-admin gc list --include-all | jq 'length')" -gt 0 ]]; then
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')] Running gc process"
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- radosgw-admin gc process --include-all
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')] Completed gc process with exit code: '$?'"
else
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')] GC run not required"
fi
