How To Fix Looping PVC?

Issue Description: How to fix a looping PVC?

Symptoms: This issue does not always present itself as a PVC issue. These are some of the symptoms that can indicate it (quick command-line checks follow the list):

  1. A container is not starting, and on inspection it is waiting for a PVC to be mounted.
  2. In Longhorn monitoring, the PVC goes from Attached to Detached.
  3. In the instance manager logs, the following message is seen: level=error msg="Error reading from wire: EOF"
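
A quick way to check for the first two symptoms from the command line is sketched below; <namespace> and <pod name> are placeholders for the affected workload, and the volume listing assumes a standard Longhorn installation:

  • kubectl get pods -A | grep -Ev "Running|Completed"   # pods that are stuck or restarting
  • kubectl -n <namespace> describe pod <pod name>   # the Events section will show the pod waiting on the volume mount
  • kubectl -n longhorn-system get volumes.longhorn.io   # shows the attach/detach state of each Longhorn volume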

Root Cause: This is most common in a single-node deployment. It can happen in an HA cluster but is less common. The underlying cause can be a software fault, a disk failure, or a machine failure.

UiPath is still investigating some of the causes, but we want to stress that if monitoring is enabled, an issue like this can be caught quickly and it is more likely that the root cause can be found. Without monitoring it becomes more difficult to find the root cause.

With that in mind, the issue happens when a metadata file is corrupted.

Diagnosing

  1. To see if the issue is happening, check the instance manager logs:
    1. kubectl -n longhorn-system logs -f -l longhorn.io/component=instance-manager | grep EOF
    2. Here is an example of the error:
      • 2023-02-19T04:31:24.266329853-05:00 stderr F time="2023-02-19T09:31:24Z" level=warning msg="pvc-b67ad316-4506-447e-badc-452cf95b870e-e-51aa8427: time=\"2023-02-19T04:15:27Z\" level=error msg=\"Error reading from wire: EOF\""
    3. Make a note of the PVC name (a one-liner to extract it is shown below).
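
If preferred, the PVC name can be pulled out of the same logs directly. This is a convenience one-liner (assuming GNU grep on the machine running kubectl; it extracts the pvc-<uuid> portion of the volume name, and --tail can be raised if more history is needed):

  • kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail=1000 | grep "Error reading from wire: EOF" | grep -oE "pvc-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}" | sort -u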
  2. Once it has been confirmed that the issue is happening, find the node where the error is occurring. (Note: this step is not strictly necessary; if it is hard to identify the failing nodes, each node can be checked instead.) A kubectl-based alternative to the UI navigation below is sketched after these sub-steps.
    1. Go to the monitoring URL. (monitoring.<Fully Qualified Domain Name of Automation Suite>)
    2. From the home page, click the Menu icon in the top left and select 'local' under Explore Cluster.
[Screenshot: monitoring home page showing the Menu icon and the 'local' cluster entry]
    3. On the next page, go to Longhorn -> Longhorn.
[Screenshot: Longhorn -> Longhorn navigation]
    4. On that page, go to 'Volume'.
[Screenshot: the Longhorn 'Volume' page]
    5. From here, search for the volume in the search field.
[Screenshot: searching for the volume in the search field]
    6. Once the looping PVC is identified, click its name link to view the details (note that the state will not be healthy).
[Screenshot: volume details for the looping PVC showing an unhealthy state]
    7. From this page, try to identify the hosts that the replicas are running on. There will probably be multiple nodes.
      1. Under volume details, there will be an Attached Node & Endpoint.
      2. The replicas will also list the node that they are running on.
      3. Because the PVC keeps attaching and detaching, the node name may only appear for a few seconds.
      4. Here is an example of the details to look for (node names have been highlighted; in this case the node name is autosuite).
[Screenshot: volume details with the attached node and replica nodes highlighted; the node name here is autosuite]
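
If the UI state is hard to catch because the volume keeps attaching and detaching, the same node information can usually be read from the Longhorn custom resources with kubectl. A sketch, assuming a standard Longhorn installation; <pvc name> is the volume name noted earlier:

  • kubectl -n longhorn-system get volumes.longhorn.io <pvc name> -o wide   # node the volume is (or was last) attached to
  • kubectl -n longhorn-system get replicas.longhorn.io | grep <pvc name>   # replicas for the volume and the nodes they run on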
  3. Log in to the node where the issue is occurring. Either switch to root or use sudo for the following commands.
  4. Locate the PVC directory associated with the failure.
    1. The location of the directory will be under /datadisk/replicas/<pvc name>-<unique identifier>
    2. The following command can be used to change into the PVC directory:
      • cd /datadisk/replicas/<pvc name>-*
      • e.g. cd /datadisk/replicas/pvc-6b120731-6443-4699-a473-f12866a80b51*
      • The asterisk at the end allows changing into the directory without knowing the identifier. There should only be one directory on the node per PVC (a quick check is shown below).
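
Before changing into it, it can help to confirm that exactly one directory matches the pattern, for example:

  • ls -d /datadisk/replicas/<pvc name>-*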
  5. In the directory, run 'ls -lrt'. The screenshot below shows an example of the output and will be used as a reference for the next steps.
[Screenshot: example 'ls -lrt' output from the PVC replica directory]
  6. The results should show a file called volume.meta that has a size of 0.
    • Here is an example output. The item of interest is line number 8; as can be seen, the size is 0. (The size can also be confirmed from the command line, as shown below.)
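
A sketch of the command-line check, assuming GNU coreutils on the node:

  • stat -c '%n %s bytes' volume.meta   # expected to report 0 bytes for this issue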
  7. Once it is confirmed that we are in the correct directory, we need to gather some information to repair the PVC. Make a note of the following items (commands to pull them from the listing are sketched after this list):
    1. The size of the head file. The name will vary, but generally it will be volume-head-00<number>.img. In the example this is line 9. The size is 107374182400.
    2. The name of the volume head (the name of the file on line 9).
    3. The latest snapshot. In this case it is on line 5 and the value is 'volume-snap-uipath-s-8e7570a6-b728-4324-b394-d1f13173ff20.img'.
      • We know it is a snapshot from the name and the size. The directory is listed by timestamp. The names may vary a bit depending on the source of the snapshot; it might not have 'uipath' in the name.
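
The same details can be read from inside the directory with a couple of commands. A sketch; the latest snapshot is taken here as the most recently modified volume-snap file, matching the timestamp ordering described above:

  • ls -l volume-head-*.img   # head file name and its size in bytes
  • ls -t volume-snap-*.img | head -n 1   # most recently modified snapshot file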
  8. Next, make a scratch file using whatever editor is convenient.
  9. In the file, paste the following template:
    • {"Size":SIZE,"Head":"VOLUMEHEAD","Dirty":true,"Rebuilding":false,"Error":"","Parent":"LATESTSNAPSHOT","SectorSize":512,"BackingFilePath":""}
  10. Replace SIZE, VOLUMEHEAD and LATESTSNAPSHOT. In our example the final string would look like:
    • {"Size":107374182400,"Head":"volume-head-002.img","Dirty":true,"Rebuilding":false,"Error":"","Parent":"volume-snap-uipath-s-8e7570a6-b728-4324-b394-d1f13173ff20.img","SectorSize":512,"BackingFilePath":""}
  11. Finally, open the volume.meta file that is of size 0 and paste in the string (a non-interactive alternative is sketched after these sub-steps):
    1. vi volume.meta
    2. Once vi opens, press i.
    3. Paste in the string.
    4. Press the Esc key.
    5. Type ':wq!'
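
If an interactive vi session is inconvenient, the same content can be written non-interactively. A sketch, again assuming the final string is in a scratch file named scratch.json; the command substitution strips the trailing newline so only the JSON string itself is written:

  • printf '%s' "$(cat scratch.json)" > volume.meta
  • cat volume.meta   # confirm the contents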
  12. After this step the PVC should recover. Most likely it is known which pods were affected; they can be restarted (an example is shown below).
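
For example, deleting an affected pod is usually enough, since its controller recreates it (<namespace> and the names are placeholders):

  • kubectl -n <namespace> delete pod <pod name>
  • kubectl -n <namespace> rollout restart deployment/<deployment name>   # alternative for restarting a whole deployment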
  13. If the issue still occurs, please raise a ticket with UiPath support, mention the issue and that this KB was followed, and include the support bundle: Using the Automation Suite Support Bundle tool.