Encountering issues while extracting text from a PDF using the 'Get Text' activity

Hi Team,

I am doing UI automation on PDFs opened in MS Edge and using the Get Text activity to fetch dynamic values, but I am facing some accuracy issues.
For example, the selector gives me only an IDX value, which is static across all the different PDFs.
When I test the scenario with different PDFs, about 9 out of 10 return the correct values, while 1 picks up a different value.
I also checked the selector for each PDF separately, and the IDX in the selector does not change either.

Note: I tried OCR, but its output was hard to work with using Regex, so I am building this PDF reading by opening the file in Edge and fetching values with the UiPath Get Text activity.
I am now stuck on the RCA: why is this happening?


Hey @suraj_singh3,

PDF rendering in Edge can shift elements, causing inconsistent results with Get Text even if the selectors look the same. Avoid relying on IDX: it is a positional index, so the same IDX can resolve to a different element when the document layout changes. Use Read PDF Text, Read PDF with OCR, or Document Understanding for stable extraction across PDFs.
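
To illustrate why an unchanged IDX can still return a different value, here is a toy sketch in Python. It is not UiPath code; the element lists, names, and values are made up and only model the order in which a viewer might expose text elements.

```python
# Toy model (hypothetical data): each list is the order in which a viewer
# exposes text elements for one rendered PDF.
pdf_a = ["Name of employee", "A. Kumar", "Gross salary", "500000"]
# Same form, but one extra wrapped line shifts everything after it.
pdf_b = ["Name of employee", "A. Kumar", "(continued)", "Gross salary", "480000"]


def by_index(elements, idx):
    """Positional lookup, analogous to a selector that only carries an idx."""
    return elements[idx]


def by_label(elements, label):
    """Anchor lookup: return the element that follows a known label."""
    return elements[elements.index(label) + 1]


print(by_index(pdf_a, 3))               # value in the first document
print(by_index(pdf_b, 3))               # same idx now lands on the label
print(by_label(pdf_b, "Gross salary"))  # anchored lookup still finds the value
```

The index itself never changes, which matches what you see in the selector; what changes is the element sitting at that position. Anchoring on a nearby fixed label is usually less sensitive to such shifts.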

@Mir.Jasimuddin I tried OCR, but the page count is high and the data I get from OCR is harder to apply Regex to.
The challenge here is that I can't share the PDF since it is confidential; it is a Form 16 PDF.
Earlier they used DU, but now they need an alternative solution.

Hi @suraj_singh3,

In that case, can you use the Document Understanding framework? For structured PDFs like Form 16, keyword classifiers combined with Regex or ML extractors should give accurate results.

Buddy, as I mentioned, they need to exclude DU; they are looking for an alternative solution because of the cost factor.

@suraj_singh3

  1. Not all DU components consume AI units. The Regex extractor and the Form extractor do not consume AI units. Since Form 16 is structured, you can try the Form extractor and give it the layout.
  2. Generally, UI automation is not a good option for any PDF, so don't rely on it.
  3. If the Form 16 arrives in electronic format as a form-based PDF (that is, the values to be fetched sit in interactive form fields), you can use iText7 to read all the values.
  4. Also, when reading the PDF you might not need OCR if it is an electronic PDF. Read PDF Text has an option to preserve the formatting, which gives more structured output. Regex can then be used to extract the data, since the structure is preserved when reading.
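
As a rough sketch of point 4, here is how Regex could pull values from layout-preserved text. It is written in Python only for illustration; the sample text, the field labels, and the `extract_field` helper are all hypothetical, and real Form 16 output will look different. The same pattern idea should carry over to a Regex step inside a workflow.

```python
import re

# Hypothetical layout-preserved output: label and value stay on one line,
# separated by a run of spaces. Real Form 16 text will differ.
sample_text = """\
Name of the Employee        A. Kumar
PAN of the Employee         ABCDE1234F
Gross Salary                5,00,000.00
"""


def extract_field(text, label):
    """Return the value that follows `label` on the same line, or None."""
    # Label at line start, at least two spaces/tabs, then the value.
    pattern = rf"^{re.escape(label)}[ \t]{{2,}}(.+?)[ \t]*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


for label in ("Name of the Employee", "PAN of the Employee", "Gross Salary"):
    print(label, "->", extract_field(sample_text, label))
```

Matching on the label rather than on a position keeps the lookup stable when other lines are added or removed, as long as the label text itself stays the same.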

cheers

@Anil_G thanks for your insight, I will check and let you know.
