Best activity to extract text from PDF

Hi,
I need to extract text from both scanned invoice PDFs and editable PDFs. Which activity works best for both? Both types of documents are in the same folder. I tried the Microsoft OCR, Tesseract, and OmniPage OCR engines, but they are not working. I am using an Enterprise license. Please suggest how to achieve this.

Note: I don't have an API key, so please suggest a local activity.

Hi @ananthitamilmani

Use the Read PDF With OCR activity with Tesseract or OmniPage as the engine, and check whether the text is extracted as per your requirement. If the text is extracted properly, use regular expressions to pull out the data.

Try changing the DPI and scale values of the OCR engine so that the data is extracted properly.

Regards

@ananthitamilmani,

For digital PDFs, use Extract PDF Text. For scanned PDFs, offline usage is limited to the OCR engines only. Try playing around with their settings.
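
Since both types of PDF sit in the same folder, one way to handle them is to try plain text extraction first and fall back to OCR only when little or no text comes back. Below is a minimal Python sketch of that decision logic, not a UiPath workflow. The names `extract_text`, `read_native`, `read_ocr` and the `MIN_CHARS` threshold are all hypothetical, and the two reader functions are passed in by the caller. In a workflow, the same idea would roughly be an If check on the extracted text before calling the OCR activity.

```python
from typing import Callable, Tuple

# Assumed threshold: below this many characters we treat the PDF as scanned.
# Tune it against your own invoices.
MIN_CHARS = 20


def extract_text(
    pdf_path: str,
    read_native: Callable[[str], str],
    read_ocr: Callable[[str], str],
    min_chars: int = MIN_CHARS,
) -> Tuple[str, str]:
    """Return (text, method), where method is "native" or "ocr".

    read_native and read_ocr are placeholders for whatever text reader
    and OCR reader you have available locally.
    """
    text = read_native(pdf_path) or ""
    # A scanned PDF usually has no text layer, so the native read is empty
    # or contains only whitespace.
    if len(text.strip()) >= min_chars:
        return text, "native"
    return read_ocr(pdf_path) or "", "ocr"
```

Running the OCR engine only on files that need it also keeps the digital PDFs fast and free of OCR misreads.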

Thanks,
Ashok

@ananthitamilmani

  • Improve PDF Quality: If possible, pre-process the PDFs to improve OCR accuracy. This might involve deskewing, scaling, or noise reduction.
  • Refine Regular Expressions: Once you have the extracted text, use regular expressions to filter and extract specific data points from the invoice (e.g., invoice number, amount).
  • Document Understanding (for complex layouts): For complex invoice layouts, consider using UiPath’s Document Understanding functionality (available in Enterprise Edition).
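
To illustrate the regular expression step, here is a small Python sketch that pulls an invoice number and a total amount out of extracted text. The patterns and the `parse_invoice` helper are assumptions about a typical invoice layout ("Invoice No: ..." and "Total: ..."), not something from your documents, so they will need adjusting to the real labels. The same patterns could be tried in a regex Matches step of a workflow.

```python
import re

# Assumed label formats: "Invoice No", "Invoice Number" or "Invoice #".
INVOICE_NO = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
# Assumed label formats: "Total" or "Total Amount", optional currency symbol,
# and an amount with exactly two decimal places.
AMOUNT = re.compile(
    r"Total\s*(?:Amount)?\s*[:\-]?\s*[$€£]?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)


def parse_invoice(text: str) -> dict:
    """Return the first invoice number and total found, or None for each."""
    number = INVOICE_NO.search(text)
    amount = AMOUNT.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        # Strip thousands separators before converting to a number.
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
    }


sample = "Invoice No: INV-1001\nDate: 01/02/2024\nTotal Amount: $1,250.00"
print(parse_invoice(sample))
```

OCR output often contains stray spaces or misread characters, so it helps to keep the patterns tolerant of whitespace as done here.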