How can we extract data from multiple scanned PDFs with different formats?

We have multiple scanned PDFs in different formats. How can we extract the data? I tried the Get OCR Text activity and was able to extract from one PDF, but I have to extract data from multiple PDFs. Can anyone please help me resolve this? I'm not using Orchestrator, so I can't proceed with Load Taxonomy.

@Chippy_Kolot

  • Since you have different formats, first identify the format, then apply OCR. Each format should have some unique field; use that field to tell the formats apart.
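As a rough illustration of telling formats apart by a unique field, here is a minimal Python sketch (not a UiPath workflow). It assumes you already have some text for each PDF, for example from a quick OCR pass over the first page, and that each layout contains a phrase the others do not. The names `FORMAT_MARKERS` and `detect_format` and the marker phrases are hypothetical placeholders.

```python
# Hypothetical marker phrases: replace them with text that is unique to each of your layouts.
FORMAT_MARKERS = {
    "format_a": "Tax Invoice",
    "format_b": "Purchase Order",
}


def detect_format(ocr_text: str) -> str:
    """Return the name of the first format whose marker phrase appears in the text."""
    lowered = ocr_text.lower()
    for name, marker in FORMAT_MARKERS.items():
        # Case-insensitive substring check, since OCR output often varies in casing.
        if marker.lower() in lowered:
            return name
    return "unknown"
```

Once the format is known, each PDF can be routed to the extraction logic written for that layout; anything that comes back as `"unknown"` is best set aside for manual review.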

@Chippy_Kolot

You can use regex to extract the data.

See this link for more info:
https://docs.uipath.com/activities/docs/read-pdf-with-ocr

Hello @Chippy_Kolot

You can do it in 3 ways:

  1. Using regex: Read the PDF, then use regex expressions to get the required data (with a Matches activity or plain string manipulation).
  2. Using the Get Text activity: Open the PDF in a PDF reader, then use Get Text to read the data. You need to tie the selectors to proper anchors.
  3. Document Understanding: Use the predefined ML models, or create your own.
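To make the regex option more concrete, here is a small Python sketch of the idea (the same patterns could be used in a UiPath Matches activity). The field names and patterns are assumptions for illustration only; real documents will need patterns written against their actual OCR output.

```python
import re

# Hypothetical patterns: adjust the labels and value shapes to match your documents.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No)\.?\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(ocr_text: str) -> dict:
    """Apply each pattern to the text; fields that are not found map to None."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        result[field] = match.group(1) if match else None
    return result
```

With several layouts, one such pattern set per format keeps the expressions short, and a `None` value makes it easy to spot the files where OCR noise broke a match.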

Document Understanding (AI Fabric) is the best way to extract data from multiple PDFs with different formats. You can use custom machine learning models and Data Manager.