Looping over PDF files in a folder and extracting specific data from each file

I have a folder named uipath at C:\uipath. I used string[] pdfList = Directory.GetFiles("C:\uipath") to get the PDF files, and then a For Each activity to loop over them.
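For anyone who wants to see the folder loop outside UiPath, here is a minimal sketch in Python (not UiPath code). The helper name list_pdfs is made up for illustration, and the demo builds a throwaway folder instead of touching C:\uipath. It also filters by extension, because a plain directory listing returns every file, not only PDFs. In .NET the same filtering can be done by passing a search pattern such as "*.pdf" as the second argument to Directory.GetFiles.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a throwaway folder so the sketch runs anywhere;
# in the thread the folder would be C:\uipath.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    for pdf_path in list_pdfs(tmp):
        # This is where the OCR step would run for each file.
        print(pdf_path.name)
```

The non-PDF file is skipped, and the extension check is case-insensitive, so a.PDF is still picked up.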

The PDF files contain their text as images, so I used the Tesseract OCR activity to extract it. Now I want to pull specific values out of that text.

After that I need to extract specific fields (such as name, address, designation) from another PDF in the same way.

All of the data in the PDF files is in image form.

Finally, I need to write this data to a CSV file. I’m stuck partway through. How should I approach this?
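As a rough illustration of the CSV step, here is a minimal Python sketch (again, not UiPath code). The column names follow the fields mentioned above, while the helper write_records and the sample record are hypothetical placeholders for whatever the OCR step actually produces.

```python
import csv
import io

# Column order for the output file; "file" is an assumed extra column.
FIELDS = ["file", "name", "address", "designation"]


def write_records(records, out):
    """Write a header row plus one CSV row per record to the text stream `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


# Hypothetical values standing in for the extracted data.
records = [
    {
        "file": "a.pdf",
        "name": "Jane Doe",
        "address": "12 Example St, Springfield",
        "designation": "Analyst",
    },
]

buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```

The address contains a comma, so the csv module wraps it in quotes automatically. That is the main reason to use a CSV writer rather than joining strings by hand. To write to disk, pass a file opened with open("output.csv", "w", newline="", encoding="utf-8") instead of the in-memory buffer.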

Please tell me what the next steps should be.

You can use string manipulation operations to extract a specific value (for example Substring, Split, IndexOf).
Or
To get values from an image, you can also use the AI.ComputerVision activities.

Can you give me the syntax or an example? I’m a beginner at data manipulation.

Hi,

Store the extracted PDF text in a variable (here called Result) using an Assign activity.
Suppose the PDF text contains the data below and I want the date. I would do it like this:

Result:
This is paragraph
Date 1-09-2019
This is End.

To get the Date value:
Result.Substring(Result.IndexOf("Date") + "Date".Length).Split(Environment.NewLine.ToCharArray)(0).Trim
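The same idea (find the label, then take the rest of that line) can be sketched in Python for comparison. This is only an illustration, and the function name value_after is made up. One thing it adds is a check for a missing label: when the label is not in the text, an index search reports -1, and slicing from that position silently returns the wrong part of the text.

```python
def value_after(text, label):
    """Return the rest of the line that follows `label`, or None if the label is absent."""
    start = text.find(label)
    if start == -1:
        return None  # label not found; do not slice from a bogus index
    rest = text[start + len(label):]
    lines = rest.splitlines()
    first_line = lines[0] if lines else ""
    # Drop surrounding spaces, tabs, and an optional ":" separator.
    return first_line.strip(" :\t")


result = "This is paragraph\nDate 1-09-2019\nThis is End."
print(value_after(result, "Date"))     # 1-09-2019
print(value_after(result, "Address"))  # None
```

Because the separator is stripped, both "Date 1-09-2019" and "Date : 1-09-2019" give the same value. Note that a short label can also match inside a longer word, so with real OCR output a more specific label is safer.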


If your PDF files are well structured and you only want small parts of them, then you are better off scraping the data from Adobe Reader…

@bcorrea you can’t extract anything if it is an image. OCR is the only way.

Did I say he wouldn’t be using OCR for scraping, though? I just said that the Read PDF with OCR activity is not always the best choice if you only need small pieces of structured PDF files. Computer Vision, for example, uses OCR behind the scenes for things that can’t use selectors, but it is a lot easier to use…

@bcorrea but how do you know where to scrape on the screen? Unless you know the page number, I’m not sure how that would help. And if you know the page number, you can specify just that page in the properties of the Read PDF activity. As a general rule, at my company we still aren’t using OCR because of accuracy issues in critical applications, so I don’t have a ton of practice with it and could definitely be wrong.

Usually people dealing with image PDFs have single pages or structured documents, so they know where things are. But yes, the process would have to navigate the PDF so that the target is visible before scraping… Computer Vision is quite new; give it a go when you can, it is fun to see what it does.
