Looping pdf files in the folder and extracting particular data from each pdf file

Moola_Kommalu · October 17, 2019, 8:52am

I have a folder named uipath at C:\uipath. I used string[] pdfList = Directory.GetFiles("C:\uipath") to get the PDF files, and then a For Each activity to loop over the files.

The PDF files contain text only as images. I used the Tesseract OCR activity to extract the text, but now I want to pull specific data out of it.

After that I need to extract the same kind of data (name, address, designation) from the next PDF, and so on.

All the data in the PDF files is in image format.

Finally, I need to write this data to a CSV file. I’m stuck in the middle of this. How should I do it?

Moola_Kommalu · October 17, 2019, 9:12am

Please tell me what to do next.

radhagangwani90 · October 17, 2019, 9:42am

You can use string manipulation operations (for example Substring, Split, IndexOf) to extract a specific value.
Or
To get values from an image, you can also use the AI.ComputerVision activities.

Moola_Kommalu · October 17, 2019, 9:54am

Can you give me the syntax or an example? I’m a beginner at data manipulation.

radhagangwani90 · October 17, 2019, 4:23pm

Hi,

Store the extracted PDF text in a variable (here, Result) by using an Assign activity.
Suppose the PDF text contains the data below and I want the date. Then I would do it like this:

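Result:
This is paragraph
Date : 1-09-2019
This is End.

To get the Date value:
Result.Substring(Result.IndexOf("Date :") + "Date :".Length).Split(Environment.NewLine.ToCharArray)(0).Trim

The label passed to IndexOf must match the OCR text exactly. If it is not found, IndexOf returns -1 and the expression gives a wrong result, so check for that first.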
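
The same idea can be repeated for each field (name, address, designation), with the results collected into CSV rows. Here is a minimal sketch of that logic in Python rather than UiPath/VB.NET, only because it is easy to run on its own. The labels "Name :", "Address :" and "Designation :", the sample text, and the helper names are all assumptions for illustration; the real labels depend on what the OCR returns for your PDFs.

```python
import csv
import io

FIELDS = ["Name", "Address", "Designation"]  # assumed field labels


def value_after_label(text, label):
    """Return the rest of the line that follows `label`, or "" if it is absent."""
    start = text.find(label)
    if start == -1:  # same guard you would need around IndexOf
        return ""
    rest = text[start + len(label):]
    lines = rest.splitlines()
    return lines[0].strip() if lines else ""


def extract_row(text):
    """Build one CSV row from the OCR text of a single PDF."""
    return [value_after_label(text, field + " :") for field in FIELDS]


def rows_to_csv(rows):
    """Render a header plus the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical OCR output for one PDF.
sample_text = "Name : Jane Doe\nAddress : 12 Example Street\nDesignation : Analyst\n"

row = extract_row(sample_text)
csv_text = rows_to_csv([row])
print(csv_text)
```

In UiPath the equivalent would probably be one Substring/IndexOf/Split expression per field inside the For Each, adding each result as a row to a DataTable and writing it out at the end with the Write CSV activity.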

bcorrea · October 17, 2019, 4:53pm

If your PDF files are well structured and you only want small parts of them, then you are better off scraping the data from Adobe Reader…

Dave · October 17, 2019, 5:31pm

@bcorrea you can’t extract anything if it is an image. OCR is the only way.

bcorrea · October 17, 2019, 5:36pm

Did I say that he wouldn’t be using OCR for scraping, though? I just said that the Read PDF with OCR activity is not always the best choice if you only need small pieces of structured PDF files. Computer Vision (for example) does have OCR behind the scenes for things that can’t use selectors, but it is a lot easier to use…

Dave · October 17, 2019, 5:48pm

@bcorrea but how do you know where to scrape on the screen? Unless you know the page number, I’m not sure how that would help. And if you do know the page number, then you can specify just that page in one of the properties of the Read PDF activity. As a general rule, at my company we still aren’t using OCR due to accuracy issues in critical applications, so I don’t have a ton of practice with it and could definitely be wrong.

bcorrea · October 17, 2019, 5:52pm

Usually people dealing with image PDFs have single pages or structured documents, so they know where things are. But yes, the process would have to navigate the PDF so the data is visible before scraping… Computer Vision is quite new. Give it a go when you can; it is fun to see what it does.
