Extraction of data from multiple images in PDF/Doc Files

Purnima_Sambasivan · August 9, 2021, 10:19am

Hi,

I have multiple PDF files. Each file has timesheet snapshots submitted by a candidate for the whole month (divided by weeks), i.e. for the month of July each PDF will have 4 images (for 4 weeks), and we have around 100 such candidates.
I need to go through each of these documents, scan the individual images, extract the total hours, category, etc., and update those values in the corresponding columns of the Excel file that I am maintaining.

I am finding it challenging to extract data from images. Could you let me know the quickest and best method to extract data from multiple images in a PDF and then update the relevant column in Excel with the extracted information?

Any guidance would be much appreciated. Thanks !

kumar.varun2 · August 9, 2021, 10:49am

Hi @Purnima_Sambasivan

Read the PDF files using the Read PDF With OCR activity.
You have to use an OCR engine to read the files, e.g. Tesseract OCR.
Save the PDF text into a variable.
Use regular expressions to extract the required information from the text.
Write the information to a data table.
Repeat the above steps for all the PDF files.
Finally, write the data table to the Excel file.
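
Outside of Studio, the regex and data table steps above can be sketched in Python. This is only a minimal illustration, not the attached workflow: the sample OCR text, the field labels ("Category", "Total Hours") and the helper names `extract_fields` and `rows_to_csv` are all assumptions, and the patterns will need adjusting to whatever the OCR engine actually returns for your timesheets.

```python
import csv
import io
import re

# Hypothetical OCR output for one timesheet image; real text will differ.
SAMPLE_OCR_TEXT = """
Week 1 Timesheet
Category: Regular
Total Hours: 40.00
"""

# Assumed label patterns; adjust them to the actual timesheet layout.
FIELD_PATTERNS = {
    "category": re.compile(r"Category\s*[:\-]?\s*([A-Za-z ]+)", re.IGNORECASE),
    "total_hours": re.compile(
        r"Total\s+Hours\s*[:\-]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE
    ),
}


def extract_fields(ocr_text):
    """Return one row of extracted values; missing fields become empty strings."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        row[name] = match.group(1).strip() if match else ""
    return row


def rows_to_csv(rows):
    """Serialize the collected rows (the 'data table') as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["file", *FIELD_PATTERNS])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One row per OCR'd document; here a single hypothetical file name.
rows = [{"file": "candidate_001.pdf", **extract_fields(SAMPLE_OCR_TEXT)}]
print(rows_to_csv(rows))
```

Empty strings for missing fields make it easy to spot, in the final sheet, which documents the OCR or the regex failed on.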

For reference

Extraction of Data From multiple images in PDF Files.xaml (11.5 KB)

Purnima_Sambasivan · August 13, 2021, 4:23am

@kumar.varun2 Thank you so much, Varun! I tried PDF extraction with OCR using both Tesseract and Microsoft OCR, and I tried using your code as well, but it is not extracting properly. I am attaching the sample files. Please note that I need to extract the “Total Value”, the “posted” value and the input type code. The location of the input type code keeps varying. Kindly help resolve this issue; it would be of great help. Thanks a lot!

TestFolder.zip (635.9 KB)

wagner · August 13, 2021, 10:07am

Maybe you could run all the files through separate stages, one per OCR engine, and compare the results from each engine.

I mean, invest time in your Regex: create Regex patterns for each piece of information or field you want, and then create samples where you can compare the results.

Then, at the end, you can set conditions in your automation to decide which extracted data fits better. For example, a Name field may have better results in the Tesseract stage compared to Microsoft OCR, while for another field Microsoft OCR may have better results compared to Tesseract.

Your automation can also store the best results in a data table and, when it is done, write everything to an Excel file.

This gives you dynamic checks and validations based on machine learning principles (setting predictions within the code).
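
As a rough sketch of that per-field comparison, here it is in Python rather than as a UiPath workflow. The engine names, the field names ("total", "posted"), the sample values and the `pick_best` helper are assumptions for illustration only. The idea is to accept, for each field, the first engine whose value passes a validation regex.

```python
import re

# Assumed validators: a value is plausible only if it fully matches the pattern.
VALIDATORS = {
    "total": re.compile(r"\d+(?:\.\d+)?"),
    "posted": re.compile(r"\d+(?:\.\d+)?"),
}


def pick_best(results_by_engine, engine_order):
    """For each field, return (engine, value) for the first engine in
    engine_order whose value passes validation, or None if none does."""
    best = {}
    for field, validator in VALIDATORS.items():
        best[field] = None
        for engine in engine_order:
            value = str(results_by_engine.get(engine, {}).get(field) or "").strip()
            if validator.fullmatch(value):
                best[field] = (engine, value)
                break
    return best


# Hypothetical outputs: each engine misreads a different field.
results = {
    "tesseract": {"total": "4O.00", "posted": "38.5"},
    "microsoft": {"total": "40.00", "posted": "3B.5"},
}
print(pick_best(results, ["tesseract", "microsoft"]))
```

A field that comes back as `None` failed validation in every engine and is a candidate for manual review.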

I hope this was helpful.

Thank you.

Good luck with your challenge.

Purnima_Sambasivan · August 25, 2021, 12:27pm

Thank you so much, Wagner, for the ideas. I will try them out.
