Extraction of data from multiple images in PDF/Doc Files


I have multiple PDF files. Each file has timesheet snapshots from a candidate for the whole month, divided by week; i.e., for the month of July each PDF will have 4 images (one for each of the 4 weeks), and we have around 100 such candidates.
I need to go through each of these documents, scan the individual images, extract the total hours, category, etc., and update those values in the corresponding columns of the Excel file that I am maintaining.

I am finding it challenging to extract data from images. Could you let me know the quickest and best method to extract data from multiple images in a PDF and then update the relevant columns in the Excel file with the extracted information?

Any guidance would be much appreciated. Thanks!

Hi @Purnima_Sambasivan

Read the PDF files using the Read PDF With OCR activity.
This activity needs an OCR engine to read the files, e.g., Tesseract OCR.
Save the PDF text in a variable.
Use regular expressions to extract the required information from the text.
Write the information to a data table.
Repeat the above steps for all the PDF files.
Finally, write the data table to the Excel file.
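
Outside UiPath, the regex and data table steps above can be sketched in Python. This is only a rough illustration, not the attached workflow: it assumes the OCR text has already been read into a string per PDF, and the label patterns ("Total", "Category"), the function names and the CSV output (used here in place of a direct Excel write) are all assumptions to adapt to what your OCR engine actually returns.

```python
import csv
import io
import re

# Hypothetical label patterns; adjust them to the text your OCR engine produces.
TOTAL_RE = re.compile(
    r"total\s*(?:hours|value)?\s*[:=]?\s*(\d+(?:[.,]\d+)?)", re.IGNORECASE
)
CATEGORY_RE = re.compile(r"category\s*[:=]?\s*([A-Za-z][A-Za-z ]*)", re.IGNORECASE)


def extract_fields(ocr_text):
    """Return the total hours and category found in one block of OCR text."""
    total = TOTAL_RE.search(ocr_text)
    category = CATEGORY_RE.search(ocr_text)
    return {
        # A decimal comma is treated as a decimal point.
        "total_hours": float(total.group(1).replace(",", ".")) if total else None,
        "category": category.group(1).strip() if category else None,
    }


def build_rows(texts_by_file):
    """Build one row per PDF from a mapping of file name to OCR text."""
    rows = []
    for name, text in texts_by_file.items():
        row = {"file": name}
        row.update(extract_fields(text))
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["file", "total_hours", "category"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Fields that the patterns cannot find come back as None and end up as empty cells, which makes it easy to spot the files that need a manual check.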

For reference

Extraction of Data From multiple images in PDF Files.xaml (11.5 KB)

@kumar.varun2 Thank you so much, Varun! I tried PDF and OCR extraction using both Tesseract and Microsoft OCR, but the text is not being extracted properly. I also tried using your code. I am attaching the sample files. Please note that I need to extract the “Total Value”, the “posted” value and the input type code. The location of the input type code keeps varying. Kindly help me resolve this issue; it would be of great help. Thanks a lot!

TestFolder.zip (635.9 KB)

Maybe you could run all of the files through each OCR engine in separate stages and compare the results from each engine.

I mean, invest time in your Regex: create a Regex for each piece of information or field you want, and then build samples on which you can compare the results.

Then, at the end, you can set conditions in your automation that decide which extracted data fits better. For example, a name field may give better results in the Tesseract stage than with Microsoft OCR, while for another field Microsoft OCR may give better results than Tesseract.

Your automation can also store the best results in a data table and, when it is done, write everything to an Excel file.

This gives you dynamic checks and validations based on machine learning principles (setting predictions within the code).
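
To make the per-field comparison a bit more concrete, here is a minimal Python sketch. It assumes each OCR engine's output has already been reduced to a dict of field values; the field names, the validation patterns (especially the format of the input type code) and the engine names are hypothetical placeholders, not something taken from the sample files.

```python
import re

# Hypothetical validation rules: one pattern per field that a plausible value
# must fully match. Replace them with the formats seen in your timesheets.
VALIDATORS = {
    "total": re.compile(r"\d{1,3}(\.\d{1,2})?"),
    "posted": re.compile(r"\d{1,3}(\.\d{1,2})?"),
    "input_type_code": re.compile(r"[A-Z]{2,5}"),
}


def is_valid(field, value):
    """A value is valid when it is present and fully matches the field's pattern."""
    if value is None:
        return False
    return VALIDATORS[field].fullmatch(value.strip()) is not None


def pick_best(results_by_engine):
    """For each field, keep the first engine's value that passes validation.

    results_by_engine maps an engine name to its dict of extracted fields.
    Engines are tried in insertion order; fields that no engine resolves stay
    None so they can be flagged for manual review.
    """
    best = {}
    for field in VALIDATORS:
        best[field] = None
        for engine, fields in results_by_engine.items():
            value = fields.get(field)
            if is_valid(field, value):
                best[field] = {"value": value.strip(), "engine": engine}
                break
    return best
```

For example, if Tesseract reads the total as "4O.0" (letter O) and Microsoft OCR reads "40.0", only the second value passes validation, so that one is kept for the total while other fields can still come from Tesseract.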

I hope this was helpful.

Thank you.

Good luck with your challenge.

Thank you so much for the ideas, Wagner. I will try them out.