PDF automation (invoice extraction)

anjani_priya · February 8, 2024, 5:11am

I have a folder that contains both scanned PDFs and regular (text-based) PDFs. How can I extract the text from both kinds with a single workflow and save it to Excel?

Parvathy · February 8, 2024, 5:12am

Hi @anjani_priya

You can use the Read PDF with OCR activity, which extracts text from both scanned and regular PDFs. Then use regular expressions to pull out the required data and write it to Excel.
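For readers who want to see the overall flow outside of Studio, here is a minimal Python sketch of the same idea: loop over every PDF in a folder, read its text, extract the fields, and save the rows. It is not the UiPath workflow itself. The `read_text` and `extract` callables are hypothetical placeholders for whatever OCR-capable reader and regex step you use, and the output is a CSV file (which Excel opens) rather than a native .xlsx workbook.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List


def collect_rows(folder: str,
                 read_text: Callable[[Path], str],
                 extract: Callable[[str], Dict[str, str]]) -> List[Dict[str, str]]:
    """Run the same read -> extract steps on every PDF in a folder."""
    rows = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        text = read_text(pdf)      # hypothetical reader, e.g. an OCR call
        row = {"File": pdf.name}
        row.update(extract(text))  # hypothetical field extractor
        rows.append(row)
    return rows


def write_csv(rows: List[Dict[str, str]], out_path) -> None:
    """Write the rows to a CSV file that Excel can open."""
    # Build the header from the keys seen in the rows, in first-seen order.
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Because the reader is passed in, the same loop can be reused whether the text comes from OCR or from a plain text-layer reader.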

Regards

anjani_priya · February 8, 2024, 5:14am

When I extract text from scanned PDFs, I get an empty result unless I specify the correct size, but the field text has a different size in every PDF file.

Parvathy · February 8, 2024, 5:17am

Hi @anjani_priya

Can you share the PDFs that have different font sizes, if they don't contain confidential information?

Regards

anjani_priya · February 8, 2024, 5:22am

It is confidential information.

anjani_priya · February 8, 2024, 8:41am

I am getting an empty result. I have used regex.

Parvathy · February 8, 2024, 8:43am

@anjani_priya

If possible, can you share the text? I will help you with the regex.

Regards

anjani_priya · February 8, 2024, 8:47am

Can you help me with the regex for this PDF?
wordpress.pdf (42.6 KB)
My invoice is a scanned PDF.

Parvathy · February 8, 2024, 8:51am

@anjani_priya

What should be extracted?

Regards

anjani_priya · February 8, 2024, 8:52am

Invoice Number
Order Number
Invoice Date
Due Date

Parvathy · February 8, 2024, 9:25am

Hi @anjani_priya

Check out the below workflow:

Sequence21.xaml (13.4 KB)

So, below are the specifications:

In Read PDF with OCR, change the Image DPI from 150 to 270.

In the Tesseract OCR engine, the scaling field will be empty, so set it to 2.5.

When you write your data to a text file, you will get | symbols, so replace each | with an empty string.
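As a rough illustration of the clean-up and regex step (this is not the content of the attached workflow), here is a small Python sketch. It strips the | symbols and then looks for the four requested labels. The assumption that each value sits on the same line as its label, optionally after a colon, is mine, since the layout of the invoice text is not shown in this thread; the function names are hypothetical too.

```python
import re

# The four fields requested in this thread.
FIELDS = ("Invoice Number", "Order Number", "Invoice Date", "Due Date")


def clean_ocr_text(text: str, unwanted: str = "|") -> str:
    """Remove stray OCR symbols (by default only the pipe character)."""
    return text.translate({ord(ch): None for ch in unwanted})


def extract_fields(text: str) -> dict:
    """Return the text that follows each label on the same line.

    Assumes a "Label: value" or "Label value" layout; a label that is
    not found, or has nothing after it on its line, maps to "".
    """
    result = {}
    for label in FIELDS:
        pattern = re.escape(label) + r"[ \t]*:?[ \t]*([^\r\n]+)"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[label] = match.group(1).strip() if match else ""
    return result
```

To drop more symbols than the pipe, pass them in `unwanted`, for example `clean_ocr_text(text, unwanted="|~")`. If the values in your OCR output sit on the line below the label, the pattern will need adjusting.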

Hope that makes sense, @anjani_priya!

anjani_priya · February 9, 2024, 7:20am

I am getting these symbols in the output.

Parvathy · February 9, 2024, 7:21am

Hi @anjani_priya

You can replace those symbols with an empty string as well.

Regards

anjani_priya · February 9, 2024, 7:24am

I am getting many other symbols in the output.

Parvathy · February 9, 2024, 7:24am

@anjani_priya

Can you specify which special characters you are getting?

Regards
