Need help with invoice extraction using RegEx

ashwini.bagewadi · December 2, 2021, 12:47pm

Hi All,

Please help me with PDF data extraction using regex.
In the attached PDF, the header spans 2 rows, and each line item also spans 2 rows.

I want to extract the data into Excel as per the example given below. Please help.
PRODUCT ID.pdf (53.3 KB)

Boopathi.M · December 2, 2021, 1:47pm

Hi @ashwini.bagewadi

Yes, the required values can be extracted with a regular expression, but I am not able to open your PDF. I get a message saying it is either corrupted or unstructured.

Also, the regex approach is appropriate only if the number of rows in your input (the Product ID file) is always limited. If you get a larger number of rows, it would be better to look for an option that converts the table data directly into an Excel file.

Please check if the below activity helps you

Thanks,

ashwini.bagewadi · December 2, 2021, 2:52pm

Thank you so much @Boopathi.M for your response. I am attaching the PDF file once again. It would be really helpful if you could let me know whether the data can be extracted from it. And thank you for the suggestion, I will look into the link you sent me.
PRODUCT ID.pdf (53.3 KB)

Boopathi.M · December 2, 2021, 6:34pm

Hi @ashwini.bagewadi

Thank you for sharing the PDF file again. I was able to open it and extract the data using regex.

This is the output

Step 1 - Install the package UiPath.PDF.Activities, use Read PDF Text to read the PDF as text, and save the output to a text file.
Step 2 - From that output text, drop the headers with this regex pattern: “(?<=Extended Cost)[\s\S]*”. It extracts all the data after the header. The pattern is greedy, i.e. it captures everything after the header. If each row ends with a constant value or some fixed pattern, the match can be made non-greedy so that it stops there.
Step 3 - On the remaining data, use a second regex to identify the individual values. I used a pattern with named groups, and the matches are returned as a collection.

Step 4 - Use a For Each over the matches to read each named group and add the values to a data row of a datatable that is built before the data is read.

Step 5 - Write the datatable to an Excel file.
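For readers who want to see Steps 2 to 5 outside of a workflow, here is a minimal sketch in Python rather than UiPath. It is not the pattern from the attached xaml: the sample text, the column names (item, qty, unit_cost, extended_cost) and the row pattern are all assumptions made up for illustration, and CSV text stands in for the Excel output. Only the header-stripping pattern is taken from Step 2.

```python
import csv
import io
import re

# Hypothetical stand-in for the text produced by Read PDF Text.
# The real invoice layout is not shown in this thread.
SAMPLE_TEXT = """Invoice 12345
Item Qty Unit Cost Extended Cost
WIDGET-A 2 10.00 20.00
WIDGET-B 1 5.50 5.50
"""

# Step 2: keep everything after the header (greedy, runs to the end of the text).
BODY_PATTERN = re.compile(r"(?<=Extended Cost)[\s\S]*")

# Step 3: one named group per column. Column names and formats are assumptions.
ROW_PATTERN = re.compile(
    r"^(?P<item>\S+)[ \t]+(?P<qty>\d+)[ \t]+"
    r"(?P<unit_cost>\d+\.\d{2})[ \t]+(?P<extended_cost>\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)

FIELDS = ["item", "qty", "unit_cost", "extended_cost"]


def extract_rows(text):
    """Return one dict per line item found after the header."""
    body = BODY_PATTERN.search(text)
    if body is None:
        return []
    # Step 4: loop over the matches and read each named group into a row.
    return [m.groupdict() for m in ROW_PATTERN.finditer(body.group(0))]


def rows_to_csv(rows):
    """Step 5 stand-in: serialise the rows as CSV text instead of Excel."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_TEXT)
print(rows_to_csv(rows))
```

In UiPath the same shape would map to a Matches activity for the named-group pattern and an Add Data Row inside the For Each, but the row pattern has to be written against the real text output of the PDF.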

Attached the xaml for your reference.

RegularExpression.zip (58.6 KB)

Note - If the structure of the PDF changes, then data output will also change.

Thanks,
Boopathi

ashwini.bagewadi · December 3, 2021, 3:40pm

Thank you so much @Boopathi.M. Your solution worked perfectly!

ashwini.bagewadi · December 6, 2021, 5:33am

Hi @Boopathi.M

I am trying your solution with my data, but I'm unable to create groups in the RegEx Builder, e.g. PRODUCT ID1 (this is my product ID, DCD797C3) and PRODUCT ID2 (1001311467). Can you please help me with how to create the group for PRODUCT ID?

Thanks,
Ashwini

Boopathi.M · December 6, 2021, 7:06am

Hi @ashwini.bagewadi

You need to append Product 1 and Product 2 to get the product ID.
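As a rough illustration of that idea, the Python sketch below captures the two parts in separate named groups and joins them afterwards. The two product ID values come from the post above, but the two-row layout around them, the other numbers on the lines, the group names and the space separator are all assumptions, so the pattern would need adjusting to the real text.

```python
import re

# Hypothetical two-row line item. Only the two ID values are from the thread;
# the surrounding layout and numbers are invented for illustration.
SAMPLE_ITEM = "DCD797C3 2 10.00\n1001311467 20.00\n"

# First row starts with the alphanumeric part, second row with the numeric part.
ITEM_PATTERN = re.compile(
    r"^(?P<product_id1>[A-Z0-9]+)[ \t]+.*\n(?P<product_id2>\d+)\b",
    re.MULTILINE,
)


def combined_product_ids(text, separator=" "):
    """Append the second captured part to the first for every line item."""
    return [
        m.group("product_id1") + separator + m.group("product_id2")
        for m in ITEM_PATTERN.finditer(text)
    ]


print(combined_product_ids(SAMPLE_ITEM))
```

Keeping the two parts in separate groups means the same match can also fill two separate columns if the combined value is not what is needed.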

Could you please provide a screenshot of the error or the xaml you are using?

I will check it out.

Thanks

system · December 9, 2021, 7:07am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
