Need help with invoice extraction using regex

Hi All,


Please help me with PDF data extraction using regex.
In the attached PDF, the header spans two rows, and each line item also spans two rows.

I want to extract the data into Excel as per the example given below. Please help.
PRODUCT ID.pdf (53.3 KB)

Hi @ashwini.bagewadi

Yes, we can extract the required values using a regular expression, but I am not able to open your PDF. I get a message saying it is either corrupted or unstructured.

Also, the regex approach is appropriate only if the number of rows in your input (the Product ID file) is always limited. If you get a larger number of rows, it would be best to look for an option that converts the tabular data into an Excel file.

Please check whether the activity below helps you.

Thanks,

Thank you so much @Boopathi.M for your response. I am attaching the PDF file once again. It would be really helpful if you could let me know whether the data can be extracted from it. And thank you for the suggestion, I will look into the link you sent me.
PRODUCT ID.pdf (53.3 KB)

Hi @ashwini.bagewadi

Thank you for sharing the PDF file again. I was able to open it and extract the data using regex.

This is the output

Step 1 - Install the UiPath.PDF.Activities package, read the PDF as text with the Read PDF Text activity, and save the output to a text file.
Step 2 - From that output text file, I removed the headers using this regex pattern: "(?<=Extended Cost)[\s\S]*". It matches all the data after the header. It is greedy, i.e. it captures everything after the header through the end of the text. If there is a constant value at the end of the rows, or some fixed pattern that marks the end, we can make it non-greedy.
Step 3 - On the remaining data, use regex to identify the values. I used a pattern with named groups, and the matches are saved as a collection.

Step 4 - Use a For Each to read each named group and add the values to a data row. The DataTable is built before the data is read.

Step 5 - Write the DataTable to an Excel file.
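
The five steps above can also be sketched outside UiPath. The following is a minimal Python sketch, not the attached workflow. The sample text, the row pattern, the group names, and the "Total" footer are all assumptions made for illustration, because the real invoice layout and the pattern used in Step 3 are not shown in this thread. Only the header pattern from Step 2 and the two ID values come from the posts. Note that Python writes named groups as (?P<name>...), while .NET (and therefore UiPath) writes them as (?<name>...). CSV is used in place of Excel to keep the sketch free of third-party packages.

```python
import csv
import io
import re

# Hypothetical stand-in for the text saved in Step 1.
# The header and each line item span two rows, as described in the question.
SAMPLE_TEXT = """INVOICE 0001
PRODUCT ID Description Qty
Unit Cost Extended Cost
DCD797C3 Drill kit 2
1001311467 99.50 199.00
DCF887B Impact driver 1
1001311468 120.00 120.00
Total 319.00
"""

# Hypothetical column names, one per named group in ROW below.
FIELDS = ["product_id1", "description", "qty",
          "product_id2", "unit_cost", "extended_cost"]

# Hypothetical row pattern: first physical row, newline, second physical row.
ROW = re.compile(
    r"(?P<product_id1>\S+)[ ]+(?P<description>.+?)[ ]+(?P<qty>\d+)[ ]*\n"
    r"(?P<product_id2>\d+)[ ]+(?P<unit_cost>\d+\.\d{2})"
    r"[ ]+(?P<extended_cost>\d+\.\d{2})"
)


def strip_header(text, end_marker=None):
    """Step 2: return the text that follows the 'Extended Cost' header.

    Without end_marker the match is greedy and runs to the end of the text.
    With end_marker (an assumed footer label) the match is non-greedy and
    stops just before the first occurrence of that marker.
    """
    if end_marker is None:
        pattern = r"(?<=Extended Cost)[\s\S]*"
    else:
        pattern = r"(?<=Extended Cost)[\s\S]*?(?=" + re.escape(end_marker) + ")"
    match = re.search(pattern, text)
    return match.group(0) if match else ""


def extract_rows(text):
    """Steps 3 and 4: one dict per line item, keyed by named group."""
    body = strip_header(text)
    return [m.groupdict() for m in ROW.finditer(body)]


def rows_to_csv(rows):
    """Step 5 (CSV instead of Excel): serialise the rows with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE_TEXT)))
```

With the sample text, extract_rows returns one dictionary per two-row line item, and the footer line is skipped because it does not fit the row pattern. A real invoice would need its own row pattern.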

I have attached the xaml for your reference.

RegularExpression.zip (58.6 KB)

Note - If the structure of the PDF changes, the extracted output will change too, so the patterns would need to be adjusted.

Thanks,
Boopathi

Thank you so much @Boopathi.M. Your solution worked perfectly!

Hi @Boopathi.M

I am trying your solution with my data, but I'm unable to create groups in the RegEx Builder, e.g. PRODUCT ID1 (this is my product ID, DCD797C3) and PRODUCT ID2 (1001311467). Can you please help me with how to create the PRODUCT ID group?

Thanks,
Ashwini

Hi @ashwini.bagewadi

You need to append Product 1 and Product 2 to get the product ID.
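
One possible reading of "append Product 1 and Product 2", sketched in Python rather than in a UiPath workflow: capture each part in its own named group and join the two values after the match. The sample line item, the group names, and the space separator are assumptions. Only the two ID values come from this thread. In .NET the group syntax would be (?<name>...) and each value would be read with Groups("name").Value.

```python
import re

# Hypothetical two-row line item; only the two ID values are from the thread.
LINE_ITEM = "DCD797C3 Drill kit 2\n1001311467 99.50 199.00"

# The first token of each physical row is captured as its own named group.
ID_PARTS = re.compile(r"^(?P<product_id1>\S+).*\n(?P<product_id2>\d+)")


def combined_product_id(text, separator=" "):
    """Join the two captured parts; return None when the text does not match."""
    match = ID_PARTS.search(text)
    if match is None:
        return None
    return match.group("product_id1") + separator + match.group("product_id2")


if __name__ == "__main__":
    print(combined_product_id(LINE_ITEM))
```

Whether the two parts should be joined with a space, another separator, or nothing at all depends on how the product ID is expected to look in the Excel output.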

Could you please provide a screenshot of the error or the xaml you are using?

I will check it out.

Thanks
