RegEx help and saving to Excel

I need to read from here:
"Invoice No. 7334/461
A.B.N 26 008 672 179 Invoice Date 21/03/2021
Level 3, 25 Rowe Avenue, Rivervale WA 6103
SEQ97318 Account No. 31177
08:07 Order No. 994 HUNTERS HILL
J4SvTAfaeP Date Order Received 21/03/2022

994 HUNTERS HILL Page 1 of 1
4470390 CLEANER GLASS SIMPLE GREEN 750ML RTU 00168 1 EACH 6.00 D 6.00 0.60 6.60
0065280 GLOVES RUBBER SABCO 3PK LATEX MED SAB80001 1 EACH 5.00 5.00 0.50 5.50
4471041 BRUSH NAIL SABCO NAIL BRUSH 25007 1 EACH 2.91 2.91 0.29 3.20
4460431 BRUSH MR CLEAN SHOE PB477 1 EACH 2.99 2.99 0.30 3.29
9036011974504599473 TOBY MERRELL

16.90 1.69 18.59"

and enter into Excel the product code, description, quantity, rate and the detail on the last line that gives someone’s name (for instance, here: TOBY MERRELL).
I’m not good with regex and don’t know how to loop through the table and get only the needed info. Please, any help will be great!!

Also, I’m entering it into Excel columns and looping through the invoices in the folder. Some columns will have many rows and some of them only one. Will UiPath start each new invoice on a new row, even if the previous rows are not filled for every column?

Thank you!

Hi @natasha6 ,

Could you provide the expected output for this data as an Excel file?

This would help us figure out which fields need to be extracted and in what order.

In addition, could you tell us more about the PDF files being used? If the PDFs are always digital, make sure you set the PreserveFormatting property to True in the Read PDF Text activity, and then provide us with the resulting input text.

Also, just to be clear about the table structure: if you could provide the PDF document, we would be able to suggest alternatives in case regex doesn’t work.

Sorry, how can I set PreserveFormatting? It is not just a simple check box, so I’m not sure how to set it.

And here is a PDF (always digital, but I had to delete some info from it) and the desired outcome in Excel. Thank you!
forforum_.pdf (101.0 KB)

@natasha6, I am not sure how you got the output you have shown. But using regex capturing groups, we can get the output in the below way:

Also, it is unclear why the name TOBY MERRELL should be in the first row, and the other names in their respective rows.

If we can understand why that is the case, we may be able to help you get the output you want.

Because in each invoice there is more than one product! If you look at the PDF, there are 4 products but only one name. In other invoices there can be any number of products, from one to 100, but only one name at the end, and one invoice number. I made the desired Excel myself, and the question was: is it even possible?

@natasha6, we don’t know the structure of all the possible data, keeping in mind that there might be multiple names like “TOBY MERRELL” that we need to capture. So unless we have the full picture of the data format where there are multiple products, I do not think we can conclude possible/not possible just yet.

Since the PDF is digital, though, we should be able to find some way to extract the relevant data.

Below is the workflow that performs the extraction for the PDF provided. It doesn’t capture the name, but I do think that is possible using string manipulation.

You could run the workflow on different PDFs and let us know whether it behaves the same for all of them, keeping in mind that it assumes the columns are always in the same format.
Extract_Table_Regex.xaml (10.7 KB)
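
For readers who cannot open the attachment, the capturing-group idea can be illustrated outside UiPath. What follows is a minimal Python sketch, not the contents of Extract_Table_Regex.xaml. It assumes (based only on the sample text in the question) that a product line starts with a 7-digit code, that the quantity is followed by an uppercase unit such as EACH, and that the rate is the first amount after the unit. The names `PRODUCT_LINE` and `extract_products` are made up for this example.

```python
import re

# Assumed line layout: <7-digit code> <description> <qty> <UNIT> <rate> ...
PRODUCT_LINE = re.compile(
    r"^(?P<code>\d{7})[ \t]+"        # product code at the start of the line
    r"(?P<description>.+?)[ \t]+"    # description, as short as possible
    r"(?P<quantity>\d+)[ \t]+"       # quantity
    r"(?P<unit>[A-Z]+)[ \t]+"        # unit of measure, e.g. EACH
    r"(?P<rate>\d+\.\d{2})\b",       # first amount after the unit
    re.MULTILINE,
)


def extract_products(text):
    """Return one dict of captured fields per line that looks like a product."""
    return [m.groupdict() for m in PRODUCT_LINE.finditer(text)]


# Table part of the sample text posted in the question.
SAMPLE = """\
994 HUNTERS HILL Page 1 of 1
4470390 CLEANER GLASS SIMPLE GREEN 750ML RTU 00168 1 EACH 6.00 D 6.00 0.60 6.60
0065280 GLOVES RUBBER SABCO 3PK LATEX MED SAB80001 1 EACH 5.00 5.00 0.50 5.50
4471041 BRUSH NAIL SABCO NAIL BRUSH 25007 1 EACH 2.91 2.91 0.29 3.20
4460431 BRUSH MR CLEAN SHOE PB477 1 EACH 2.99 2.99 0.30 3.29
9036011974504599473 TOBY MERRELL

16.90 1.69 18.59
"""

for product in extract_products(SAMPLE):
    print(product["code"], product["description"], product["quantity"], product["rate"])
```

The same pattern should carry over to a UiPath regex activity with little change, since .NET accepts `(?<name>...)` for named groups where Python uses `(?P<name>...)`. Note that the description captured here still includes the trailing supplier code (for example `00168`), because the sample gives no reliable way to tell it apart.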

If required to assist further with Multiple Products data, we would require you to provide us with the sample data to work on.


FYI, it seems the UiPath.PDF.Activities package is missing. Can you try installing it via Manage Packages in the ribbon menu?


you were right!

Thank you! It is working for all PDFs! But it is not exactly what I’m after… Everything I need:
invoice number, date, job, product code, description, quantity, rate and the detail on the last line that gives someone’s name. And I need to loop through all invoices in the folder. I have already built the reading of invoice number, date and job from all invoices and am successfully entering them into Excel. Now I just need the rest.
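
One way to keep the sheet aligned when each invoice has one header but many products is to always append complete rows, filling the invoice-level fields only on the first row of each invoice. The Python sketch below illustrates that layout and says nothing about how UiPath's Excel activities behave. The column names, the helper names and the second invoice (`EXAMPLE-2`, `EXAMPLE NAME`) are hypothetical, and CSV stands in for the Excel file.

```python
import csv
import io

# Hypothetical column order for the output sheet.
COLUMNS = ["invoice_no", "name", "code", "description", "quantity", "rate"]


def rows_for_invoice(invoice_no, name, products):
    """One row per product; invoice-level fields only on the first row."""
    rows = []
    for index, product in enumerate(products):
        first = index == 0
        rows.append({
            "invoice_no": invoice_no if first else "",
            "name": name if first else "",
            **product,
        })
    return rows


def build_sheet(invoices):
    """Concatenate the rows of every invoice, so each one starts on a new row."""
    all_rows = []
    for invoice in invoices:
        all_rows.extend(
            rows_for_invoice(invoice["invoice_no"], invoice["name"], invoice["products"])
        )
    return all_rows


def to_csv(rows):
    """Serialise the rows as CSV text, which Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


invoices = [
    {   # values taken from the sample invoice in the question
        "invoice_no": "7334/461",
        "name": "TOBY MERRELL",
        "products": [
            {"code": "4470390", "description": "CLEANER GLASS SIMPLE GREEN 750ML RTU 00168",
             "quantity": "1", "rate": "6.00"},
            {"code": "0065280", "description": "GLOVES RUBBER SABCO 3PK LATEX MED SAB80001",
             "quantity": "1", "rate": "5.00"},
        ],
    },
    {   # made-up placeholder invoice, only to show where the next one starts
        "invoice_no": "EXAMPLE-2",
        "name": "EXAMPLE NAME",
        "products": [
            {"code": "0000000", "description": "EXAMPLE PRODUCT",
             "quantity": "2", "rate": "1.00"},
        ],
    },
]

print(to_csv(build_sheet(invoices)))
```

Because every product becomes a full row, the blank invoice-level cells are written explicitly, and the next invoice cannot slide up into them.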

@natasha6, the above fields were also considered and were extracted using regex.

As mentioned before, we would like to get a PDF that contains multiple products; then we will be able to create the correct logic for identifying the names.

first pdf had multiple products? or do you mean you need another pdf?
forforum_.pdf (101.1 KB)

@natasha6 We would require another sample of PDF which contains multiple Products.

??? I don’t understand??? can’t you see Many lines for product??

@natasha6 I meant the case of having multiple names (like “TOBY MERRELL”) in one PDF.

oh, sorry, the name is always one, just not the same every time. but always in the same place

@natasha6 If that is the case, I am not sure whether the expected output you provided was for one PDF.

Was it for multiple PDFs? I ask because there were multiple names present.

@natasha6 The resulting output after the modification is like below:

Since the name is always in the last line, we can split the text into lines, pick the last line, and apply a regex that takes only the words in it.

Then we assign it to the first row of the DataTable.
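
As a rough illustration of that split-and-pick step (again in Python, not the attached workflow): in the text pasted in the question, a totals line follows the name line, so this sketch walks up from the bottom and takes the first line that contains letters. That adjustment, the pattern, and the names `WORDS` and `extract_name` are assumptions made for the example; a footer containing words would be picked up instead.

```python
import re

# A run of alphabetic words separated by single spaces, hyphens or apostrophes.
WORDS = re.compile(r"[A-Za-z]+(?:[ '-][A-Za-z]+)*")


def extract_name(text):
    """Return the words on the last line that contains any letters, or None."""
    for line in reversed(text.splitlines()):
        match = WORDS.search(line)
        if match:
            return match.group(0)
    return None


# Tail of the sample text posted in the question.
TAIL = """\
4460431 BRUSH MR CLEAN SHOE PB477 1 EACH 2.99 2.99 0.30 3.29
9036011974504599473 TOBY MERRELL

16.90 1.69 18.59
"""

# Stand-in for the DataTable: one dict per product row, name column empty.
rows = [
    {"code": "4471041", "name": ""},
    {"code": "4460431", "name": ""},
]

# The name appears once per invoice, so it goes on the first row only.
rows[0]["name"] = extract_name(TAIL) or ""
print(rows)
```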

Extract_Table_Regex.xaml (12.1 KB)

Let us know if it doesn’t work.

And here are all the invoices in a sample folder:
36877_INVOICE-CREDIT_28_03_2022_3.pdf (101.8 KB)
36877_INVOICE-CREDIT_28_03_2022_4.pdf (102.1 KB)
36877_INVOICE-CREDIT_28_03_2022_5.pdf (102.2 KB)
36877_INVOICE-CREDIT_28_03_2022_6.pdf (101.9 KB)
36877_INVOICE-CREDIT_28_03_2022_7.pdf (102.2 KB)

36877_INVOICE-CREDIT_28_03_2022_1.pdf (102.0 KB)
36877_INVOICE-CREDIT_28_03_2022_2.pdf (101.7 KB)

@natasha6 ,

Do you want the output to be in the below way: