Extract dynamic Page PDF to Excel

Hi, I need to extract data from a PDF with multiple pages (for example, 10 pages). How can I extract all 10 pages to Excel? Is this possible with UiPath?
Thanks.
-lix-

Hi @mycroft7

It is possible to extract dynamic data from PDF & save it in excel using UiPath. The approach depends on the type of PDF you have.

  1. If you have a structured (digital) PDF, you can use the ‘Get Text’ activity to get the desired data from the fixed structure of the PDF. You can also read the PDF text using any OCR & extract the text by applying RegEx.
  2. If you have semi-structured or unstructured (scanned) PDFs, you can use Document Understanding, where you define the taxonomy and extract the data using Machine Learning packages.

Hope this helps,
Best Regards.

Yes, it's a structured PDF with the same format on every page; only the data is dynamic. If I use Get Text, how do I move on to page 2?

Thanks.

Hi @mycroft7 ,

Could you provide a sample PDF document? We would also need to know the output format required in the Excel sheet. If you could provide screenshots or an Excel mock-up, we will be able to help you better and faster.

Hi @mycroft7,

Can you share a sample file so that we can provide you with a solution?

Here is the sample PDF that I want to extract, along with the Excel file that I want to build. Thanks.
Sample pdf.pdf (133.3 KB)
Book1.xlsx (10.0 KB)

Please help with this case. Thanks.

@mycroft7 ,

Apologies for the late reply.

Could you check the workflow below:
PDF_GetBreakdownData_ToExcel.zip (109.4 KB)

  1. First, we split the data into sections, since we need the data from Order No to the Quantity total for each order, using the regex below:
Order No:[\s\S]+?Total[\s\S]+?Quantity:.*
  2. Next, for each of the split sections, we capture the required values (Order No., Product Name, and the sizes S, M, L, XL) using the regexes below:

Order No. :

(?<=Order No:\s*).+?\s

Product Name :

(?<=Product Name:\s*).*

Size / Color Breakdown :

(?<=Size \/ Colour breakdown\r?\n).*

All the sizes, along with the total quantity, can be captured with the regex below by using named groups:

Total\s*S\s\(S\)\*(?<Small>.*)\s*M\s\(M\)\*(?<Medium>.*)\s*L\s\(L\)\*(?<Large>.*)\s*XL\s\(XL\)\*(?<XL>.*)\s*Quantity:(?<Total>.*)
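For readers who want to try the two steps outside UiPath, here is a minimal Python sketch of the same idea. It is not the attached workflow. The sample text is hypothetical and shaped only by what the patterns above imply, so the real PDF text may differ. Python's `re` needs `(?P<Name>...)` instead of the .NET `(?<Name>...)`, and it does not accept a variable-width lookbehind such as `(?<=Order No:\s*)`, so plain capture groups are used for the single fields.

```python
import re

# Hypothetical text: an assumption about what the PDF text looks like.
SAMPLE_TEXT = """Order No: 1001
Product Name: Basic Tee
Size / Colour breakdown
Black
Total
S (S)*10
M (M)*20
L (L)*30
XL (XL)*40
Quantity: 100
Order No: 1002
Product Name: Polo Shirt
Size / Colour breakdown
Navy
Total
S (S)*1
M (M)*2
L (L)*3
XL (XL)*4
Quantity: 10
"""

# Step 1: one match per order, from "Order No:" to the "Quantity:" line.
SECTION_RE = re.compile(r"Order No:[\s\S]+?Total[\s\S]+?Quantity:.*")

# Step 2: single-value fields, with capture groups replacing the lookbehinds.
FIELD_PATTERNS = {
    "Order No": re.compile(r"Order No:\s*(\S+)"),
    "Product Name": re.compile(r"Product Name:\s*(.*)"),
    "Colour": re.compile(r"Size / Colour breakdown\r?\n(.*)"),
}

# Step 2 (continued): all sizes plus the total, as named groups.
SIZES_RE = re.compile(
    r"Total\s*S\s\(S\)\*(?P<Small>.*)\s*M\s\(M\)\*(?P<Medium>.*)\s*"
    r"L\s\(L\)\*(?P<Large>.*)\s*XL\s\(XL\)\*(?P<XL>.*)\s*"
    r"Quantity:(?P<Total>.*)"
)


def extract_rows(text):
    """Return one dict per order section found in the text."""
    rows = []
    for section in SECTION_RE.findall(text):
        row = {}
        for column, pattern in FIELD_PATTERNS.items():
            match = pattern.search(section)
            row[column] = match.group(1).strip() if match else ""
        sizes = SIZES_RE.search(section)
        if sizes:
            row.update({k: v.strip() for k, v in sizes.groupdict().items()})
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE_TEXT)
for row in rows:
    print(row)
```

Each dict corresponds to one row of the target Excel sheet. Because the text of all pages is read as one string and then split per order, no page-by-page handling is needed.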

Let us know if you are not able to get the required output.

It's working, SupermanPunch. Many thanks!
But I have one case left. I tried it with all the data (27 pages), and two pages do not follow the standard layout of the other pages (there is no size S), for example page 9.

The extracted Excel is not correct for those pages (see row 10, which belongs to page 9).

Please advise.

Thanks, Superman.

@mycroft7 ,

Could you replace the regex assigned to mc_Sizes with the one below and check whether it gets the data:

Total\s*(S\s\(S\)\*(?<Small>.*)\s*)?(M\s\(M\)\*(?<Medium>.*)\s*)?(L\s\(L\)\*(?<Large>.*)\s*)?(XL\s\(XL\)\*(?<XL>.*)\s*)?Quantity:(?<Total>.*)
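To see what the optional groups do, here is a small Python sketch of the same pattern (again with `(?P<Name>...)` in place of the .NET named-group syntax). The input is a hypothetical section with no S line, assumed to resemble page 9. A size that is absent leaves its group unmatched, which is turned into an empty cell here.

```python
import re

# Same pattern as above, translated to Python named-group syntax.
SIZES_OPTIONAL_RE = re.compile(
    r"Total\s*(S\s\(S\)\*(?P<Small>.*)\s*)?(M\s\(M\)\*(?P<Medium>.*)\s*)?"
    r"(L\s\(L\)\*(?P<Large>.*)\s*)?(XL\s\(XL\)\*(?P<XL>.*)\s*)?"
    r"Quantity:(?P<Total>.*)"
)

# Hypothetical section without a size S line (an assumption, not the real PDF).
NO_SMALL = "Total\nM (M)*20\nL (L)*30\nXL (XL)*40\nQuantity: 90"


def parse_sizes_optional(section):
    """Return the size columns, using "" for any size that is absent."""
    match = SIZES_OPTIONAL_RE.search(section)
    if match is None:
        return None  # layout not recognised at all
    # An optional group that did not take part in the match yields None.
    return {name: (value or "").strip() for name, value in match.groupdict().items()}


sizes = parse_sizes_optional(NO_SMALL)
print(sizes)
```

Note that the whole match still fails (the function returns `None`) when a size line uses a label the pattern does not know, since nothing can then sit between `Total` and `Quantity:`.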

I have tried it. It is correct for page 9, but the rest are still empty in the extracted Excel.

Thanks

@mycroft7 ,

We might need to check the differences in the patterns for that data. Is it possible to provide a PDF sample containing the data that is failing?

Hi Superman, here is a sample PDF file with different sizes, as attached.
Sample pdf.pdf (142.8 KB)

Thanks a lot, Superman.

@mycroft7 ,

I don't think the PDF provided contains the failing data, because I was able to get the output using the regex provided:

Data such as Mexico, Malaysia and Thailand is not available in the sample PDF you have provided.

Sorry, the sample below includes the data for Mexico and Malaysia.
Thanks.
Sample pdf.pdf (152.0 KB)

@mycroft7 ,

Two different labels were identified for some of the sizes:

  1. Small = S or CH
  2. Large = L or G
  3. XL = XL or XG

Next, inside the brackets there were variations: it was not only (S) but also (S/S).

The first part, recognising the different labels for each size (S or CH for Small), has to be known in advance, and we can incorporate it into the regex as done in the modified regex below. The second part, the changing values inside the brackets, is handled dynamically: the regex accepts any value inside the brackets.

However, a thorough check with the different samples and formats that you receive needs to be done to confirm that the modified regex below works properly.

Check the modified regex below and let us know if it doesn't work for all cases:

Total\s*((S|CH)\s\(.*?\)\*(?<Small>.*)\s*)?(M\s\(.*?\)\*(?<Medium>.*)\s*)?((L|G)\s\(.*?\)\*(?<Large>.*)\s*)?((XL|XG)\s\(.*?\)\*(?<XL>.*)\s*)?Quantity:(?<Total>.*)
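One way to run the check suggested above is to keep the known labels in a table, build the pattern from it, and test it against every layout you have seen. The Python sketch below does that. The label table comes from the list above, while `SIZE_LABELS`, `build_sizes_pattern` and both sample strings are hypothetical names and data made up for illustration. It uses non-capturing `(?:...)` groups for the alternatives, which is a small change from the regex above and does not affect the named groups.

```python
import re

# Known label spellings per column, in the order they appear in the PDF.
SIZE_LABELS = {
    "Small": ["S", "CH"],
    "Medium": ["M"],
    "Large": ["L", "G"],
    "XL": ["XL", "XG"],
}


def build_sizes_pattern(labels):
    """Build one optional, named group per size from the label table."""
    parts = []
    for column, names in labels.items():
        alternatives = "|".join(re.escape(name) for name in names)
        # \(.*?\) accepts any value inside the brackets, e.g. (S) or (S/S).
        parts.append(rf"(?:(?:{alternatives})\s\(.*?\)\*(?P<{column}>.*)\s*)?")
    return re.compile(r"Total\s*" + "".join(parts) + r"Quantity:(?P<Total>.*)")


SIZES_RE = build_sizes_pattern(SIZE_LABELS)


def parse_sizes(section):
    """Return the size columns, "" for absent sizes, None if nothing matches."""
    match = SIZES_RE.search(section)
    if match is None:
        return None
    return {name: (value or "").strip() for name, value in match.groupdict().items()}


# Hypothetical samples, one per layout mentioned in this thread.
STANDARD = "Total\nS (S)*10\nM (M)*20\nL (L)*30\nXL (XL)*40\nQuantity: 100"
ALTERNATE = "Total\nCH (S/S)*5\nM (M/M)*6\nG (L/L)*7\nXG (XL/XL)*8\nQuantity: 26"

for sample in (STANDARD, ALTERNATE):
    print(parse_sizes(sample))
```

If a new label shows up in a later PDF, only the table needs a new entry. Any section for which `parse_sizes` returns `None` is worth inspecting by hand.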

Thanks for the explanation. After replacing the regex as you mentioned, it shows as below: the Total column is empty in every row.

@mycroft7 ,

Apologies, there was a typo in the post above. I have corrected it. The trailing ``` is not required in the regex.