Project Approach and Process Help - bulk pdf files

Bei_Jing · June 19, 2023, 7:54am

I tried Document Understanding for this project and learned a lot about UiPath along the way, but it didn’t solve my problem. Because of the processing speed, I have decided to use PDF to Text and manipulate the resulting string instead of brute-forcing page by page. The project covers about 50,000 PDF pages across about 50 files, so each file is about 1,000 pages. I have attached a 4-page sample showing the format.
Invoice Examples.pdf (39.2 KB)

Requirements (the relevant text is highlighted in blue in the PDF for contrast):

I am only interested in extracting the “Special Service” table, so pages 1, 2, and 4 of the sample are needed; the invoice on page 3 has no Special Service. I initially brute-forced it: I extracted each PDF page into a separate single-page document and then scanned for the “Special Service” keyword to delete the unwanted pages. That took 3 hours for one 1000-page PDF file. The page numbering printed in the document does not correspond to the page positions in the 1000-page PDF file, so I can’t use it.
I am thinking of first extracting the entire 1000-page PDF to text and using “Thank you for shopping with Us” as a page break to identify each page, then searching for “Special Service” to drop the unwanted pages.
I quickly got stuck on the new “For Each” interface when dealing with “Matches.” I would appreciate some guidance on how to go about this.
The Invoice, Customer Name, and email fields are straightforward using Regex.
The “Special Service” table is another issue. I used PDF to Text both unformatted and formatted. The formatted output seems better, as the description lines up, but Regex is not helping. I also tried the Document Understanding approach and failed miserably with the 2nd and 3rd lines of text in the Description column, as well as with the Service Code. For the most part, I can only get the first line of text using Regex. I don’t need the title row of the “Special Service” table.
I have also tried “Split to Left/Right” with the “Generate DataTable from Text” wizard to capture the borderless table format.
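
To make the split-then-filter idea concrete, here is a minimal sketch in Python rather than a UiPath workflow. It assumes the whole PDF has already been extracted into one string, that every invoice ends with the exact closing line “Thank you for shopping with Us”, and that the keyword “Special Service” appears verbatim on the wanted invoices. The function names split_invoices and pages_with_keyword are made up for illustration.

```python
# Assumed markers, taken from the sample invoices; adjust if the real text differs.
PAGE_END = "Thank you for shopping with Us"
KEYWORD = "Special Service"


def split_invoices(text, marker=PAGE_END):
    """Split the full extracted text into one chunk per invoice.

    Each invoice is assumed to end with `marker`, so the marker is
    re-attached to every chunk that was followed by it.
    """
    chunks = text.split(marker)
    # Every chunk except the last one was terminated by the marker.
    pages = [c.strip() + "\n" + marker for c in chunks[:-1] if c.strip()]
    # Keep any trailing text that has no closing marker (e.g. a cut-off page).
    tail = chunks[-1].strip()
    if tail:
        pages.append(tail)
    return pages


def pages_with_keyword(pages, keyword=KEYWORD):
    """Return (position, text) pairs for the invoices that contain `keyword`.

    The position is counted from 1 in split order, so it does not depend
    on the page numbers printed inside the document.
    """
    return [(i, p) for i, p in enumerate(pages, start=1) if keyword in p]
```

With the 4-page sample, pages_with_keyword(split_invoices(full_text)) should keep the invoices at positions 1, 2, and 4, where full_text is a placeholder for the extracted string. Because this is a single pass over one string, it avoids writing 1,000 single-page files per PDF.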

Appreciate some help from experts in this forum. Thanks in advance.

Bei
