I tried Document Understanding for this project and learned a lot about UiPath along the way, but it didn’t solve my problem. Because of speed, I have decided to use PDF to Text and manipulate the resulting string instead of brute-forcing page by page. The project has about 50,000 PDF pages across roughly 50 files, so each file is about 1,000 pages. I have attached a 4-page sample showing the format.
Invoice Examples.pdf (39.2 KB)
Requirements (the relevant text is highlighted in blue in the PDF for contrast):
- I am only interested in extracting the “Special Service” table. In the sample, pages 1, 2, and 4 are needed; the invoice on page 3 has no Special Service table. My first attempt was brute force: I extracted each PDF page into its own single-page document, then scanned each one for the “Special Service” keyword and deleted the pages without it. That took 3 hours for one 1000-page PDF file. The page numbering printed in the PDF does not correspond to the page positions in the 1000-page file, so I can’t use it.
  I am now thinking of extracting the entire 1000-page PDF to text first, using “Thank you for shopping with Us” as a page-break marker to identify each page, and then searching for “Special Service” to drop the unwanted pages.
  I quickly got stuck on the new “For Each” interface when dealing with “Matches”. I would appreciate some guidance on how to go about this.
- The Invoice, Customer Name, and email are straightforward using Regex.
- The Special Service table is a different matter. I tried PDF to Text in both unformatted and formatted modes. The formatted output seems better because the descriptions line up, but Regex still doesn’t help. I also tried the Document Understanding approach and failed miserably on the 2nd and 3rd lines of text in the Description column, as well as on the Service Code. For the most part, I can only get the first line of text using Regex. I don’t need the title row of the “Special Service” table.
I have also tried the “Split to Left/Right” option in the “Generate DataTable from Text” wizard to capture the borderless table format.
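To make the page-break idea concrete, here is a minimal sketch of the logic in Python. It is only meant to illustrate the steps, not UiPath code; in a workflow the same thing would presumably be done by splitting the string and looping over the pieces. The function names and the sample text are made up for illustration, and it assumes the footer and keyword appear in the extracted text exactly as spelled here.

```python
# Illustration only: hypothetical helper names, made-up sample text.
PAGE_END_MARKER = "Thank you for shopping with Us"
KEYWORD = "Special Service"


def split_into_invoices(full_text, marker=PAGE_END_MARKER):
    """Split the extracted text into one chunk per invoice, using the footer as a separator."""
    chunks = []
    for part in full_text.split(marker):
        part = part.strip()
        if part:  # skip the empty piece left after the last footer
            chunks.append(part)
    return chunks


def keep_special_service(chunks, keyword=KEYWORD):
    """Return (position, text) pairs for the chunks that contain the keyword."""
    return [(i, c) for i, c in enumerate(chunks, start=1) if keyword in c]


# Four fake invoices; the third has no Special Service table.
sample = (
    "Invoice A\nSpecial Service\nrow 1\nThank you for shopping with Us\n"
    "Invoice B\nSpecial Service\nrow 2\nThank you for shopping with Us\n"
    "Invoice C\nno extra table\nThank you for shopping with Us\n"
    "Invoice D\nSpecial Service\nrow 3\nThank you for shopping with Us\n"
)

chunks = split_into_invoices(sample)
kept = keep_special_service(chunks)
print([i for i, _ in kept])  # [1, 2, 4]
```

The position returned with each chunk is its order in the extracted text, not the page number printed on the invoice, so it does not depend on the numbering inside the PDF.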
I would appreciate some help from the experts in this forum. Thanks in advance.
Bei