Extract data PDF Issue

Hi all,

I’m working on an automation where I need to extract information from a quote and arrange it in a certain way. I need Part Numbers, Descriptions, and Prices.

I found a way to extract the Part Numbers with RegEx. However, when I started working on the description extraction, I noticed that some of the line items were not captured cleanly. The extracted text looks like this:

LINE NO. PART NO. DESCRIPTION LIST PRICE QUOTE PRICE QTY EXTENDED PRICE

1 210-BGNZ Dell Mobile Precision Workstation 7780 CTOG $7,279.07 $4,705.92 OM 27 $127,059.84
TAA
Dell Federal Systems L.P. c/o Dell USA L.P. -
210-BGNZ

2 379-BBBW TAA Information $0.00 $0.00 OM 27 $0.00
Dell Federal Systems L.P. c/o Dell USA L.P. -
379-BBBW

3 583-BHBG English US backlit keyboard with numeric $0.00 $0.00 OM 27 $0.00
keypad, 99-key
Dell Federal Systems L.P. c/o Dell USA L.P. -
583-BHBG

  • The numbers 1, 2, and 3 are the line item numbers.
  • 210-BGNZ, 379-BBBW, and 583-BHBG are the part numbers of each line item.
  • The part with bold letters is the description.

As I said, I was able to extract the part numbers, but I can’t use a similar process for the description. Sometimes the description fits on one line (easy), but in other cases, like line item 3, it continues after the prices. It looks like the Read PDF activity picked up a hard return, so the rest of the description lands on the next line, after the price columns.
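
One possible way around the wrapped description, sketched in Python rather than as a UiPath workflow: treat every line between the item line and the vendor line as part of the description. The part-number pattern, the `Dell Federal Systems` marker, and the name `extract_items` are assumptions based only on the sample above (including the guess that `TAA` belongs to the first description), so this will likely need adjusting for the full quote.

```python
import re

# Item line: line no., part no., start of description, then the numeric columns.
# The part-number shape (3 digits, dash, 4 characters) is assumed from the sample.
ITEM_RE = re.compile(
    r"^(?P<line>\d+)\s+(?P<part>\d{3}-[A-Z0-9]{4})\s+(?P<desc>.+?)\s+"
    r"\$[\d,]+\.\d{2}\s+\$[\d,]+\.\d{2}\s+\w+\s+\d+\s+\$[\d,]+\.\d{2}\s*$"
)

# Assumed marker: the vendor line that follows every description in the sample.
VENDOR_PREFIX = "Dell Federal Systems"

SAMPLE = """\
LINE NO. PART NO. DESCRIPTION LIST PRICE QUOTE PRICE QTY EXTENDED PRICE

1 210-BGNZ Dell Mobile Precision Workstation 7780 CTOG $7,279.07 $4,705.92 OM 27 $127,059.84
TAA
Dell Federal Systems L.P. c/o Dell USA L.P. -
210-BGNZ

3 583-BHBG English US backlit keyboard with numeric $0.00 $0.00 OM 27 $0.00
keypad, 99-key
Dell Federal Systems L.P. c/o Dell USA L.P. -
583-BHBG
"""


def extract_items(text):
    """Return one dict per line item, with wrapped description lines re-joined."""
    items = []
    current = None
    collecting = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ITEM_RE.match(line)
        if match:
            # New line item: keep the description text found before the prices.
            current = {
                "line": int(match["line"]),
                "part": match["part"],
                "description": match["desc"],
            }
            items.append(current)
            collecting = True
        elif line.startswith(VENDOR_PREFIX):
            # The vendor line ends the description for this item.
            collecting = False
        elif collecting and current is not None:
            # Continuation of a wrapped description.
            current["description"] += " " + line
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item["line"], item["part"], item["description"])
```

If the expression is moved into a UiPath Matches activity or a .NET `Regex`, the named groups would be written as `(?<name>...)` instead of Python's `(?P<name>...)`.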

Can anybody help me get around this situation?

Hi @mardoza ,

Could you try enabling PreserveFormat in the Read PDF Text activity and then check whether the logical pieces are properly aligned for extraction? We might need to change the extraction method used earlier.

Hi @supermanPunch,

Thank you for your quick response. I enabled that property, and I got the below.

I’m using an image to preserve the format here. That is a Notepad file. Still not sure how to work around it. Any idea?

Update

Using RegEx, I was able to move closer to the solution. I have this:

1       210-BGNZ                         Dell Mobile Precision Workstation 7780 CTOG                   $7,279.07           $4,705.92     OM         27               $127,059.84

Again, the above is my way of preserving the format.

How can I extract only the description?
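
For a single-line item like the one above, one option is to split the line on runs of two or more spaces and take the third column. This is only a sketch in Python: it assumes the words inside a description are separated by single spaces and that neighbouring columns are always at least two spaces apart, which may not hold for every line in the quote. The name `split_columns` is made up for illustration.

```python
import re

# The line as it appears with the format preserved (exact space counts do not matter,
# as long as columns are separated by at least two spaces).
LINE = (
    "1       210-BGNZ                         "
    "Dell Mobile Precision Workstation 7780 CTOG                   "
    "$7,279.07           $4,705.92     OM         27               $127,059.84"
)


def split_columns(line):
    """Split a format-preserved line into columns on runs of 2+ whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


columns = split_columns(LINE)
# Column order assumed from the header: line no., part no., description, prices, ...
description = columns[2]

if __name__ == "__main__":
    print(description)
```

The same pattern, `\s{2,}`, should also work with a regex split in .NET, so the idea is not tied to Python.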

Ok. This is the text I extracted from the PDF:

I’m attaching the workflow, so you have a better understanding.

Extraction Processes.xaml (36.5 KB)

I need to extract the highlighted portions. That is the description for each product line.
The problem is in lines like #4 and #8, where the description spans more than one line. With RegEx, the prices and QTY end up in the middle of the text I need, so I have to remove them.
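
A rough sketch of the "remove the middle" idea, in Python: flatten the lines of one item into a single string, cut out the price and quantity columns, then drop the leading line number and part number. It assumes the lines of one item have already been isolated, and that the numeric columns always look like the ones in the sample (two prices, a unit code, a quantity, and an extended price). `PRICE_BLOCK` and `description_from_block` are illustrative names.

```python
import re

# Two prices, a unit code such as "OM", a quantity, and the extended price.
# This column layout is assumed from the sample lines in this thread.
PRICE_BLOCK = re.compile(
    r"\$[\d,]+\.\d{2}\s+\$[\d,]+\.\d{2}\s+\w+\s+\d+\s+\$[\d,]+\.\d{2}"
)

# Text of one line item, where the description wraps onto a second line.
BLOCK = """\
3 583-BHBG English US backlit keyboard with numeric $0.00 $0.00 OM 27 $0.00
keypad, 99-key
"""


def description_from_block(block):
    """Return the description of one item with the numeric columns removed."""
    # Collapse line breaks and repeated spaces into single spaces.
    flat = " ".join(block.split())
    # Remove the price/quantity columns that sit in the middle of the text.
    flat = PRICE_BLOCK.sub(" ", flat, count=1)
    # Drop the leading line number and part number.
    flat = re.sub(r"^\d+\s+\S+\s+", "", flat)
    # Tidy up any double spaces left behind by the removal.
    return " ".join(flat.split())


if __name__ == "__main__":
    print(description_from_block(BLOCK))
```

Because the whitespace is normalised first, the same approach should work whether or not the format was preserved when reading the PDF.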

Any idea?

Hi @mardoza,

Can you provide the PDF? I did a similar table extraction from a PDF recently.

Hi,

I’m attaching part of the file. Lines 17, 19, and 21 are good test cases, as their descriptions span multiple lines. I’m also including the text file I get when I read the PDF with a UiPath activity.


file in text.txt (19.8 KB)
Dell HW Quote short.pdf (172.9 KB)