Extract data PDF Issue

mardoza · November 3, 2023, 3:27pm

Hi all,

I’m working on an automation where I need to extract information from a quote and arrange it in a certain way. I need Part Numbers, Descriptions, and Prices.

I found a way to extract the Part Numbers with the help of RegEx. However, when I started working on the description extraction, I noticed that some of the line items were not captured cleanly. The extracted text looks like this:

LINE NO. PART NO. DESCRIPTION LIST PRICE QUOTE PRICE QTY EXTENDED PRICE

1 210-BGNZ Dell Mobile Precision Workstation 7780 CTOG $7,279.07 $4,705.92 OM 27 $127,059.84
TAA
Dell Federal Systems L.P. c/o Dell USA L.P. -
210-BGNZ

2 379-BBBW TAA Information $0.00 $0.00 OM 27 $0.00
Dell Federal Systems L.P. c/o Dell USA L.P. -
379-BBBW

3 583-BHBG English US backlit keyboard with numeric $0.00 $0.00 OM 27 $0.00
keypad, 99-key
Dell Federal Systems L.P. c/o Dell USA L.P. -
583-BHBG

The numbers 1, 2, and 3 are the line item numbers.
210-BGNZ, 379-BBBW, and 583-BHBG are the part numbers of each line item.
The part with bold letters is the description.

As I said, I was able to extract the part numbers, but I can’t use a similar process for the description. Sometimes the description fits on one line (easy), but in other cases, like line item 3, it continues after the prices. It looks like the Read PDF Text activity picked up a hard return, so the rest of the description ends up on the following line.

Can anybody help me get around this situation?
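To make the wrapped-description problem concrete, here is a minimal sketch in Python (not a UiPath workflow) run against the sample text above. It assumes, without confirmation from the full quote, that every item row ends with the extended price, and that any lines between an item row and the "Dell Federal Systems" line are the rest of the description. The `parse_items` helper and the `END_MARKER` constant are hypothetical names used only for illustration.

```python
import re

# Sample text copied from the post (Read PDF Text output without PreserveFormat).
RAW = """\
1 210-BGNZ Dell Mobile Precision Workstation 7780 CTOG $7,279.07 $4,705.92 OM 27 $127,059.84
TAA
Dell Federal Systems L.P. c/o Dell USA L.P. -
210-BGNZ

2 379-BBBW TAA Information $0.00 $0.00 OM 27 $0.00
Dell Federal Systems L.P. c/o Dell USA L.P. -
379-BBBW

3 583-BHBG English US backlit keyboard with numeric $0.00 $0.00 OM 27 $0.00
keypad, 99-key
Dell Federal Systems L.P. c/o Dell USA L.P. -
583-BHBG
"""

# Item row: line no., part no., description (lazy), list price, quote price,
# unit, quantity, extended price. Anchoring on the prices keeps them out of
# the description group.
PRICE = r"\$[\d,]+\.\d{2}"
ITEM = re.compile(
    rf"^(?P<line>\d+)\s+(?P<part>\S+)\s+(?P<desc>.+?)\s+"
    rf"(?P<list>{PRICE})\s+(?P<quote>{PRICE})\s+"
    rf"(?P<unit>\S+)\s+(?P<qty>\d+)\s+(?P<ext>{PRICE})$"
)

# Assumption: this vendor line always follows the description and marks its end.
END_MARKER = "Dell Federal Systems"


def parse_items(text):
    """Return one dict per line item, with wrapped description lines joined."""
    items = []
    collecting = False  # True while we are still inside a description
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = ITEM.match(line)
        if match:
            items.append(match.groupdict())
            collecting = True
        elif collecting and line and not line.startswith(END_MARKER):
            # Continuation of the previous item's description.
            items[-1]["desc"] += " " + line
        else:
            # Blank line, vendor line, or repeated part number.
            collecting = False
    return items


for item in parse_items(RAW):
    print(item["line"], item["part"], "|", item["desc"])
```

The same pattern should carry over to a RegEx-based step in the workflow, since named groups and lazy quantifiers are also available in .NET regular expressions, but the end-of-description rule would need checking against the complete quote.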

supermanPunch · November 3, 2023, 3:33pm

Hi @mardoza ,

Could you try enabling PreserveFormat in the Read PDF Text activity and then check whether the logical pieces are properly aligned for extraction? We might need to change the extraction method used earlier.

mardoza · November 3, 2023, 3:45pm

Hi @supermanPunch,

Thank you for your quick response. I enabled that property, and I got the below.

I’m using an image to preserve the format here. That is a Notepad file. Still not sure how to work around it. Any idea?

mardoza · November 3, 2023, 4:33pm

Update

Using RegEx, I was able to move closer to the solution. I have this:

1       210-BGNZ                         Dell Mobile Precision Workstation 7780 CTOG                   $7,279.07           $4,705.92     OM         27               $127,059.84

Again, the above is my way of preserving the format.

How can I extract only the description?
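With the format preserved, one option is to treat runs of two or more spaces as column separators. The sketch below is in Python and uses the line above as input. It assumes that a description never contains two consecutive spaces and that the columns are always at least two spaces apart, which has not been verified against the whole file. `split_columns` is a hypothetical helper name.

```python
import re

# The format-preserved line from the update above.
line = (
    "1       210-BGNZ                         "
    "Dell Mobile Precision Workstation 7780 CTOG                   "
    "$7,279.07           $4,705.92     OM         27               $127,059.84"
)


def split_columns(text):
    """Split a format-preserved row on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", text.strip())


columns = split_columns(line)
# Column order follows the quote header:
# LINE NO., PART NO., DESCRIPTION, LIST PRICE, QUOTE PRICE, unit, QTY, EXTENDED PRICE
description = columns[2]
print(description)  # Dell Mobile Precision Workstation 7780 CTOG
```

Single spaces inside the description survive because only wider gaps are treated as separators. A description that wraps onto a second line would still need separate handling.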

mardoza · November 16, 2023, 2:28pm

Ok. This is the text I extracted from the PDF:

I’m attaching the workflow, so you have a better understanding.

Extraction Processes.xaml (36.5 KB)

I need to extract the highlighted portions. That is the description for each product line.
The problem is in lines like 4 and 8, where the description spans more than one line. With RegEx, the prices and QTY end up in the middle of the text I need, and I have to remove them.

Any idea?

rajneesh94 · November 17, 2023, 6:40am

Hi @mardoza,

Can you provide the PDF? I did a similar table extraction from a PDF recently.

mardoza · November 17, 2023, 1:20pm

Hi,

I’m attaching part of the file. Lines 17, 19, and 21 can be used as examples, since their descriptions span multiple lines. I’m also including the text file I get when I read the PDF with a UiPath activity.

file in text.txt (19.8 KB)
Dell HW Quote short.pdf (172.9 KB)
