Unable to extract the correct data from PDF

kirankumar.mahanthi1 · November 8, 2021, 10:16am

Hi All,

We tried to extract the table from the PDF shown in the screenshot below (PDF Screenshot.PNG), but we could not extract the entire table as it is. When we used the Read PDF activity, we received the output shown in the second screenshot (Output.PNG).

This PDF contains all the updated product prices. With the extracted data we cannot identify the exact price of each product, because the captured information is not aligned. Is there any way to extract the entire table as it is? Any help would be greatly appreciated.

THIRU_NANI · November 8, 2021, 10:22am

Hi! Is that a native PDF or a scanned PDF? If it is a native PDF, there is no need to use the PDF activities. Just follow the steps below:

1. Use Attach Window and indicate the PDF, then use Data Scraping to extract the entire table. With Write Range you can then write this data to Excel.

If the PDF is a scanned PDF, follow the steps below:

1. Use Open PDF with OCR, pass the file path, and use the Tesseract OCR engine to get the data.

If you still do not get the data you expect, use Get OCR Text.

Regards,
NaNi

kirankumar.mahanthi1 · November 8, 2021, 10:33am

Hi,

Thanks for your response. It is a native PDF. We have already tried extracting the content with the Extract Table UI activity, but it throws an error. After that we tried Read PDF Text. We can extract the whole content from the PDF, but the extracted data is not aligned because some price values in the PDF are blank. Could you please read our question and share some useful suggestions?

Regards,
Kirankumar.

THIRU_NANI · November 8, 2021, 10:47am

Can you share the PDF file with me, if possible?

Have you tried Get Full Text and Get Full Text with OCR?

Regards,
NaNi

kirankumar.mahanthi1 · November 8, 2021, 11:03am

Hi,

I am sharing the PDF for reference. We have already tried Read PDF OCR text and other methods, and we even tried Python code to extract the data into a data frame, but we got the same result with unaligned data. Please find the output screenshot for your reference. As mentioned in our question, we are able to get the data, but because of the alignment problem we cannot tell which price belongs to which product.
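To illustrate the alignment problem: a plain text dump loses the blank cells, so the remaining prices shift left. If an extractor can report each word together with its position on the page, the words can be placed into columns by x-coordinate instead. The sketch below is not tied to any particular PDF library. The word boxes, the column edges, the item names, and the prices are all made-up sample values, and `rows_from_words` is a hypothetical helper.

```python
from bisect import bisect_right

# Hypothetical word boxes (text, x0, top), as a layout-aware extractor might report them.
# The second row has no value in the first price column.
WORDS = [
    ("11-PB", 40.0, 100.0), ("Item A", 120.0, 100.0), ("12.50", 300.0, 100.0), ("14.00", 380.0, 100.0),
    ("12-PB", 40.0, 115.0), ("Item B", 120.0, 115.0), ("15.25", 380.0, 115.0),
]

# Assumed left edge of each column, e.g. read off the header row of the table.
COLUMN_STARTS = [30.0, 110.0, 290.0, 370.0]


def rows_from_words(words, column_starts, row_tolerance=3.0):
    """Group words into rows by vertical position, then put each word in the
    last column whose left edge is at or before the word's x0."""
    rows = []  # list of (top_of_row, cells)
    for text, x0, top in sorted(words, key=lambda w: (w[2], w[1])):
        # Start a new row when the word sits clearly below the current row.
        if not rows or abs(top - rows[-1][0]) > row_tolerance:
            rows.append((top, [""] * len(column_starts)))
        col = max(bisect_right(column_starts, x0) - 1, 0)
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


table = rows_from_words(WORDS, COLUMN_STARTS)
for row in table:
    print(row)
```

With this approach the blank price stays an empty cell in its own column, so each remaining price still lines up with the right part number.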

Regards,
Kirankumar.

Rockwood Products.pdf (970.9 KB)

THIRU_NANI · November 8, 2021, 11:24am

Please find the attached text file.

Is this your required output?

pdf.txt (2.7 KB)

I dumped it into a text file. If you use Write CSV, this will give you the table format.

Regards,
NaNi

kirankumar.mahanthi1 · November 8, 2021, 11:56am

Hi,

Thanks for your effort on this. In the attached output the price values come through properly, but the part numbers are wrong. We have seen this problem whenever we use OCR to extract the text from this PDF. For example, part number 11-PB comes out as “how” in the output. In actual use we will compare the part numbers from the PDF and extract the appropriate price. Do you have any other alternatives to get the proper data for the part numbers?
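Since the price lookup depends on the part numbers, one safeguard is to check every extracted part number against the expected pattern before using it, so that OCR misreads get flagged instead of silently matched. This is only a sketch: the pattern is an assumption based on the single example 11-PB, the prices are made-up sample values, and `build_price_lookup` is a hypothetical helper.

```python
import re

# Assumed part-number shape (digits, hyphen, capital letters), inferred only
# from the example "11-PB". Adjust it to the real catalogue.
PART_NUMBER = re.compile(r"\d+-[A-Z]+")


def build_price_lookup(rows):
    """Return (lookup, rejected): prices keyed by part number, plus the rows
    whose first cell does not look like a part number."""
    lookup, rejected = {}, []
    for part, price in rows:
        key = part.strip().upper()
        if PART_NUMBER.fullmatch(key):
            lookup[key] = price.strip()
        else:
            rejected.append((part, price))
    return lookup, rejected


# Sample rows: one clean, one OCR misread, one with stray spaces and a blank price.
rows = [("11-PB", "12.50"), ("how", "14.00"), (" 12-pb ", "")]
lookup, rejected = build_price_lookup(rows)
print(lookup)
print(rejected)
```

Rows that land in `rejected` can then be re-read with another method or reviewed manually.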

Thanks again.

VIDHATHA_THADURI · November 8, 2021, 12:03pm

Hi, @kirankumar.mahanthi1

In case you are familiar with Document Understanding, you can use the “Form Extractor” to extract that particular table as it is.

kirankumar.mahanthi1 · November 8, 2021, 1:21pm

Hi,

Yes, we are familiar with it, but we currently don't have Document Understanding. Could you please suggest an approach that does not use Document Understanding?

Regards,
Kirankumar.

VIDHATHA_THADURI · November 9, 2021, 10:35am

Hi @kirankumar.mahanthi1

You can go with “CV table extraction”.
Otherwise, use the Generate Data Table activity after reading the PDF.

Try reading with different OCR engines. One of them may give better results than the Read PDF activity.
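As a rough illustration of building a table from the text after reading the PDF: if the text keeps the original column spacing, each line can be cut at fixed character offsets, so a blank price stays an empty cell. The Python sketch below shows that idea only and is not the activity itself. It assumes the text preserves the layout, and the sample lines, the offsets, and `split_fixed_width` are all hypothetical.

```python
def split_fixed_width(line, boundaries):
    """Cut one line at the given character offsets and strip each cell."""
    edges = list(boundaries) + [None]  # None means "to the end of the line"
    return [line[start:end].strip() for start, end in zip(edges, edges[1:])]


# Hypothetical layout-preserving lines, built with padding so the column
# widths (8, 12, 8 characters, then the rest) are explicit.
TEXT_LINES = [
    f"{'11-PB':<8}{'Item A':<12}{'12.50':<8}{'14.00'}",
    f"{'12-PB':<8}{'Item B':<12}{'':<8}{'15.25'}",  # first price is blank
]

# Assumed start offset of each column.
BOUNDARIES = [0, 8, 20, 28]

table = [split_fixed_width(line, BOUNDARIES) for line in TEXT_LINES]
for row in table:
    print(row)
```

This only works when the extracted text really keeps the spacing. If the blanks are collapsed during reading, the offsets no longer match the columns.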
