Unable to extract the correct data from PDF

Hi All,

We tried to extract the table from the PDF shown in the screenshot below (PDF Screenshot.PNG), but we were not able to extract the entire table as it is. When we used the Read PDF Text activity, we received the output shown in the screenshot below (Output.PNG).

This PDF contains all the updated prices of the products. With the extracted data we are unable to identify the exact price of a product, because the information we captured is not aligned. Is there any way to extract the entire table as it is? Any help would be greatly appreciated.

Hi! Is that a native PDF or a scanned PDF? If it is a native PDF, there is no need to use the PDF activities. Just follow the steps below:

1. Use Attach Window and indicate the PDF, then use Data Scraping to extract the entire table. With Write Range you can get this data into Excel.

If the PDF is a scanned PDF, please follow the steps below:

1. Use the Read PDF With OCR activity, pass the file path, and use the Tesseract OCR engine to get the data.

If you still do not get the data you are expecting, use Get OCR Text.

Regards,
NaNi

Hi,

Thanks for your response. It is a native PDF. We have already tried extracting the content with the Extract Table UI activity, but it throws an error. After that we tried Read PDF Text. We are able to extract the whole content of the PDF, but the extracted data is not aligned because some price values in the PDF are blank. Could you please read our question and share some useful suggestions?

Regards,
Kirankumar.

Can you share the PDF file with me, if possible?

Have you tried Get Full Text and Get Full Text with OCR?

Regards,
NaNi

Hi,

I am sharing the PDF for reference. We have already tried Read PDF With OCR and other methods, and we have even tried Python code to extract the data into a data frame, but we got the same result with unaligned data. Please find the output screenshot for your reference. As mentioned in our question, we are able to get the data, but because of the alignment problem we are unable to recognize which price belongs to which product.

Regards,
Kirankumar.

Rockwood Products.pdf (970.9 KB)
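
The misalignment described above typically appears when a row is split on whitespace: a blank price leaves no token behind, so every later value in that row shifts one column to the left. One possible workaround, sketched below in Python, is to place each word in a column by its horizontal position instead of by its order in the row. This is only a sketch under assumptions: it presumes the extraction tool can return word coordinates, and the word boxes, prices, column edges, and column names are all hypothetical values made up for illustration, not data from the attached PDF.

```python
# Hypothetical word boxes as (x_left, y_top, text). Only the part number
# "11-PB" comes from the thread; every other value is invented.
WORDS = [
    (50, 100, "11-PB"), (200, 100, "12.50"), (300, 100, "14.00"),
    (50, 120, "12-PB"), (300, 120, "15.25"),  # first price is blank here
]

# Hypothetical left edges of the three columns, e.g. read off the header row.
COLUMN_EDGES = [50, 200, 300]
COLUMN_NAMES = ["part", "price_a", "price_b"]


def nearest_column(x, edges):
    """Return the index of the column whose left edge is closest to x."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - x))


def rebuild_rows(words, edges, names, y_tolerance=3):
    """Group words into rows by y, then place each word in a column by x."""
    rows = []  # list of (row_y, {column name: cell text})
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        # Start a new row when the word sits clearly below the current row.
        if not rows or abs(rows[-1][0] - y) > y_tolerance:
            rows.append((y, {name: "" for name in names}))
        cell = names[nearest_column(x, edges)]
        existing = rows[-1][1][cell]
        rows[-1][1][cell] = f"{existing} {text}".strip()
    return [cells for _, cells in rows]


if __name__ == "__main__":
    for row in rebuild_rows(WORDS, COLUMN_EDGES, COLUMN_NAMES):
        print(row)
```

With this layout the blank cell stays an empty string in its own column, so the remaining price is still attached to the correct header.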

Please find the attached text file.

Is this your required output?

pdf.txt (2.7 KB)

I dumped it into the text file. If you use Write CSV, this will give you the table format.

Regards,
NaNi

Hi,

Thanks for your effort on this. In the attached output the price values come through properly, but the part numbers are wrong. We have seen this problem whenever we use OCR to extract the text from this PDF. For example, part number 11-PB appears as “how” in the output. In real use, we will compare the part numbers against the PDF and extract the appropriate price. Do you have any other alternatives to get proper data for the part numbers?

Thanks again.

Hi, @kirankumar.mahanthi1

In case you are familiar with Document Understanding, you can use the “Form Extractor” to extract that particular table as it is.

Hi,

Yes, we are familiar with it, but currently we don't have Document Understanding. Could you please suggest an approach that does not use Document Understanding?

Regards,
Kirankumar.

Hi @kirankumar.mahanthi1

You can go with “CV table extraction”.
Otherwise, use the Generate Data Table activity after reading the PDF.

Also try reading with different OCR engines; one of them may give better results than the Read PDF Text activity.
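
Building a table from the text after reading the PDF only keeps blank prices in place if each line is split by column position rather than by whitespace. A minimal Python sketch of that idea follows. It assumes the text read from the PDF preserves the column spacing, which may not hold for every PDF or every read setting. The sample rows, prices, column width, and offsets are hypothetical, and `make_line` exists only to build the sample text.

```python
def make_line(cells, width=10):
    """Build one sample text line with every cell padded to a fixed width."""
    return "".join(cell.ljust(width) for cell in cells).rstrip()


# Hypothetical layout-preserving text; the second data row has a blank price.
TEXT = "\n".join(make_line(row) for row in [
    ["Part", "Price A", "Price B"],
    ["11-PB", "12.50", "14.00"],
    ["12-PB", "", "15.25"],
])

# Hypothetical column start offsets, matching the width used above.
COLUMN_STARTS = [0, 10, 20]


def split_fixed_width(line, starts):
    """Slice a line at the given offsets; a blank cell becomes ''."""
    bounds = list(starts) + [None]  # None lets the last slice run to the end
    return [line[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]


def parse_table(text, starts):
    """Return one dict per data row, keyed by the header cells."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = split_fixed_width(lines[0], starts)
    return [dict(zip(header, split_fixed_width(ln, starts))) for ln in lines[1:]]


if __name__ == "__main__":
    for row in parse_table(TEXT, COLUMN_STARTS):
        print(row)
```

Splitting the same row on whitespace would return only two values, which is the shift seen in the unaligned output; slicing by offset keeps the empty cell in its own column.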