Screenshot PDF data extraction

Hi All,

I want to extract data from screenshots that were converted to a PDF file. I tried the PDF text extraction option but got blank data. Is there any other way to do this?

Thanks in advance
Niranjan

Hi @Niranjan_k

Use the Read PDF With OCR activity with the Tesseract OCR engine to extract the text from the images, then use regular expressions to extract the required data.
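
As a rough illustration of the regular-expression step (shown in Python for brevity; the same patterns should carry over to a regex activity in the workflow). The sample text and the field labels are made up, so adjust the patterns to whatever your OCR output actually looks like:

```python
import re

# Hypothetical OCR output; real text from your PDF will differ.
ocr_text = """Invoice No: INV-2041
Date : 12/03/2024
Total Amount: 1,250.50"""

# One pattern per field. \s* tolerates the uneven spacing OCR often produces.
patterns = {
    "invoice_no": r"Invoice\s*No\s*:\s*(\S+)",
    "date": r"Date\s*:\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total\s*Amount\s*:\s*([\d,]+\.\d{2})",
}

def extract_fields(text, patterns):
    """Return {field: captured value or None} for each pattern."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result

fields = extract_fields(ocr_text, patterns)
print(fields)
```

A field that is not found comes back as `None`, which makes it easy to spot pages where the OCR text did not match the expected label.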

Hope it helps!!
Regards

Hi @Niranjan_k

Use the Read PDF With OCR activity.
Change the scale and profile in the Properties panel so that the data is extracted correctly.

Also try changing the Image DPI.

Start the scale at 0 and increase it up to 5, raising it by only 0.5 each time.

Regards

@Niranjan_k

Use the Read PDF Text activity; it returns the data in a String variable.

If the PDF is scanned, use the Read PDF With OCR activity instead.
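
The decision above, sketched in Python for illustration only. `read_pdf_text` and `read_pdf_with_ocr` are hypothetical stand-ins for the two activities, not real library calls:

```python
def extract_text(pdf_path, read_pdf_text, read_pdf_with_ocr):
    """Try plain text extraction first; fall back to OCR if it comes back blank."""
    text = read_pdf_text(pdf_path)
    if text and text.strip():
        return text, "text"
    # A blank result usually means the pages are images (scans or screenshots).
    return read_pdf_with_ocr(pdf_path), "ocr"

# Stand-in readers so the sketch runs on its own.
def fake_text_reader(path):
    return ""  # simulates a screenshot-based PDF with no text layer

def fake_ocr_reader(path):
    return "text recovered by OCR"

result, method = extract_text("sample.pdf", fake_text_reader, fake_ocr_reader)
print(method, "->", result)
```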

Thanks

@mkankatala sure I’ll try and check

@vrdabberu could you please help me with a sample workflow?

@rlgandu thank you I will try and check

Hi @Niranjan_k

Could you please share a sample PDF so that I can give you the workflow?

Regards

@Niranjan_k

Try it yourself first; if you get stuck, please ask.

@rlgandu I have extracted the data to Notepad.
I can’t tell where each word begins and ends, so it’s very difficult to categorise the data. Is there any other way to get the data in the same format as it appears in the PDF?

Hi @Niranjan_k

Try changing the scale in the Tesseract OCR engine, starting from 0 and going up to 5 in steps of 0.5. Also try setting the profile to Scan; by default it is empty. If possible, share the PDF.

Happy to help if you face any difficulties.

Regards

@vrdabberu what should I set for the profile? I have set the scale to 5. Sorry, I am not able to share internal data.

@Niranjan_k

Start the scale at 1, run the workflow, and check how the data is extracted. If the extracted data is not correct, increase the scale by 0.5 and check again.
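
That trial-and-error loop could be sketched like this (Python, for illustration only). `run_ocr` and `looks_correct` are hypothetical placeholders for running Read PDF With OCR at a given scale and for your own check of the output; the range of 1 to 5 and the 0.5 step come from the advice in this thread:

```python
def scale_values(start=1.0, stop=5.0, step=0.5):
    """Yield start, start + step, ... up to and including stop."""
    count = int(round((stop - start) / step)) + 1
    for i in range(count):
        yield round(start + i * step, 2)

def find_working_scale(run_ocr, looks_correct):
    """Return (scale, text) for the first scale whose output passes the check."""
    for scale in scale_values():
        text = run_ocr(scale)
        if looks_correct(text):
            return scale, text
    return None, None

# Stand-ins so the sketch runs: pretend OCR is only readable from scale 2.5 up.
def fake_ocr(scale):
    return "Invoice No: INV-2041" if scale >= 2.5 else "lnv0ice N0 lNV"

scale, text = find_working_scale(fake_ocr, lambda t: "Invoice No" in t)
print(scale, text)
```

The check can be as simple as looking for a label you know is on the page, as in the stand-in above.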

Regards

@vrdabberu I am not sure what to select for the profile. We have three options: Screen, Scan and Legacy. Which one should I select?

Hi @Niranjan_k

It will differ from PDF to PDF. So, first try without setting any profile and change the Image DPI and scale. If that fails, change the profile to Scan and again try changing the Image DPI and scale. You need to try each combination, because we can’t say in advance which scale and which profile will work.

Regards

@vrdabberu I’m still getting the same data, not formatted output.

Hi @Niranjan_k

When using Read PDF With OCR, we will not be able to get the output in the same layout as it appears in the PDF file. Depending on the OCR engine, the data is extracted as paragraph-style text or in some other arrangement. We may be able to keep the formatting only when we use the Read PDF Text activity; for images or scanned PDFs, formatted output cannot be achieved. The scaling and other properties help only in extracting the words accurately.

Regards