Screenshot pdf data extraction

Niranjan_k · January 17, 2024, 11:14am

Hi All,

I want to extract data from screenshots that were converted to a PDF file. I tried the PDF text extraction option but got blank data. Is there any other way to do this?

Thanks in advance
Niranjan

mkankatala · January 17, 2024, 11:16am

Hi @Niranjan_k

Use the Read PDF with OCR activity with the Tesseract OCR engine to extract the text from the images. You can then use regular expressions to extract the required data.
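To illustrate the regular-expression step, here is a minimal sketch in Python (not a UiPath workflow). The sample text, the field names and the patterns are all made up for illustration; the real OCR output from your screenshots will look different and the patterns would need adjusting.

```python
import re

# Made-up OCR output; real text from a screenshot will differ and may contain noise.
ocr_text = """Invoice No : INV-10234
Date: 17/01/2024
Total Amount   1,250.50
"""

# Hypothetical field patterns. \s* and the optional separator tolerate
# the uneven spacing that OCR output often has.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"Total\s*Amount\s*[:\-]?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match for each pattern, or None when a field is missing."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(ocr_text)
print(fields)
```

Returning None for a missing field makes it easy to spot pages where the OCR text was too noisy for a pattern to match.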

Hope it helps!!
Regards

vrdabberu · January 17, 2024, 11:16am

Hi @Niranjan_k

Use the Read PDF with OCR activity.

Change the scale and profile in the Properties panel so that the data is extracted correctly.

Also try changing the Image DPI.

Try increasing the scale from 0 up to 5, in steps of 0.5 only.

Regards

rlgandu · January 17, 2024, 11:17am

@Niranjan_k

Use the Read PDF Text activity; it returns the data in a String variable.

If the PDF is scanned, use the Read PDF with OCR activity.
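The decision above can be sketched as a simple fallback, shown here in Python rather than as a workflow. The two reader functions are hypothetical stand-ins for the two activities, not real APIs.

```python
from typing import Callable, Tuple


def extract_pdf_text(
    path: str,
    read_text: Callable[[str], str],
    read_with_ocr: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the text layer first; fall back to OCR when the result is blank."""
    text = read_text(path)
    if text.strip():
        return text, "text"
    return read_with_ocr(path), "ocr"


# Hypothetical stand-ins so the sketch runs on its own.
def fake_read_text(path: str) -> str:
    # Pretend that files starting with "scan" have no text layer.
    return "" if path.startswith("scan") else "typed content"


def fake_read_with_ocr(path: str) -> str:
    return "recognized content"


print(extract_pdf_text("scan_screenshot.pdf", fake_read_text, fake_read_with_ocr))
print(extract_pdf_text("report.pdf", fake_read_text, fake_read_with_ocr))
```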

Thanks

Niranjan_k · January 17, 2024, 11:34am

@mkankatala sure I’ll try and check

Niranjan_k · January 17, 2024, 11:35am

@vrdabberu could you please help me with a sample workflow?

Niranjan_k · January 17, 2024, 11:35am

@rlgandu thank you I will try and check

vrdabberu · January 17, 2024, 11:36am

Hi @Niranjan_k

Could you please share a sample PDF so that I can give you the workflow?

Regards

rlgandu · January 17, 2024, 11:40am

@Niranjan_k

Try it yourself first. If you get stuck, please ask.

Niranjan_k · January 17, 2024, 11:58am

@rlgandu I have extracted the data to Notepad.
I can't tell where each word begins and ends, so it's very difficult to categorise the data. Is there any other way to get the data in the same format as it appears in the PDF?

vrdabberu · January 17, 2024, 12:03pm

Hi @Niranjan_k

Try changing the scale in the Tesseract OCR engine from 0 to 5, increasing it by 0.5 each time. Also try changing the profile to Scan and run it once; by default the profile is empty. If possible, share the PDF.

Happy to help if you face any difficulties.

Regards

Niranjan_k · January 17, 2024, 1:10pm

@vrdabberu what should I set for the profile and scale? I have set the scale to 5. Sorry, I do not have access to share internal data.

vrdabberu · January 17, 2024, 1:12pm

@Niranjan_k

Start the scale at 1, run the workflow, and check how the data is extracted. If the extracted data is not correct, increase the scale by 0.5 and check again.
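The trial-and-error loop can be sketched like this in Python. This is only an illustration: `fake_run_ocr` is a hypothetical stand-in for running the OCR activity at a given scale, and `clean_token_ratio` is a crude, made-up score that replaces checking the output by eye.

```python
from typing import Callable, List, Tuple


def scale_steps(start: float = 1.0, stop: float = 5.0, step: float = 0.5) -> List[float]:
    """Scale values from start to stop inclusive: 1.0, 1.5, ..., 5.0."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def clean_token_ratio(text: str) -> float:
    """Crude quality score: share of tokens that are purely alphanumeric."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token.isalnum() for token in tokens) / len(tokens)


def pick_best_scale(run_ocr: Callable[[float], str], scales: List[float]) -> Tuple[float, float]:
    """Run OCR once per scale and keep the first scale with the highest score."""
    best_scale, best_score = scales[0], -1.0
    for scale in scales:
        score = clean_token_ratio(run_ocr(scale))
        if score > best_score:
            best_scale, best_score = scale, score
    return best_scale, best_score


# Hypothetical stand-in: pretends the text is cleanest at mid-range scales.
def fake_run_ocr(scale: float) -> str:
    if scale < 2.0:
        return "lnv0!ce N0 #@ 1O2~4"
    if scale < 3.5:
        return "Invoice No 10234"
    return "Invoice N0 1O2~4"


best = pick_best_scale(fake_run_ocr, scale_steps())
print(best)
```

The same loop could be extended to also vary the Image DPI and the profile, since no single combination is guaranteed to work for every PDF.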

Regards

Niranjan_k · January 17, 2024, 1:27pm

@vrdabberu I have a doubt about what to select for the profile. We have 3 options: Screen, Scan and Legacy. Which one should I select?

vrdabberu · January 17, 2024, 1:29pm

Hi @Niranjan_k

It will differ from PDF to PDF. So first try without setting any profile and change the Image DPI and scale. If that fails to work, change the profile to Scan and again try changing the Image DPI and scale. You need to check every combination, because we can't say in advance which scale and which profile will work.

Regards

Niranjan_k · January 18, 2024, 11:37am

@vrdabberu I'm still getting the same data, not a formatted output.

vrdabberu · January 18, 2024, 11:48am

Hi @Niranjan_k

When using Read PDF with OCR, we will not get the output in the same layout as it appears in the PDF file. Depending on the OCR engine, the data is extracted as paragraph-style text or in some other arrangement. A formatted output is only possible with the Read PDF Text activity; for images or scanned PDFs, it cannot be achieved. The scale and the other properties help in extracting the words accurately, without errors.
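The layout is lost because plain OCR text carries no position information. As a rough illustration only: if an OCR engine does expose word coordinates (an assumption, not something confirmed here for Read PDF with OCR), rows can be approximated by grouping words with a similar vertical position. The word format, the coordinates and the tolerance below are all made up.

```python
from typing import List, Tuple

# Hypothetical word format: (text, x, y) of the word's top-left corner.
Word = Tuple[str, float, float]


def group_into_rows(words: List[Word], y_tolerance: float = 5.0) -> List[List[str]]:
    """Group words whose y is within y_tolerance of the row's first word,
    then order each row from left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]


# Made-up coordinates for a two-row, two-column snippet.
words = [
    ("Qty", 200, 12),
    ("Item", 10, 10),
    ("3", 200, 51),
    ("Paper", 10, 50),
]

rows = group_into_rows(words)
print(rows)
```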

Regards
