OCR PDF Title extraction (Newspaper)

Hello everyone,

I have to extract only the titles from a newspaper in PDF format using OCR.
So far, all I have achieved is OCR scanning the whole page and outputting all of its text.
The titles have distinctive characteristics in size and style (bold). How can I extract just those?

Thanks

Hi @Chittesh_Sham
You can do this using the Get OCR Text activity.
Thanks & Regards

Hi @jitendra_123, thanks for the reply.

I have already used the OCR text activity, but it scans everything and outputs all of the text in the PDF.

I don't want that. What I want is to extract only specific pieces of text (the titles) and ignore the body text.

Cheers,

Hi @Chittesh_Sham
Can you share your sample PDF and the text you got in your Message Box?

You are using Read PDF with OCR. You have to use Get OCR Text. Can you attach your workflow?

Wow, I didn't know I would get so much support. Thank you guys, @jitendra_123 @samir! ^^
Oops, yeah, I didn't notice "Get OCR".

You can find the PDF and my workflow file in the following Drive folder:
https://drive.google.com/drive/folders/1t0XKySmq8CMCfsqyWtTpiofMbiFfYaM0

What I want to capture are the titles only, e.g. on page 5. It is important that this works in a non-assisted way, as I will have to replicate the process across hundreds of pages.

The OCR should output only the titles (in red).

I am afraid that you can't differentiate a heading from normal text using OCR (Google / Microsoft), especially in a newspaper, where titles can appear anywhere on the page.
I think ABBYY FlexiCapture OCR has a built-in feature to detect the font type. It is a paid product.

You may explore the latest "Computer Vision" activity in UiPath Studio 2019.1. I am not sure whether it will help or not.
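
Another possible workaround, offered only as a rough sketch and not as a tested solution: if the OCR engine returns a bounding box for every word, text size can be approximated from the box height, and lines that are clearly taller than the typical body line can be treated as title candidates. The Python below assumes word data in a dict-of-lists layout with `text`, `height`, `block_num`, `par_num` and `line_num` fields, which is the layout Tesseract produces through pytesseract's `image_to_data(..., output_type=Output.DICT)`. The function names, the `ratio` threshold of 1.5 and the sample values are all made up for illustration, and this does not detect bold type, only size.

```python
from statistics import median


def rows_from_dict(data):
    """Turn a dict of parallel lists (one entry per OCR word) into a list of dicts."""
    keys = list(data.keys())
    return [dict(zip(keys, values)) for values in zip(*(data[k] for k in keys))]


def find_titles(words, ratio=1.5):
    """Return text lines whose height is at least `ratio` times the typical line height.

    `ratio` is a hypothetical tuning value, not something the OCR engine provides.
    """
    # Group non-empty words into lines using the OCR layout indices.
    lines = {}
    for w in words:
        if not str(w["text"]).strip():
            continue
        key = (w["block_num"], w["par_num"], w["line_num"])
        lines.setdefault(key, []).append(w)
    if not lines:
        return []

    # Use the tallest word box as the line height; short words such as "a"
    # have small boxes and would otherwise drag the estimate down.
    heights = {key: max(w["height"] for w in ws) for key, ws in lines.items()}

    # Assume most lines on the page are body text, so the median is the body size.
    body_height = median(heights.values())

    return [
        " ".join(str(w["text"]) for w in ws)
        for key, ws in lines.items()
        if heights[key] >= ratio * body_height
    ]


# Invented sample data standing in for real OCR output of one page.
sample = {
    "text": ["Budget", "approved", "The", "council", "voted", "on", "Monday", "evening"],
    "height": [40, 42, 16, 14, 15, 12, 16, 15],
    "block_num": [1, 1, 2, 2, 2, 2, 2, 2],
    "par_num": [1, 1, 1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 1, 2, 2, 3, 3],
}

titles = find_titles(rows_from_dict(sample))
print(titles)
```

On real newspaper scans the threshold would need tuning per publication, and subheadings or pull quotes may be picked up as well, so the output is best treated as a list of candidates.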

Hi @FebinKAndrews, thanks for all the useful information. I will definitely check those out :)