OCR PDF Title extraction (Newspaper)

I need to extract only the titles from a newspaper PDF using OCR.
So far, all I have achieved is OCR scanning the whole page and outputting all of its text.
The titles have distinctive characteristics in size and style (bold). How can this be done?


Hi @Chittesh_Sham
You can do this using the Get OCR Text activity.
Hi @jitendra_123, thanks for the reply.

I have already used the OCR text activity, but it scans everything and outputs all the text in the PDF.

I don't want that. What I want is for it to extract specific pieces of text (the titles) and skip the body text.


Hi @Chittesh_Sham
Can you share your sample PDF and the text you got in your Message Box?

You are using Read PDF with OCR. You have to use Get OCR Text. Can you attach your workflow?

Wow, I didn’t know I would get so much support. Thank you guys @jitendra_123 @samir !! ^^
Oops, yeah, I didn’t notice “Get OCR”.

Please find the PDF and my workflow file in the following drive.

What I want to capture are the titles only, e.g. on page 5. It is important that this works without manual assistance, as I will have to repeat the process across hundreds of pages.

The OCR should output only the titles (marked in red).

I am afraid you can't differentiate a heading from normal text using OCR [Google / Microsoft], especially in a newspaper, where titles can appear anywhere on a page.
I think ABBYY FlexiCapture OCR has a built-in feature to detect the font type, but it is a paid product.

You may also explore the new ‘Computer Vision’ activities in UiPath Studio 2019.1. I am not sure whether they will help.
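
If a paid engine is not an option, one workaround worth trying is a size heuristic outside the OCR activity: some engines return a bounding box for every word, and title lines are usually much taller than body lines. The Python sketch below is only an illustration under assumptions. The input dict mirrors the shape that pytesseract's `image_to_data(..., output_type=Output.DICT)` returns, the function name `extract_titles` and the `ratio` threshold are made up for this example, and it looks at size only, so it does not detect bold text.

```python
from collections import defaultdict
from statistics import median


def extract_titles(ocr, ratio=1.8):
    """Return text lines whose word height is well above the page median.

    `ocr` is a dict of parallel lists with the keys "text", "height",
    "block_num", "par_num" and "line_num" (assumed shape, modelled on
    pytesseract's image_to_data dict output).
    `ratio` is a hypothetical tuning knob, not a recommended value.
    """
    lines = defaultdict(list)  # (block, paragraph, line) -> [(word, height)]
    for i, word in enumerate(ocr["text"]):
        if word.strip():  # skip empty entries that only describe layout
            key = (ocr["block_num"][i], ocr["par_num"][i], ocr["line_num"][i])
            lines[key].append((word, ocr["height"][i]))

    if not lines:
        return []

    # Median word height per line; the median over all lines approximates
    # the body text size, because body lines outnumber title lines.
    line_heights = {k: median(h for _, h in ws) for k, ws in lines.items()}
    body_height = median(line_heights.values())

    # Keep only the lines that are clearly taller than the body text.
    return [
        " ".join(w for w, _ in lines[k])
        for k in sorted(lines)
        if line_heights[k] >= ratio * body_height
    ]
```

With pytesseract installed, the input could come from `pytesseract.image_to_data(page_image, output_type=pytesseract.Output.DICT)`, where `page_image` is one PDF page rendered to an image. The threshold will need tuning per newspaper, and subheadings or pull quotes may slip through.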


Hi @FebinKAndrews, thanks for so much useful information. I will definitely check those out :)