I have to extract only the titles from a newspaper in PDF format using OCR.
So far, all I have achieved is running OCR on the whole page and getting the full text as output.
The titles have unique characteristics in size and style (bold). How can this be done?
What I want to capture are the titles only, for example on page 5. It is important that this works in an unassisted way, as I will have to repeat the process across hundreds of pages.
I am afraid that you can't differentiate a heading from normal text using the Google or Microsoft OCR engines, especially in a newspaper, where titles can appear anywhere on the page.
I think ABBYY FlexiCapture OCR has a built-in feature to detect font type and size, but it is a paid product.
You may also explore the latest 'Computer Vision' activities in UiPath Studio 2019.1. I am not sure whether they will help or not.
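Since the titles are set larger than the body text, one possible workaround is a height heuristic. It only works if your OCR engine returns a bounding box for each word (Tesseract can do this, for example). What follows is only a minimal sketch, not a tested solution for newspapers. The input shape (a list of dicts with 'text', 'height' and 'line' keys), the function name find_title_lines and the 1.5 ratio are all assumptions made for illustration, and the sketch does not detect bold at all.

```python
from collections import defaultdict
from statistics import median


def find_title_lines(words, ratio=1.5):
    """Return the text of lines whose words are much taller than typical body text.

    `words` is a hypothetical structure: a list of dicts with keys
    'text' (str), 'height' (box height in pixels) and 'line' (a sortable line id).
    `ratio` is an assumed threshold and will need tuning per newspaper.
    """
    # Ignore empty OCR tokens.
    words = [w for w in words if w["text"].strip()]
    if not words:
        return []

    # Body text dominates a page, so the page-wide median approximates body height.
    body_height = median(w["height"] for w in words)

    # Group words by the line they belong to.
    lines = defaultdict(list)
    for w in words:
        lines[w["line"]].append(w)

    titles = []
    for key in sorted(lines):
        line_words = lines[key]
        # A line counts as a title if its typical word is much taller than body text.
        if median(w["height"] for w in line_words) >= ratio * body_height:
            titles.append(" ".join(w["text"] for w in line_words))
    return titles
```

You would call it once per page with the word boxes from your OCR output and collect the returned strings. Pages that are mostly headlines or advertisements will skew the median, so check the results on a few sample pages before running it on hundreds.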