Read Text from Specific Region

Hi,

I'm currently facing an issue where I need to extract a value from a native PDF file. I used the Read PDF Text activity and wanted to manipulate the string via regex, but because other values sit next to the one I need, the overall string becomes ambiguous. Sample below:

image

In the numbers 1 2 7 5 2 1, the 1 2 7 actually refers to the page count, which is 1 / 27 (the “/” doesn’t get picked up), while 5 2 1 is the value I need, and this value can differ. Is there a way to specify a region of the PDF to extract values from, since the position of this value is fixed?
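
To illustrate what "extract from a fixed region" means in general, here is a minimal Python sketch. It is not a UiPath API: it assumes some extraction tool can return each word together with its bounding box, and the `Word` type, the coordinates, and `VALUE_REGION` are all hypothetical values made up for the example.

```python
from typing import List, NamedTuple, Tuple

class Word(NamedTuple):
    """One extracted token with its bounding box (hypothetical layout data)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

Region = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def text_in_region(words: List[Word], region: Region, sep: str = "") -> str:
    """Join the words that lie fully inside the region, in reading order."""
    rx0, ry0, rx1, ry1 = region
    inside = [
        w for w in words
        if w.x0 >= rx0 and w.y0 >= ry0 and w.x1 <= rx1 and w.y1 <= ry1
    ]
    # Sort top-to-bottom, then left-to-right.
    inside.sort(key=lambda w: (w.y0, w.x0))
    return sep.join(w.text for w in inside)

# Made-up coordinates: the page label "1 2 7" sits above the value "5 2 1".
sample_words = [
    Word("1", 500, 20, 506, 30), Word("2", 510, 20, 516, 30), Word("7", 520, 20, 526, 30),
    Word("5", 500, 40, 506, 50), Word("2", 510, 40, 516, 50), Word("1", 520, 40, 526, 50),
]

VALUE_REGION: Region = (490, 35, 540, 55)  # assumed fixed position of the value
value = text_in_region(sample_words, VALUE_REGION)
print(value)
```

Because the page label falls outside `VALUE_REGION`, it never reaches the result, so no regex is needed to separate the two numbers.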

Hi @CC_Pet

Does the PDF itself contain the “/”?

Have you tried the Read PDF With OCR activity?

image

In the Read PDF Text activity, just set the PreserveFormatting property to True.

Regards
Gokul

Hi @CC_Pet, use the Read PDF With OCR activity with the Microsoft OCR engine.

I have tried using OCR, but the values aren’t being picked up by the engine. I'm testing Read PDF Text again and will need to use it twice, because PreserveFormatting interferes with a different value I need to extract. So I will use one instance with PreserveFormatting set to False to extract value A, and another with PreserveFormatting set to True plus some string manipulation to extract value B.

Hi @CC_Pet

Try enabling the PreserveFormatting property in Read PDF Text; that might help keep your value and the page number separate.

Cheers

Does the PDF itself contain the “/”? @CC_Pet

Sorry, I needed to edit my previous reply. I have switched to OCR using the Microsoft engine and am using the activity twice with different profile settings (one for each value to be extracted). With PreserveFormatting enabled, Read PDF Text still fails to pick up the “/”, and the spacing between the numbers makes regex/string manipulation difficult.
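
For reference, the string-manipulation route can be sketched as follows, under a strong assumption: the current page number and the total page count are known from somewhere else (for example, from the page being read and the document's page count). The function name `strip_page_prefix` is hypothetical, and the sketch is in Python rather than a UiPath expression. It also assumes the wanted value consists of digits only.

```python
import re

def strip_page_prefix(raw: str, page: int, total_pages: int) -> str:
    """Remove the fused "page / total" digits from the front of the raw text.

    Assumes `raw` looks like "1 2 7 5 2 1": the page label with its "/" lost,
    followed by the value, all separated by stray spaces.
    """
    digits = re.sub(r"\D", "", raw)   # drop spaces and any other non-digits
    prefix = f"{page}{total_pages}"   # e.g. page 1 of 27 -> "127"
    if not digits.startswith(prefix):
        raise ValueError(f"expected text to start with page label {prefix!r}")
    return digits[len(prefix):]

print(strip_page_prefix("1 2 7 5 2 1", page=1, total_pages=27))
```

Passing the page label in explicitly removes the ambiguity that a plain regex cannot resolve, since "1 2 7 5 2 1" alone does not say where the label ends and the value begins.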

You can try all the OCR engines from the image below, @CC_Pet. Can you check whether you get the desired output with any of these OCR engines?

image

Regards
Gokul