I'm facing an issue where I need to extract a value from a native PDF file. I used Read PDF Text and wanted to manipulate the resulting string with regex, but because other values sit right next to the one I need, the overall string becomes ambiguous. Sample below:
In the numbers 1 2 7 5 2 1, the 1 2 7 actually refers to the page count, 1 / 27 (the “/” doesn’t get picked up), while 5 2 1 is the value I need, and this value can differ. Is there a way to specify a region of the PDF to extract values from, since the position of this value is fixed?
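As a rough illustration of one workaround at the string level (not a region-based extraction): if the current page number and the total page count are already known from elsewhere, the leading page digits can be stripped from the extracted run. This is only a sketch in Python; the function name `strip_page_prefix` and its parameters are hypothetical, and it assumes the page indicator always comes first in the extracted text.

```python
import re


def strip_page_prefix(raw: str, page_number: int, page_count: int) -> str:
    """Return the digits that follow the 'page/total' indicator.

    Assumes (hypothetically) that the extracted text starts with the page
    number and the page count, with the "/" between them lost in extraction.
    """
    # Drop spaces and any other non-digit characters, e.g. "1 2 7 5 2 1" -> "127521".
    digits = re.sub(r"\D", "", raw)
    # Rebuild the page indicator as it appears once the "/" is missing.
    prefix = f"{page_number}{page_count}"
    if not digits.startswith(prefix):
        raise ValueError(f"expected text to start with {prefix!r}, got {digits!r}")
    # Whatever remains after the page indicator is the value of interest.
    return digits[len(prefix):]


print(strip_page_prefix("1 2 7 5 2 1", page_number=1, page_count=27))  # 521
```

The same idea should carry over to a workflow as a plain string operation, as long as the page number and page count are available as variables.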
I have tried using OCR, but the values aren’t being picked up by the engine. I'm currently testing Read PDF Text again and will need to use it twice, because PreserveFormatting interferes with a different value that I also need to extract. So I will use one instance with PreserveFormatting set to False to extract value A, and another with PreserveFormatting set to True plus some string manipulation to extract value B.
Sorry, I needed to edit my previous reply. I have switched to OCR using the Microsoft engine and am using the activity twice with different profile settings (one for each value I need to extract). With PreserveFormatting enabled, Read PDF Text is still unable to pick up the “/”, and the spacing between the numbers makes regex/string manipulation difficult.