The PDFs I’m working with are scanned, and so far no OCR engine has given completely accurate results, even though the quality of the PDFs seems great. There is no handwritten or blurred text.
I am currently using the ‘Read PDF with OCR’ activity with ‘Microsoft Azure Computer Vision OCR’ as the engine, as it gave me the best results compared to Tesseract and OmniPage. So far I have been using string manipulation and regex to get the values I need.
Now I need to extract a table, and there is no way to get it using string manipulation (that I can think of, at least). This is a representation of the format of my table (with some random example values):
Date Amount1 Amount2 Text
10/12 4.20 Example text
10/13 4.20 Example text2
10/20 5.30 Example text3
The issue is with extracting the Amount1 and Amount2 columns. In each row only one of Amount1 and Amount2 is populated, and I need to know which one it is. The result I receive from Microsoft Azure CV OCR is in this format:
Date
Amount1
AMount2
Text
10/12
4.20
Example text
10/13
4.20
Example text2
10/20
5.30
Example text3
Based on this result, there is no way for me to know which of the two Amount columns the decimal value in each row belongs to. I tried Tesseract and OmniPage as well, hoping there would at least be an extra space or something to distinguish them by, but there is not.
I tried using the ‘Results’ value from Microsoft Azure CV OCR, which is supposed to give me key-value pairs of each word and its position in the PDF (with the ‘Extract Words’ option), but I only get one key-value pair, which contains the entire text of the PDF.
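To illustrate what I am after: if each word came back with its horizontal position, I could assign every amount to the column whose header is closest to it. Below is a minimal sketch in Python. The (text, x_center) tuples, the function names and all coordinates are made up for illustration and do not reflect the actual output format of the Azure engine or of any UiPath activity.

```python
from typing import Dict, List, Tuple

# Hypothetical input: each OCR word as (text, x_center), where x_center is the
# horizontal midpoint of the word's bounding box on the page.
Word = Tuple[str, float]


def nearest_column(x: float, headers: Dict[str, float]) -> str:
    """Return the name of the header whose x position is closest to x."""
    return min(headers, key=lambda name: abs(headers[name] - x))


def assign_amounts(amounts: List[Word], headers: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair each amount with the column whose header it sits under."""
    return [(text, nearest_column(x, headers)) for text, x in amounts]


# Made-up coordinates, for illustration only.
headers = {"Amount1": 200.0, "Amount2": 300.0}
amounts = [("4.20", 205.0), ("4.20", 298.0), ("5.30", 202.0)]

result = assign_amounts(amounts, headers)
print(result)
```

With positions like these, the two 4.20 values would no longer be indistinguishable, because each one sits under a different header.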
I can’t really think of any alternative solutions to my issue, so any help will be appreciated.
I have no control over the PDFs I use or their format, and I can’t share one in this forum as they contain sensitive data.