Is there a way to extract a table accurately from PDF with OCR

M_Kr · October 12, 2023, 7:35am

The PDFs I’m working with are scanned, and so far no OCR engine has given completely accurate results, even though the scan quality appears to be very good. There is no handwritten or blurred text.

I am currently using ‘Read PDF with OCR’ activity with ‘Microsoft Azure Computer Vision OCR’ as an engine, as that engine gave me the best results compared to Tesseract and OmniPage. So far I was using string manipulation and regex to get the values I need.

There is no way to get this table using string manipulation (that I can think of at least). This is a representation of the format of my table (with some random example values):

Date        Amount1        Amount2        Text
10/12       4.20                          Example text
10/13       4.20                          Example text2
10/20                      5.30           Example text3

The issue is with extraction of the Amount1 and Amount2 columns. For each row, either Amount1 or Amount2 is populated, and I need to know which one. The result I receive from Microsoft Azure CV OCR is in this format:

Date
Amount1
AMount2
Text
10/12
4.20
Example text
10/13
4.20
Example text2
10/20
5.30
Example text3

Based on this result, there is no way for me to know which of the two Amount columns the decimal value in each row belongs to. I tried Tesseract and OmniPage as well, hoping there would at least be an extra space or something else to distinguish them by, but there is not.

I tried using the ‘Results’ output of Microsoft Azure CV OCR, which is supposed to give me key-value pairs of each word and its position in the PDF (with ‘Extract Words’ enabled), but I only get one key-value pair, which contains the entire text of the PDF.
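
To illustrate what per-word positions would make possible, here is a minimal Python sketch. It assumes each word comes back as a (text, x, y) tuple and that the x position of each column header is known. The record layout, the function name assign_columns, and the sample coordinates are all made up for illustration and are not the actual output of the activity.

```python
from typing import Dict, List, Tuple

# Hypothetical word record: (text, x_left, y_top) in page units.
Word = Tuple[str, float, float]

def assign_columns(words: List[Word], headers: Dict[str, float],
                   row_tol: float = 5.0) -> List[Dict[str, str]]:
    """Group words into rows by y, then place each word in the column
    whose header x position is nearest to the word's x position."""
    lines = []  # each entry: [reference y, list of words on that line]
    for word in sorted(words, key=lambda w: w[2]):
        if lines and abs(word[2] - lines[-1][0]) <= row_tol:
            lines[-1][1].append(word)
        else:
            lines.append([word[2], [word]])

    table = []
    for _, line_words in lines:
        row = {name: "" for name in headers}
        # Left-to-right order keeps multi-word cells readable.
        for text, x, _ in sorted(line_words, key=lambda w: w[1]):
            column = min(headers, key=lambda name: abs(headers[name] - x))
            row[column] = (row[column] + " " + text).strip()
        table.append(row)
    return table

# Made-up coordinates mirroring the example table above.
header_x = {"Date": 10.0, "Amount1": 100.0, "Amount2": 200.0, "Text": 300.0}
sample_words = [
    ("10/12", 10.0, 50.0), ("4.20", 100.0, 50.0),
    ("Example", 300.0, 50.0), ("text", 350.0, 50.0),
    ("10/20", 10.0, 90.0), ("5.30", 200.0, 91.0),
    ("Example", 300.0, 90.0), ("text3", 350.0, 90.0),
]
table = assign_columns(sample_words, header_x)
for row in table:
    print(row)
```

Matching on the left edge only suits left-aligned columns. If the amounts are right-aligned, comparing right edges or box centers would probably be more reliable.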

I can’t really think of any alternative solutions to my issue, so any help will be appreciated.

I have no control over the PDFs I receive or their format, and I can’t share one in this forum as they contain sensitive data.
