Digitize Document Putting Multiple Lines in PDF Onto One Line

Alex_Marasco · July 15, 2020, 4:00pm

I am using the digitize document activity to digitize a PDF’s first page that has this format:

I wasn’t sure why the Regex Extractor wasn’t recognizing new lines until I wrote the digitization output to a text file and found the output of the first page text looked like this (ignore mouse cursor before line 17):

I created a taxonomy for the first page with all the fields I need to extract.

Is there a way to fix this? On every other page, newlines and bullet points come through fine. Only the first page does this. I don’t think the taxonomy is the reason.

tudor.serban · July 15, 2020, 7:42pm

Hi @Alex_Marasco: is this a native PDF? Have you tried using the Force OCR flag on Digitize Document?

Alex_Marasco · July 16, 2020, 1:24pm

Yes, it is. What is the Force OCR flag and how do I use it?

tudor.serban · July 16, 2020, 2:03pm

On the Digitize Document activity you have a flag that you can check called Force OCR. This will cause Digitize Document to treat even native PDFs as images, thus potentially improving the results of digitization in cases such as this one.
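
To see why collapsed line breaks defeat a line-based pattern, here is a minimal Python sketch. It is not UiPath code: the sample text, field names, and helper functions are made up for illustration, and the Regex Extractor may behave differently in detail. It only shows the general effect and a quick way to check a dumped text file for missing newlines.

```python
import re

# Hypothetical sample text (not from the original PDF).
expected = "Name: Jane Doe\nPolicy: 12345\nDate: 2020-07-15"
# Simulate a digitization result where line breaks became spaces.
collapsed = expected.replace("\n", " ")

# Line-anchored pattern: the value after "Policy:" up to the end of that line.
POLICY = re.compile(r"^Policy:\s*(.+)$", re.MULTILINE)


def extract_policy(text):
    """Return the captured policy value, or None if the pattern does not match."""
    match = POLICY.search(text)
    return match.group(1) if match else None


def line_stats(text):
    """Return (number of lines, length of the longest line) for a text dump."""
    lines = text.splitlines()
    return len(lines), max((len(line) for line in lines), default=0)


if __name__ == "__main__":
    print(extract_policy(expected))   # matches on its own line
    print(extract_policy(collapsed))  # no match: "Policy:" is no longer at a line start
    print(line_stats(expected))
    print(line_stats(collapsed))      # one very long line signals lost newlines
```

If the text dump of a page comes back as a single very long line, the line anchors have nothing to match against, which is consistent with the behaviour described above. Re-running the check after enabling Force OCR would show whether the line breaks are restored.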
