Digitize document - potential new feature: preserve PDF Formatting

yrobert · January 25, 2021, 12:32pm

Hi all,

Something I use for document data extraction from PDFs is the PreserveFormatting option of the ReadPDFText activity. It is great for extracting fields that have a special layout (fixed spacing between them, or similar).

Unfortunately, when I want to use the Validation Station, I need to use the Digitize Document activity with an OCR engine (it doesn’t use OCR if the file is a native PDF).
BUT there is no “PreserveFormatting” option, so it messes up all the numbers for some document structures (the order is no longer correct, and so on).

So I think a “PreserveFormatting” option for native PDFs would be a great feature to have.

For the moment, I have not found any workaround, because I can’t feed the Create document validation activity any extracted text that doesn’t come from Digitize Document (otherwise, it says the DOM doesn’t match the extracted text).

loginerror · January 26, 2021, 10:16am

Thank you for your suggestion. Our team will consider it for future releases.

yusuf_aziz · April 11, 2022, 4:22pm

Hello, was this great idea ever added?

yrobert · April 15, 2022, 9:56am

I haven’t noticed. @loginerror ?

loginerror · April 15, 2022, 4:15pm

Hey @yrobert

The latest version of the IntelligentOCR package has an extra field that allows you to control this behaviour a bit better:

You can now use the Digitize Document activity without OCR whenever needed.

yrobert · April 19, 2022, 6:39am

Hi,
I saw that feature, but it doesn’t help or change anything.

loginerror · April 28, 2022, 4:27pm

Could you maybe show some examples of documents and processes that you still struggle with? This would help us better understand the issue.

yrobert · April 29, 2022, 12:41pm

Well, imagine various documents with multiple types of values/fields in a more or less organized format. Sometimes values can be missing, so it is important to know the ‘position’ of the text.
In version 1 of the robot, there was no IntelligentOCR and we did it with a lot of regexes (10 types of documents, 50 fields, each with its own regex, although some are the same; all these regexes are saved in a table). It worked quite well except for some cases.
Then Document Understanding came along, and we wanted to use it to handle the cases where v1 of the robot failed. But because the text output of IntelligentOCR is NOT the same as the output of Read PDF Text with ‘preserve formatting’ set to true, many of the regexes are no longer valid. So I had to find some workarounds.

Therefore, I thought it would be quite useful to have a ‘preserve formatting’ option in IntelligentOCR as well. (And in some cases, without preserve formatting, it is not possible to extract the value at all.)
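
To illustrate why the position of the text matters when a value can be missing, here is a minimal Python sketch. It is not UiPath code and not the regexes from the actual process; the header, the row, and the helper names `columns_from_header` and `read_row` are all hypothetical. It only assumes that a layout-preserving read keeps each value under its column header, while a flattened read collapses the spacing.

```python
import re

# Hypothetical layout-preserving output: each column is 12 characters wide,
# and the READING value is missing in this row.
header = "DATE".ljust(12) + "READING".ljust(12) + "USAGE"
row = "10/11/2022".ljust(12) + "".ljust(12) + "475000"


def columns_from_header(header_line):
    """Map each header word to the (start, end) character span of its column."""
    starts = [(m.group(), m.start()) for m in re.finditer(r"\S+", header_line)]
    spans = {}
    for i, (name, start) in enumerate(starts):
        # A column runs until the next header starts; the last one is open-ended.
        end = starts[i + 1][1] if i + 1 < len(starts) else None
        spans[name] = (start, end)
    return spans


def read_row(row_line, spans):
    """Slice a row by column span, so a missing value comes back as ''."""
    return {name: row_line[start:end].strip() for name, (start, end) in spans.items()}


spans = columns_from_header(header)
by_position = read_row(row, spans)

# The same row with the spacing collapsed, as a flattened read might return it.
flattened = " ".join(row.split())
# Pairing tokens with headers in order silently assigns the usage to READING.
naive = dict(zip(header.split(), flattened.split()))

print(by_position)
print(naive)
```

With the spacing preserved, the missing READING shows up as an empty string. Once the spacing is collapsed, there is no way to tell which of the three fields was absent.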

Monica_Secelean · May 5, 2022, 10:53am

Hello @yrobert

My intuition is that you actually found a bug. In theory, the Digitize activity should preserve the formatting, and you shouldn’t be encountering the faulty behaviour. Could you share a workflow/document and the steps to reproduce, so that we can look into your issue?

Thank you,
Monica

Mombas · May 14, 2022, 12:12am

In my case, for one process I’m working on, the Digitize Document text with OCR does not keep the same line structure as the text from a native PDF read.
This leads to extraction failures in a few exceptional files.

Rahul_Unnikrishnan · May 14, 2022, 6:46am

Hello @Mombas ,

Could you please explain your requirement with some screenshots if you need assistance with your automation? If you can give some insight into your scenario, maybe we can help you.

Monica_Secelean · May 14, 2022, 8:51am

@Mombas would you be able to also share the documents/workflow so that we can reproduce the issue?

Ashish_Agrawal · November 7, 2022, 8:42pm

Hi Rahul,
I am having a similar issue. Digitize Document does not preserve the PDF format. Attached are two files, one read with the native PDF reader and the other with Digitize Document. The table is messed up.
I tried using Digitize Document with and without OCR. Same result.

yrobert · November 8, 2022, 8:12am

I can’t see your attached PDF

Ashish_Agrawal · November 8, 2022, 6:05pm

Please see below.
Read by native Read PDF Text

                PREVIOUS
DATE             READING          DATE               READING            USAGE

10/11/2022 6817000 9/13/2022 6342000 475000 WATER BILLED 1,725.78
SEWER BILLED 2,479.52
100W MERCURY VAPOR L 190.00
SALES TAX FIXED ELC 13.30
WOODEN POLE 57.00
SALES TAX FIXED ELC 3.99
UNDERGROUND POWER-A/ 47.50
SALES TAX FIXED ELC 3.33
CURRENT BILL $4,520.42
AMOUNT DUE $4,520.42
AMOUNT DUE AFTER 12/03/2022 $4,588.23
DATE READING DATE READING USAGE 10/11/2022 6817000 9/13/2022 6342000 475000 WATER BILLED 1,725.78

Read by Digitize PDF
SEWER BILLED 100W MERCURY VAPOR L SALES TAX FIXED ELC WOODEN POLE SALES TAX FIXED ELC UNDERGROUND POWER-A/ SALES TAX FIXED ELC

CURRENT BILL AMOUNT DUE

AMOUNT DUE AFTER 12/03/2022

2,479.52 190.00 13.30 57.00 3.99 47.50 3.33

$4,520.42 $4,520.42

$4,588.23

Please contact our Customer Service department at 910-671-3800 for billing inquiries.
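
The difference between the two outputs above can be shown with a short Python sketch. This is only an illustration under stated assumptions, not how either UiPath activity works internally: `parse_line_items` is a hypothetical helper that expects each charge on its own line as a label followed by an amount. The sample strings are copied from the line items and amounts in the two reads quoted in this post.

```python
import re

# An amount such as 2,479.52 or $4,520.42.
AMOUNT = re.compile(r"\$?[\d,]+\.\d{2}")

# Line items as returned by the native read (copied from the post).
native_text = """SEWER BILLED 2,479.52
100W MERCURY VAPOR L 190.00
SALES TAX FIXED ELC 13.30
WOODEN POLE 57.00
SALES TAX FIXED ELC 3.99
UNDERGROUND POWER-A/ 47.50
SALES TAX FIXED ELC 3.33"""

# The same items as returned by Digitize Document (copied from the post):
# all labels on one line, all amounts on another.
digitized_text = """SEWER BILLED 100W MERCURY VAPOR L SALES TAX FIXED ELC WOODEN POLE SALES TAX FIXED ELC UNDERGROUND POWER-A/ SALES TAX FIXED ELC

2,479.52 190.00 13.30 57.00 3.99 47.50 3.33"""


def parse_line_items(text):
    """Return (label, amount) pairs for lines shaped like 'LABEL 1,234.56'."""
    items = []
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated token
        if len(parts) != 2:
            continue  # blank line or single token
        label, last = parts
        # Accept the line only if it ends in an amount and the label has letters.
        if AMOUNT.fullmatch(last) and any(ch.isalpha() for ch in label):
            items.append((label, last))
    return items


native_items = parse_line_items(native_text)
digitized_items = parse_line_items(digitized_text)

print(len(native_items), native_items[0])
print(digitized_items)
```

A line-based rule like this finds all seven charges in the native read and none in the digitized text, because the labels and the amounts are no longer on the same line.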

Monica_Secelean · May 16, 2023, 10:08am

@Ashish_Agrawal I don’t see your file/how to reproduce your issue - can you help?

CristianDan · February 15, 2024, 8:28am

Hi @loginerror, I can’t manage to find this feature in the latest version of the Digitize Document activity.
I tried to use the activity without an OCR engine added, and it throws an error (screenshot added below).
Can you give me an example of using this activity without OCR?

Thanks
Cristian
