Digitize document - potential new feature: preserve PDF Formatting

Hi all,

Something I use for document data extraction from PDFs is the PreserveFormatting option of the ReadPDFText activity. It is great for extracting fields that have a special layout (delimited spacing between them, or other…)

Unfortunately, when I want to use the Validation Station, I need to use the Digitize Document activity with an OCR engine (it doesn’t use OCR if the file is a native PDF).
BUT there is no “PreserveFormatting” option there, so it messes up all the numbers for some document structures (the order is no longer correct, and so on).

So I think adding a “PreserveFormatting” option for native PDFs would be a great feature to have.

For the moment, I have not found any workaround, because I can’t feed text that doesn’t come from Digitize Document into Create Document Validation (otherwise, it says the DOM doesn’t match the extracted text).


Thank you for your suggestions, our team will consider it for future releases :)

Hello, was this great idea ever added?

I haven’t noticed. @loginerror ?

Hey @yrobert

The latest version of the IntelligentOCR package has this extra field that allows you to control this behaviour a bit better:
[image]

You can now use the Digitize Document activity without OCR whenever needed.


Hi,
I saw that feature, but it doesn’t help or change anything.

Could you maybe show some examples of documents and processes that you still struggle with? This would help us better understand the issue.

Well, imagine various documents with multiple types of values/fields in a more or less organized format. Sometimes values can be missing, so it is important to know the ‘position’ of the text.
In version 1 of the robot, there was no IntelligentOCR and we did it with a lot of regexes (10 types of documents, 50 fields, each with its own regex, although some are the same; all these regexes are saved in a table). It worked quite well except for some cases.
Then came Document Understanding, and we wanted to use it to handle the cases where v1 of the robot failed. But because the output of IntelligentOCR is NOT the same as the output of Read PDF with ‘preserve formatting’ set to true, multiple regexes are no longer valid. So I had to find some workarounds.

Therefore, I thought it would be quite useful to have a ‘preserve formatting’ option in IntelligentOCR as well. (In some cases, without preserve formatting, it is not possible to extract the value at all.)
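
To make the point about ‘position’ concrete, here is a minimal Python sketch (not UiPath code). The column names, the 12-character width and the sample row are all made up for illustration; it only assumes that a layout-preserving read keeps values aligned under their headers. When one value is missing, slicing by header offsets still finds the gap, while whitespace-collapsed text silently shifts the later values into the wrong fields.

```python
import re

WIDTH = 12  # hypothetical column width of the layout-preserved text


def fixed_width(*cells: str) -> str:
    """Build one line the way a layout-preserving read might emit it."""
    return "".join(cell.ljust(WIDTH) for cell in cells).rstrip()


header = fixed_width("ITEM", "GROSS", "NET", "TAX")
row = fixed_width("A-200", "240.00", "", "40.00")  # the NET value is missing


def parse_by_position(header: str, row: str) -> dict:
    """Slice the row at the offsets where the header words start."""
    words = list(re.finditer(r"\S+", header))
    starts = [m.start() for m in words]
    ends = starts[1:] + [None]
    return {
        m.group(): (row[start:end].strip() or None)
        for m, start, end in zip(words, starts, ends)
    }


def parse_collapsed(header: str, row: str) -> dict:
    """Same row once whitespace runs are collapsed: positions are gone."""
    return dict(zip(header.split(), row.split()))


by_position = parse_by_position(header, row)
collapsed = parse_collapsed(header, row)
print(by_position)  # NET is None, TAX is "40.00"
print(collapsed)    # "40.00" lands under NET and TAX disappears
```

This is why a regex written against the preserved layout can stop working on text where the spacing has been normalized.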

Hello @yrobert

My intuition is that you have actually found a bug. In theory, the Digitize activity should preserve the formatting and you shouldn’t be seeing this faulty behaviour. Could you share the workflow/document and the steps to reproduce, so that we can look into your issue?

Thank you,
Monica

In my case, for one process I’m working on, the Digitize Document text w/ OCR does not keep the same line structure as the text from a native PDF read.
This leads to extraction failures in a few exceptional files.

Hello @Mombas ,

Could you please explain your requirement with some screenshots if you need assistance with your automation? If you can give some insight into your scenario, maybe we can help you.

@Mombas would you be able to also share the documents/workflow so that we can reproduce the issue?

Hi Rahul,
I am having a similar issue: Digitize Document does not preserve the PDF formatting. Attached are two files, one read with the native PDF reader and the other with Digitize Document. The table is messed up.
I tried using Digitize Document with and without OCR, with the same result.

I can’t see your attached PDF

Please see below.
Read by native Read PDF Text

                PREVIOUS
DATE             READING          DATE               READING            USAGE

10/11/2022 6817000 9/13/2022 6342000 475000 WATER BILLED 1,725.78
SEWER BILLED 2,479.52
100W MERCURY VAPOR L 190.00
SALES TAX FIXED ELC 13.30
WOODEN POLE 57.00
SALES TAX FIXED ELC 3.99
UNDERGROUND POWER-A/ 47.50
SALES TAX FIXED ELC 3.33
CURRENT BILL $4,520.42
AMOUNT DUE $4,520.42
AMOUNT DUE AFTER 12/03/2022 $4,588.23
DATE READING DATE READING USAGE 10/11/2022 6817000 9/13/2022 6342000 475000 WATER BILLED 1,725.78

Read by Digitize PDF
SEWER BILLED 100W MERCURY VAPOR L SALES TAX FIXED ELC WOODEN POLE SALES TAX FIXED ELC UNDERGROUND POWER-A/ SALES TAX FIXED ELC

CURRENT BILL AMOUNT DUE

AMOUNT DUE AFTER 12/03/2022

2,479.52 190.00 13.30 57.00 3.99 47.50 3.33

$4,520.42 $4,520.42

$4,588.23

Please contact our Customer Service department at 910-671-3800 for billing inquiries.
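
To illustrate why the line structure matters for regex-based extraction, here is a minimal Python sketch (not UiPath code). The two text blocks are copied from the outputs above; the pattern and the `extract_pairs` helper are hypothetical, written only for this example. A line-anchored "label, then amount" pattern finds every charge in the native read, but nothing in the digitized text, because labels and amounts no longer share a line.

```python
import re

# Lines copied from the native Read PDF Text output above.
native = "\n".join([
    "SEWER BILLED 2,479.52",
    "100W MERCURY VAPOR L 190.00",
    "SALES TAX FIXED ELC 13.30",
    "WOODEN POLE 57.00",
    "SALES TAX FIXED ELC 3.99",
    "UNDERGROUND POWER-A/ 47.50",
    "SALES TAX FIXED ELC 3.33",
    "CURRENT BILL $4,520.42",
    "AMOUNT DUE $4,520.42",
    "AMOUNT DUE AFTER 12/03/2022 $4,588.23",
])

# Lines copied from the Digitize Document output above.
digitized = "\n".join([
    "SEWER BILLED 100W MERCURY VAPOR L SALES TAX FIXED ELC WOODEN POLE "
    "SALES TAX FIXED ELC UNDERGROUND POWER-A/ SALES TAX FIXED ELC",
    "",
    "CURRENT BILL AMOUNT DUE",
    "",
    "AMOUNT DUE AFTER 12/03/2022",
    "",
    "2,479.52 190.00 13.30 57.00 3.99 47.50 3.33",
    "",
    "$4,520.42 $4,520.42",
    "",
    "$4,588.23",
])

# Hypothetical pattern: an upper-case label followed by an amount
# with two decimals, both on the same line.
PAIR = re.compile(
    r"^(?P<label>[A-Z0-9][A-Z0-9 /-]*?)[ \t]+(?P<amount>\$?[\d,]+\.\d{2})$",
    re.MULTILINE,
)


def extract_pairs(text: str) -> list:
    """Return (label, amount) tuples for every line that matches PAIR."""
    return [(m.group("label"), m.group("amount")) for m in PAIR.finditer(text)]


native_pairs = extract_pairs(native)
digitized_pairs = extract_pairs(digitized)
print(len(native_pairs), "pairs from the native read")
print(len(digitized_pairs), "pairs from the digitized text")
```

Any workaround on the digitized text would have to re-pair labels and amounts by order alone, which breaks as soon as one value is missing.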

@Ashish_Agrawal I don’t see your file/how to reproduce your issue - can you help?

Hi @loginerror, I can’t manage to find this feature in the latest version of the Digitize Document activity.
I tried to use the activity without an OCR engine added and it errors (added screenshot below).
Can you give me an example of using this activity without OCR?

Thanks
Cristian