Digitize document - potential new feature: preserve PDF Formatting

Hi all,

Something I use for data extraction from PDF documents is the PreserveFormatting option of the ReadPDFText activity. It is great for extracting fields that have a special layout (delimited spacing between them or other…)

Unfortunately, when I want to use the Validation Station, I need to use the Digitize Document activity with an OCR engine (it doesn’t use OCR if the file is a native PDF).
But there is no “PreserveFormatting” option there, so for some document structures it messes up all the numbers (the order is no longer correct, and so on).

So I think adding a “PreserveFormatting” option for native PDFs would be a great feature to have.

For the moment, I have not found a workaround, because I can’t feed Create document validation any extracted text that doesn’t come from Digitize document (otherwise, it says the DOM doesn’t match the extracted text).


Thank you for your suggestions, our team will consider it for future releases :slight_smile:

Hello, was this great idea ever added?

I haven’t noticed. @loginerror ?

Hey @yrobert

The latest version of the IntelligentOCR package has this extra field that allows you to control this behaviour a bit better:

You can now use the Digitize Document activity without OCR whenever needed.


I saw that feature, but it doesn’t help or change anything.

Could you maybe show some examples of documents and processes that you still struggle with? This would help us better understand the issue.

Well, imagine various documents with multiple types of values/fields in a more or less organized format. Sometimes values can be missing, so it is important to know the ‘position’ of the text.
In version 1 of the robot, there was no IntelligentOCR and we did it with a lot of regexes (10 types of documents, 50 fields, each with its own regex, though some are the same; all these regexes are saved in a table). It worked quite well except for some cases.
Then Document Understanding came along and we wanted to use it to handle the cases where v1 of the robot failed. But because the OCR output of IntelligentOCR is NOT the same as the output of Read PDF with ‘preserve formatting’ set to true, several of the regexes are no longer valid. So I had to find some workarounds.

Therefore, I thought it would be quite useful to have a ‘preserve formatting’ option in IntelligentOCR as well (in some cases, without preserve formatting, it is not possible to extract the value at all).
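
To illustrate why the ‘position’ matters, here is a minimal Python sketch (not UiPath code). Everything in it is an assumption made for illustration: the 10-character column width, the field names `net`, `tax` and `total`, and the sample values. It only mimics the general idea of a layout-preserving read versus a text where the spacing has been collapsed.

```python
import re

COLUMN_WIDTH = 10  # hypothetical fixed column width, for illustration only

# Position-dependent regex: the first two fields are exactly 10 characters wide.
ROW = re.compile(r"^(?P<net>.{10})(?P<tax>.{10})(?P<total>.+)$")


def parse_row(line):
    """Return the fields of one row, or None if the layout does not match."""
    match = ROW.match(line)
    if match is None:
        return None
    # A column that contains only spaces is reported as a missing value.
    return {name: (value.strip() or None)
            for name, value in match.groupdict().items()}


def layout_line(values):
    """Build a line the way a layout-preserving read might return it."""
    return "".join(value.ljust(COLUMN_WIDTH) for value in values).rstrip()


# The middle value is missing, but its column is still there as spaces.
preserved = layout_line(["10.00", "", "12.50"])
# Same row once runs of spaces are collapsed: the empty column disappears.
collapsed = " ".join(preserved.split())

print(parse_row(preserved))  # {'net': '10.00', 'tax': None, 'total': '12.50'}
print(parse_row(collapsed))  # None
```

With the spacing preserved, the regex can tell that the middle value is the missing one. Once the spacing is collapsed, the same regex no longer matches, and nothing in the text says which of the three values is absent.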

Hello @yrobert

My intuition is that you actually found a bug: in theory, the Digitize activity should preserve the formatting and you shouldn’t be encountering the faulty behaviour. Could you share the workflow/document and steps to reproduce, so that we can look into your issue?

Thank you,

In my case, for one process I’m working on, the Digitize Document text with OCR does not keep the same line structure as the text from a native PDF read.
This leads to extraction failures in a few exceptional files.
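
One way to show the difference is to diff the two texts line by line. This is only a rough Python sketch with made-up sample text; in practice the two strings would be the text from the native PDF read and the text from the digitized document, saved to files or pasted in.

```python
import difflib


def line_structure_diff(native_text, digitized_text):
    """Return a unified diff of the two texts, compared line by line."""
    return list(difflib.unified_diff(
        native_text.splitlines(),
        digitized_text.splitlines(),
        fromfile="native_read",
        tofile="digitized",
        lineterm="",
    ))


# Hypothetical sample: two lines of the native read end up merged into one.
native_text = "Name: A. Smith\nAmount: 12.50\nDate: 2020-01-31"
digitized_text = "Name: A. Smith Amount: 12.50\nDate: 2020-01-31"

diff = line_structure_diff(native_text, digitized_text)
print("\n".join(diff))
```

Lines starting with `-` exist only in the native read and lines starting with `+` exist only in the digitized text, so merged or reordered lines are easy to spot.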

Hello @Mombas ,

Could you please explain your requirement with some screenshots if you need assistance with your automation? If you can give some insight into your scenario, maybe we can help you.

@Mombas would you be able to also share the documents/workflow so that we can reproduce the issue?