For data extraction from PDF documents, I use the PreserveFormatting option of the ReadPDFText activity. It is great for extracting fields that have a special format (delimited spacing between them, or other…).
Unfortunately, to use the Validation Station I have to use the Digitize Document activity with an OCR engine (it doesn’t use OCR if the file is a native PDF).
BUT there is no “PreserveFormatting” option there, and for some document structures it messes up all the numbers (the order is no longer correct, and so on).
So I think a “PreserveFormatting” option for native PDFs would be a great feature to have.
So far I have not found a workaround, because I can’t feed Create document validation any extracted text that doesn’t come from Digitize Document (otherwise it says the DOM doesn’t match the extracted text).
Well, imagine various documents with multiple types of values/fields in a more or less organized format. Sometimes values can be missing, so it is important to know the ‘position’ of the text.
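To illustrate why position matters when a value is missing, here is a minimal Python sketch. It is not UiPath code: the column offsets, field names, and sample rows are all made up for the example, and it only assumes that the text keeps its original column alignment.

```python
# Hypothetical column offsets (start, end); real documents would differ.
COLUMNS = {"invoice_no": (0, 12), "date": (12, 24), "amount": (24, 36)}

def split_fixed_width(line, columns=COLUMNS):
    """Slice a layout-preserved line by column offsets; blank cells become None."""
    width = max(end for _, end in columns.values())
    padded = line.ljust(width)
    result = {}
    for name, (start, end) in columns.items():
        cell = padded[start:end].strip()
        result[name] = cell or None
    return result

# Sample rows built with explicit padding so the alignment is unambiguous.
full_row = f"{'INV-001':<12}{'2021-03-01':<12}{'150.00':<12}"
missing_date = f"{'INV-002':<12}{'':<12}{'99.50':<12}"

# With the layout preserved, the empty date column is still detectable.
preserved = split_fixed_width(missing_date)

# Once whitespace is collapsed, only two tokens remain and nothing says
# which of the three columns was empty.
collapsed_tokens = " ".join(missing_date.split()).split(" ")
```

In the collapsed version the second token could be a date or an amount, which is exactly the ambiguity that preserved spacing avoids.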
In version 1 of the robot, there was no IntelligentOCR and we did it with a lot of regexes (10 types of documents, 50 fields, each with its own regex, some of them identical; all these regexes are saved in a table). It worked quite well except for some cases.
Then came Document Understanding, and we wanted to use it to handle the cases where v1 of the robot failed. But because the text output of IntelligentOCR is NOT the same as the output of Read PDF with ‘preserve formatting’ set to true, many of the regexes are no longer valid. So I had to find some workarounds.
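A small Python sketch of the problem, under stated assumptions: the pattern and both sample texts are invented for illustration and are not actual output of either activity. It only shows how a regex written against layout-preserved text can stop matching once the same content is emitted in a different order.

```python
import re

# Hypothetical pattern: expects the label and its value on the same line,
# separated by a run of at least two spaces.
TOTAL_PATTERN = re.compile(r"^Total[ ]{2,}(\d+\.\d{2})[ ]*$", re.MULTILINE)

# Made-up example of text with the original layout preserved.
layout_preserved = (
    "Item        Qty    Price\n"
    "Widget      2      10.00\n"
    "Total              20.00\n"
)

# Made-up example of the same content reflowed, labels first and values after.
reflowed = "Item\nWidget\nTotal\nQty\n2\nPrice\n10.00\n20.00\n"

def extract_total(text):
    """Return the captured total, or None when the pattern does not match."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None

from_preserved = extract_total(layout_preserved)
from_reflowed = extract_total(reflowed)
```

Here the pattern finds the total only in the layout-preserved text. In the reflowed text the value is still present, but it is no longer next to its label, so the regex returns nothing.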
Therefore, I thought it would be quite useful to have a ‘preserve formatting’ option in IntelligentOCR as well. (In some cases, it is not possible to extract the value at all without preserve formatting.)
My intuition is that you have actually found a bug. In theory, the Digitize Document activity should preserve the formatting, and you shouldn’t be seeing this faulty behaviour. Could you share the workflow, a sample document, and the steps to reproduce, so that we can look into your issue?
In my case, for one process I’m working on, the Digitize Document text with OCR does not keep the same line structure as the text from a native PDF read.
This leads to extraction failures in a few exceptional files.
If you need help with your automation, could you please explain your requirement with some screenshots? If you can give us some insight into your scenario, maybe we can help.
Hi Rahul,
I am having a similar issue: Digitize Document does not preserve the PDF format. Attached are two files, one produced with the native PDF reader and the other with Digitize Document. The table is messed up.
I tried Digitize Document with and without OCR, with the same result.
Hi @loginerror, I can’t manage to find this feature in the latest version of the Digitize Document activity.
I tried to use the activity without an OCR engine added, and it throws an error (screenshot added below).
Can you give me an example of using this activity without OCR?