Hi Community,
I have a use case where extracting text from PDF documents with the layout retained is important, because the extracted text feeds a downstream regex process that pulls out specific values.
I’ve used the Read PDF Text activity for this, as the PDFs are native rather than scanned, and have ticked the PreserveFormatting option to try to achieve this: https://docs.uipath.com/activities/docs/read-pdf-text
For the most part this activity works well. However, I did find instances where, for the same type of document, different copies (same template, only the data differs) produce text output in which certain elements end up in different layouts/positions.
To illustrate the above,
Original PDF Document Layout
- the item of interest’s position on the PDF itself is the same across different copies
Document 1 Text Extract
Document 2 Text Extract
Sometimes the ‘key’ and the ‘value’ appear on the same line in the text output, but not always, even though they are on the same line in the PDF itself.
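To make the impact on the downstream regex concrete, here is a minimal Python sketch of a stopgap I've been considering: a pattern that tolerates a line break between the key and the value. The label "Invoice No" and both sample extracts are invented for illustration and are not taken from my actual documents.

```python
import re

# Hypothetical extracts: same field, but the value lands on a different
# line in the second output (label and values are made up).
doc1_text = "Invoice No:    12345\nDate: 2021-01-01\n"
doc2_text = "Invoice No:\n\n      67890\nDate: 2021-01-02\n"

# \s* matches spaces AND newlines, so the value is captured whether it
# follows the key on the same line or on a later one.
KEY_VALUE = re.compile(r"Invoice No\s*:\s*(\S+)")


def extract_invoice_no(text):
    """Return the first non-whitespace token after the key, or None."""
    match = KEY_VALUE.search(text)
    return match.group(1) if match else None


print(extract_invoice_no(doc1_text))
print(extract_invoice_no(doc2_text))
```

This only helps when the value still comes after the key in the output. If the extraction places the value before the key, or interleaves other text between them, the pattern would miss it or capture the wrong token, which is why I'd prefer a consistent extraction in the first place.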
- Has anyone who has used this activity in more depth got any insight into why this happens?
- Are there other ways (a UiPath activity, an external NuGet package, a Python library, or a third-party plugin) to extract the PDF text with the layout retained, and ideally with consistent text output across documents?
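On the Python side, one direction I'm weighing is to skip the tool's own line reconstruction and rebuild lines myself from word coordinates. The sketch below assumes each word arrives as a dict with `text`, `x0`, and `top` keys (I believe pdfplumber's `extract_words()` returns dicts of roughly this shape, but I haven't verified it against my documents). The function name `words_to_lines` and the `y_tolerance` value are my own placeholders.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word boxes into text lines by vertical position.

    Words whose `top` is within `y_tolerance` of the first word of the
    current line are treated as being on that line.
    """
    lines = []  # each entry: {"top": float, "words": [word, ...]}
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1]["top"]) <= y_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"top": word["top"], "words": [word]})
    # Within each line, order words left to right and join with spaces.
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x0"]))
        for line in lines
    ]


# Hypothetical word boxes: the value sits about a point lower than its key,
# the kind of small offset that could split them across lines.
sample_words = [
    {"text": "12345", "x0": 300.0, "top": 101.2},
    {"text": "Invoice", "x0": 50.0, "top": 100.0},
    {"text": "No:", "x0": 95.0, "top": 100.0},
    {"text": "Date:", "x0": 50.0, "top": 120.0},
    {"text": "2021-01-01", "x0": 300.0, "top": 119.5},
]

for line in words_to_lines(sample_words):
    print(line)
```

The idea is that the tolerance is under my control, so a key and value that are visually on the same row should always end up on the same output line. It doesn't preserve horizontal spacing, and whether small baseline differences are actually what causes the inconsistency in Read PDF Text is only my guess.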
Appreciate the input, thanks!