Text Extraction From PDF - With Layout Retained

Hi Community,

I have a use case where extracting text from PDF documents with the layout retained is important, because the output feeds a downstream regex process that pulls out specific values.

I’ve used the Read PDF Text activity for this, since the PDFs are native rather than scanned, and have checked the PreserveFormatting flag to try to achieve this: https://docs.uipath.com/activities/docs/read-pdf-text

For the most part this activity works well. However, I did find instances where, for the same type of document across multiple copies (copies that differ only in their data), certain elements end up with a different layout or position in the text output.

To illustrate the above:

Original PDF Document Layout - the item of interest’s position on the PDF itself is the same across different copies

Document 1 Text Extract

Document 2 Text Extract

Sometimes the ‘key’ and the ‘value’ appear on the same line in the text output, but not always, even though on the PDF itself they are always on the same line.
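
To make the downstream impact concrete, here is a minimal Python sketch of a regex that accepts both outputs, i.e. the value either on the same line as the key or on the line after it. The label "Invoice No" and the sample strings are hypothetical stand-ins, not taken from the real documents, and the pattern assumes the value is a single token without spaces.

```python
import re

# Hypothetical field label; substitute the real key from the document.
KEY = "Invoice No"

# After the key, allow an optional colon and any run of whitespace
# (spaces OR line breaks), then capture the first whitespace-free token.
PATTERN = re.compile(re.escape(KEY) + r"\s*:?\s*(\S+)")


def find_value(text):
    """Return the token that follows KEY, or None if KEY is absent."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Made-up extracts imitating the two layouts described above.
same_line = "Invoice No:      INV-001\nDate: 01/02/2020"
next_line = "Invoice No:\n      INV-002\nDate: 01/02/2020"

print(find_value(same_line))
print(find_value(next_line))
```

This only helps when the value still comes right after the key in reading order. If the value is empty, the pattern would capture whatever token comes next, so it is a workaround rather than a fix for the inconsistent extraction.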

  • Can anyone who has used this activity in more depth shed some light on why this happens?

  • Are there other ways (a UiPath activity, an external NuGet package, a Python library, or a third-party plugin) to extract the PDF text with the layout retained, ideally with consistent text output?
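
On the Python side, one direction I am considering is to take word-level coordinates from a PDF library and rebuild the lines myself with an explicit vertical tolerance, so that the line-grouping rule is under my control. I don't know how Read PDF Text decides line breaks internally, so this is only a guess at the cause. The sketch below assumes each word is a dict with `text`, `x0` and `top` keys (as far as I know, the shape that pdfplumber's `extract_words()` returns); the function name `rebuild_lines`, the tolerance value and the sample words are all hypothetical.

```python
def rebuild_lines(words, y_tolerance=3.0):
    """Group words into text lines by vertical position.

    words: list of dicts with "text", "x0" (left edge) and "top"
    (distance from the top of the page). This shape is an assumption.
    A word joins the current line when its "top" is within
    y_tolerance of the first word placed on that line.
    """
    lines = []  # each entry is [anchor_top, [words on that line]]
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word["top"], [word]])
    # Within each line, order words left to right and join with one space.
    return [
        " ".join(w["text"] for w in sorted(group, key=lambda w: w["x0"]))
        for _, group in lines
    ]


# Made-up coordinates: the value sits 1.5 units lower than its key.
sample_words = [
    {"text": "Invoice", "x0": 10, "top": 100.0},
    {"text": "No:", "x0": 50, "top": 100.0},
    {"text": "INV-001", "x0": 200, "top": 101.5},
    {"text": "Date:", "x0": 10, "top": 120.0},
]

print(rebuild_lines(sample_words))
print(rebuild_lines(sample_words, y_tolerance=1.0))
```

With the default tolerance the key and value land on one line; with a tolerance of 1.0 the value drops onto its own line, which looks similar to the inconsistency described above. Note that this joins words with single spaces, so it keeps the line structure but not the horizontal spacing.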

Appreciate the input, thanks!

@warren_lee

Could you share the sample PDFs?

Hi @kumar.varun2 ,

The PDF contains client-sensitive information, so I won’t be able to share it publicly here on the forum.
If there’s anything you want to check or try, I can run it on my end and share the outcome afterwards.