Text Extraction From PDF - With Layout Retained

warren_lee · August 17, 2021, 1:47am

Hi Community,

I have a use case where extracting text from PDF documents with the layout retained is important, as the output feeds a downstream regex process for specific extraction.

I’ve used the Read PDF Text activity for this, as the PDFs are native and not scanned, and have checked the PreserveFormatting flag to try to achieve this: https://docs.uipath.com/activities/docs/read-pdf-text

For the most part this activity works well. However, I did find instances where, for the same type of document across multiple copies (in the sense that only the data is different), the text output of certain elements has a different layout/position.

To illustrate the above,

Original PDF Document Layout - the item of interest’s position on the PDF itself is the same across different copies

Document 1 Text Extract

Document 2 Text Extract

Sometimes the ‘key’ and the ‘value’ appear on the same line in the text output, but not always, even though they’re on the same line in the PDF itself.
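One way to make the downstream regex step less sensitive to this is to let the pattern accept any whitespace, including line breaks, between the key and the value. The snippet below is only a minimal sketch in Python: the helper name `find_value`, the label "Invoice No" and the sample strings are made up for illustration and are not taken from the actual documents.

```python
import re
from typing import Optional


def find_value(text: str, key: str) -> Optional[str]:
    """Return the first token that follows `key`, whether it sits on the
    same line or on a later line. `key` is a hypothetical label."""
    # Allow any run of whitespace between the words of the key itself,
    # since layout-preserving output may pad them with extra spaces.
    key_pattern = r"\s+".join(re.escape(part) for part in key.split())
    # \s also matches newlines, so the value may follow a line break.
    pattern = re.compile(key_pattern + r"\s*:?\s*(\S+)")
    match = pattern.search(text)
    return match.group(1) if match else None


# Made-up extracts that mimic the two behaviours described above.
same_line = "Invoice No:        12345\nDate:   2021-08-17"
split_line = "Invoice No:\n\n        12345\nDate:   2021-08-17"

print(find_value(same_line, "Invoice No"))
print(find_value(split_line, "Invoice No"))
```

The trade-off is that if a value is genuinely missing, this pattern will pick up whatever token comes next, so it would need a stricter value pattern (digits only, a date format, and so on) in practice.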

Just wondering if anyone who has used this activity in more detail can shed more light on why this happens?
Are there other ways (whether it be a UiPath activity, an external NuGet package, Python libraries, or third-party plugins) to extract the PDF text with the layout retained, and ideally also produce consistent text output?
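On the Python side, one possible approach is to work from per-word coordinates rather than from pre-assembled text, and rebuild the lines yourself with a fixed vertical tolerance so that the same template always yields the same line breaks. The sketch below assumes the words have already been obtained from some PDF library as dictionaries with `text`, `x0` and `top` keys (pdfplumber's `extract_words()` returns data in roughly this shape). The function name, the tolerance and the column scale are illustrative assumptions, and the sample coordinates are invented.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def words_to_layout_text(words: List[Word],
                         y_tolerance: float = 3.0,
                         units_per_char: float = 5.0) -> str:
    """Group positioned words into lines and pad them to approximate columns.

    Words whose `top` is within `y_tolerance` of the first word of the
    current line are treated as belonging to that line.
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    rendered = []
    for line in lines:
        line.sort(key=lambda w: w["x0"])
        text = ""
        for word in line:
            column = int(word["x0"] / units_per_char)
            if text:
                # Pad to the target column, keeping at least one space.
                text = text.ljust(max(column, len(text) + 1))
            else:
                text = " " * column
            text += word["text"]
        rendered.append(text)
    return "\n".join(rendered)


# Invented coordinates: key and value share a row but differ slightly in `top`.
sample_words = [
    {"text": "Invoice", "x0": 10, "top": 100.0},
    {"text": "No:", "x0": 50, "top": 100.4},
    {"text": "12345", "x0": 200, "top": 98.9},
    {"text": "Date:", "x0": 10, "top": 120.0},
    {"text": "2021-08-17", "x0": 200, "top": 120.2},
]

print(words_to_layout_text(sample_words))
```

Because the grouping rule is explicit, small differences in vertical position between a key and its value no longer decide whether they end up on the same line; the tolerance would need tuning against the real documents.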

Appreciate the input, thanks!

kumar.varun2 · August 17, 2021, 2:04am

@warren_lee

Could you share the sample PDFs?

warren_lee · August 18, 2021, 1:33am

Hi @kumar.varun2 ,

The PDF has client-sensitive info, so I won’t be able to share it publicly here on the forum.
If there’s anything you want to check/try, I can look at it on my end and share the outcome afterwards?
