Function 'Read PDF Text' does not handle header and footer correctly

Mark_Vollenbrock · January 7, 2019, 10:16am

Hi,

When I convert a 3-page PDF into text, I expect the following string:

Header 1
Body1
Footer 1
Header 2
Body2
Footer 2
Header 3
Body3
Footer 3

But this is what I get now:

Header 1
Footer 1
Body1Header 2
Footer 2
Body2Header 3
Footer 3
Body3

The biggest problem is that the header of the next page is appended directly to the last line of the body, with no line break or space in between.

Example

Last line on page 1: 18 august 2018
The string will look like this: 18 august 20182

1.pdf (182.7 KB)
Main.xaml (6.7 KB)

Tuhin_Samanta · January 8, 2019, 4:35pm

Hi Mark,

You can create a custom activity based on the iTextSharp API to parse the PDF text. It works fine and gives the intended result.

Link: GitHub - itext/itextsharp: [DEPRECATED] .NET port of the iText library, only security fixes will be added — please use iText 7 for .NET
c# - Itextsharp text extraction - Stack Overflow

Regards,
Tuhin

Mark_Vollenbrock · January 9, 2019, 10:37am

Hi Tuhin,

I'd like to keep it simple.
I hope this function will be updated in a future release.

Mark

Tuhin_Samanta · January 9, 2019, 10:59am

Hi Mark,

The API used in the Read PDF Text activity is a free library that parses the data that way. Yes, we will definitely consider these shortcomings in future releases.
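
Until then, one possible workaround is to post-process the returned string. Below is a minimal Python sketch, not part of the activity itself. It assumes the header can be recognised by a regular expression (the pattern `Header \d+` is only a placeholder taken from the example above) and that the footer is the single line directly after the header, as in the output shown. The function name `reorder_pages` is hypothetical.

```python
import re


def reorder_pages(raw: str, header_pattern: str) -> str:
    """Move each page's footer line from just after the header to after the body.

    Assumes every page starts with a header matching header_pattern and that
    the footer is exactly one line following the header.
    """
    # Find every header, including those glued to the end of the previous body.
    starts = [m.start() for m in re.finditer(header_pattern, raw)]
    if not starts:
        return raw  # no header found; leave the text untouched

    out = []
    # Keep any text that appears before the first header.
    prefix = raw[:starts[0]].strip("\n")
    if prefix:
        out.append(prefix)

    ends = starts[1:] + [len(raw)]
    for begin, end in zip(starts, ends):
        lines = raw[begin:end].strip("\n").split("\n")
        header, footer, body = lines[0], lines[1:2], lines[2:]
        # Emit the page in reading order: header, body, footer.
        out.extend([header, *body, *footer])
    return "\n".join(out)


raw = "Header 1\nFooter 1\nBody1Header 2\nFooter 2\nBody2Header 3\nFooter 3\nBody3"
fixed = reorder_pages(raw, r"Header \d+")
print(fixed)
```

This only works when the header pattern is specific enough. A bare page number as header, like the trailing 2 in "18 august 20182", cannot be separated reliably this way.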

Regards,
Tuhin
