Function 'Read PDF Text' does not handle headers and footers correctly

Hi,

When I convert a three-page PDF into text, I expect the following string:

Header 1
Body1
Footer 1
Header 2
Body2
Footer 2
Header 3
Body3
Footer 3

But this is what I get now:

Header 1
Footer 1
Body1Header 2
Footer 2
Body2Header 3
Footer 3
Body3

The biggest problem is that the header of the next page is appended directly to the last line of the body, with no line break in between.

Example

Last line on page 1: 18 august 2018
The string will look like this: 18 august 20182

1.pdf (182.7 KB)
Main.xaml (6.7 KB)

Hi Mark,

You can create a custom Activity based on the iTextSharp API to parse the PDF text. It works fine and gives the intended result.

Link: GitHub - itext/itextsharp: [DEPRECATED] .NET port of the iText library, only security fixes will be added — please use iText 7 for .NET
c# - Itextsharp text extraction - Stack Overflow
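
Whatever extractor you end up using, the fix-up itself is small. Here is a rough sketch in Python (not the iTextSharp API, just the post-processing idea). It assumes you extract the text one page at a time, that each page comes back in the order shown in the question (header, footer, body), and that the header and footer are one line each. The names `reorder_page` and `join_pages` are made up for this illustration.

```python
def reorder_page(lines):
    """Move the footer (assumed to be the second line) to the end of the page."""
    if len(lines) < 3:
        # Too short to contain header, footer and body: leave it untouched.
        return list(lines)
    header, footer, *body = lines
    return [header, *body, footer]


def join_pages(pages):
    """Join per-page text with explicit newlines so pages never run together."""
    fixed = []
    for page_text in pages:
        # Drop empty lines so the header/footer positions are predictable.
        lines = [ln for ln in page_text.splitlines() if ln.strip()]
        fixed.extend(reorder_page(lines))
    return "\n".join(fixed)


# Per-page text in the order reported in this thread (header, footer, body).
pages = [
    "Header 1\nFooter 1\nBody1",
    "Header 2\nFooter 2\nBody2",
    "Header 3\nFooter 3\nBody3",
]
result = join_pages(pages)
print(result)
```

Because the pages are joined with an explicit newline, the last body line of one page can no longer run into the header of the next. If your headers or footers span several lines, the fixed positions assumed here will not hold and you would need a different rule.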

Regards,
Tuhin

Hi Tuhin,

I'd like to keep it simple.
I hope this function will be updated in a future release.
:smiley:

Mark

Hi Mark,

The Read PDF Text Activity uses a free library that parses the data this way. We will definitely consider these shortcomings in future releases. :slight_smile:

Regards,
Tuhin
