Read PDF - same content, different layout

Michael_Pittner · January 26, 2021, 10:02am

Hi guys,

I have three single-page PDF files with the same content but different layouts.

Is there a way to get the same text from all three files?

I tried “Read PDF Text” with the following result:

The first example in particular is missing some content at the beginning of the text.

Thanks a lot.

Aleem_Khan · January 26, 2021, 12:11pm

Use anchor-based screen scraping.

Michael_Pittner · January 26, 2021, 12:22pm

Okay, and which element should be the anchor?

prasath17 · January 26, 2021, 12:59pm

What do you mean by this?

Try setting Preserve Formatting to True in the Read PDF Text activity.

Michael_Pittner · January 26, 2021, 1:11pm

Well, I just want the whole content of the table in one single output string, like in example no. 3.

What I mean is that in example no. 1 the whole first table cell “45500” is missing, and the “EUR” of the third cell is converted to “E”.

In example no. 2, the “92” at the end of the output string is in the wrong position.

Do you understand what I mean?

I already tried setting “Preserve Formatting” to True, but it doesn’t change anything in the result.
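
To pinpoint differences like the missing “45500” or the “EUR” that became “E”, the extracted strings can be compared token by token, ignoring order and whitespace. The following is a minimal sketch in Python, not UiPath code; the function name `token_diff` and the sample strings are hypothetical and only mimic the kind of output described above.

```python
from collections import Counter


def token_diff(reference, candidate):
    """Compare two extracted texts as bags of whitespace-separated tokens.

    Returns (missing, extra): tokens present in the reference but not in
    the candidate, and tokens present only in the candidate.
    """
    ref_tokens = Counter(reference.split())
    cand_tokens = Counter(candidate.split())
    missing = sorted((ref_tokens - cand_tokens).elements())
    extra = sorted((cand_tokens - ref_tokens).elements())
    return missing, extra


if __name__ == "__main__":
    # Hypothetical sample strings, not the real PDF output from this thread.
    good_output = "45500 Consulting 120,00 EUR"
    bad_output = "Consulting 120,00 E"
    missing, extra = token_diff(good_output, bad_output)
    print("missing:", missing)
    print("extra:", extra)
```

A check like this only reports which tokens were lost or altered; it does not restore the table order.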

prasath17 · January 26, 2021, 1:16pm

Yes. When you convert a PDF to text, this is the kind of output you get; it is not a 1:1 conversion.

If your PDF is an image, try reading it with the Read PDF With OCR activity.

If that doesn’t work either, you can try the CV activities (if allowed in your organization). There is a CV extract table activity that converts a table to a data table.

Aleem_Khan · January 26, 2021, 3:06pm

PDF read Forum Qurey.zip (660.7 KB)
Use this workflow to scrape data from the PDF, or use screen scraping with a dynamic selector and a For Each loop. I hope this is helpful; if it is, please mark it as the solution.

Michael_Pittner · January 26, 2021, 3:24pm

Unfortunately, it does not work,
because my “PDF table” looks slightly different from yours (please have a look at the example above).

Aleem_Khan · January 26, 2021, 3:54pm

Then use screen scraping; it will work.

Michael_Pittner · January 27, 2021, 8:15am

Not really…
The result is not in a structured format, but mixed up.

Using the “full text” option:

Using the “OCR” option:

I can’t identify the table fields based on their content.
I need the result in exactly the same order as it is shown in the PDF table.

Aleem_Khan · January 28, 2021, 6:25pm

Use string manipulation to get the required result.

Cristian_Negulescu · February 28, 2021, 8:08pm

Hello Michael,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:

In your case, you need to filter out multiple spaces. Here are the relevant timestamps:

42:15 File 9: PDF with multiple spaces that need to be corrected
1:12:50 File 17: simple PDF, remove spaces from headers and also from data

Code:

github.com

cristinegulescu/startUiPathFromSalesforce/blob/master/PDFdecode.txt

        'FILE1
        Dim strtmp As String
        strtmp = strin.Substring(strin.IndexOf("Number"), strin.IndexOf("Subtotal") - strin.IndexOf("Number")).Trim
        strout = strtmp.Replace(" ", "|")

        strtmp = strin.Substring(strin.IndexOf("Subtotal") + 8)
        strpar = strtmp.Substring(0, strtmp.IndexOf(Environment.NewLine)).Trim


        'FILE2
        Dim strtmp As String
        Dim strout As String
        strout = "Col1|Col2|Col3|Col4"
        strtmp = strin.Substring(strin.IndexOf("Vacancies") + 11).Trim
        For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
            If (line.Length > 3) Then
                If (IsNumeric(line(0))) And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + Environment.NewLine + line.Replace("  ", "").Replace("  ", "|").Trim
                ElseIf (line(0) = "") And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + line.Replace("  ", "$").Trim()

This file has been truncated.
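
As a rough illustration of the multiple-space filtering idea (not the code from the video or from the repository above), the sketch below treats every run of two or more whitespace characters as a column gap and joins the cells with “|”. It is written in Python rather than VB.NET; the function names and the sample text are hypothetical, and it assumes that single spaces only occur inside a cell.

```python
import re
from typing import List


def row_to_cells(line: str) -> List[str]:
    """Split one text line into cells at runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def text_to_pipe_table(text: str) -> str:
    """Convert space-aligned text into pipe-delimited rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        cells = row_to_cells(line)
        if cells:
            rows.append("|".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical sample of "preserve formatting"-style output.
    sample = "Number   Description      Amount\n\n1        Consulting fee   120,00 EUR\n"
    print(text_to_pipe_table(sample))
```

The pipe-delimited rows can then be split on “|” to build a table. If a PDF separates columns with only a single space, this heuristic will merge them into one cell, so the threshold would need adjusting per layout.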

Thanks,
Cristian Negulescu

Michael_Pittner · March 2, 2021, 11:18am

Hi Cristian,

Many thanks for this video.
I’ll check whether it works for my case.

Regards,
Michael
