Hi guys,
I have three single-page PDF files with the same content but different layouts.
Is there a way to get the same text from all three files?
I tried “Read PDF Text” with the following result:
The first example in particular is missing some content at the beginning of the text.
Thanks a lot.
Aleem_Khan
(crazy bot)
January 26, 2021, 12:11pm
2
Use anchor-based screen scraping.
Okay, and which element should be the anchor?
What do you mean by this?
Try setting Preserve Formatting to True in the Read PDF Text activity.
Well, I just want the whole content of the table in one single output string, like in example no. 3.
What I mean is: in example no. 1, the whole first table cell “45500” is missing, and the “EUR” in the third cell is converted to “E”.
In example no. 2, the “92” at the end of the output string is in the wrong position.
Do you understand what I mean?
I already tried setting “Preserve Formatting” to True, but it doesn’t change anything in the result.
Yes, when you convert a PDF to text, this is the kind of output you get; it is not a 1:1 conversion.
If your PDF is an image, try reading it with the Read PDF With OCR activity.
If that doesn’t work, you can try the CV activities (if they are allowed in your organization). There is a CV extract table activity that converts a table to a data table.
PDF read Forum Qurey.zip (660.7 KB)
Use this workflow to scrape data from the PDF, or use screen scraping with a dynamic selector and a For Each loop. I hope this is helpful for you; if it is, please mark it as the solution.
Unfortunately, that does not work,
because my PDF table looks slightly different from yours (please have a look at the example above).
So use screen scraping; it will work.
not really…
The result is not in a structured format but mixed up.
using “full text” option:
using “OCR” option:
I can’t identify the table fields based on their content.
I need the result in exactly the same order as it is shown in the PDF table.
Aleem_Khan
(crazy bot)
January 28, 2021, 6:25pm
11
Use string manipulation to get the required result.
Hello Michael,
In this video, I cover 17 use cases for extracting tables from PDFs and writing the data to Excel:
In your case, you need to filter out multiple spaces. Here are the timestamps for you:
42:15 File 9: PDF with multiple spaces that need to be corrected
1:12:50 File 17: simple PDF, remove spaces from the headers and also from the data
Code:
'FILE1
Dim strtmp As String
strtmp = strin.Substring(strin.IndexOf("Number"), strin.IndexOf("Subtotal") - strin.IndexOf("Number")).Trim
strout = strtmp.Replace(" ", "|")
strtmp = strin.Substring(strin.IndexOf("Subtotal") + 8)
strpar = strtmp.Substring(0, strtmp.IndexOf(Environment.NewLine)).Trim
'FILE2
Dim strtmp As String
Dim strout As String
strout = "Col1|Col2|Col3|Col4"
strtmp = strin.Substring(strin.IndexOf("Vacancies") + 11).Trim
For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
    If (line.Length > 3) Then
        If (IsNumeric(line(0))) And (line(1) = " ") And (line(2) = " ") Then
            strout = strout + Environment.NewLine + line.Replace(" ", "").Replace(" ", "|").Trim
        ElseIf (line(0) = "") And (line(1) = " ") And (line(2) = " ") Then
            strout = strout + line.Replace(" ", "$").Trim()
This file has been truncated.
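As a rough illustration of the "filter multiple spaces" idea (this is not the code from the video), here is a minimal Python sketch. It assumes the extracted text keeps at least two spaces between table cells and only single spaces inside a cell, which may not hold for every PDF. The function names and the sample rows are made up for illustration; only “45500” and “EUR” come from the example in this thread. The same logic could probably be ported to VB.NET for an Invoke Code activity.

```python
import re


def split_columns(line, min_gap=2):
    """Split one line of extracted text into cells at runs of >= min_gap spaces."""
    pattern = r" {%d,}" % min_gap
    # Single spaces inside a cell (e.g. "1 200,00") are left untouched.
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def table_to_pipe_text(text):
    """Turn space-aligned table text into pipe-delimited rows, skipping empty lines."""
    rows = []
    for line in text.splitlines():
        cells = split_columns(line)
        if cells:
            rows.append("|".join(cells))
    return "\n".join(rows)


# Hypothetical sample, shaped like the output of a text extraction step.
sample = "Number   Amount    Currency\n45500    1 200,00  EUR\n"
print(table_to_pipe_text(sample))
```

If the three layouts produce different spacing, `min_gap` is the knob to experiment with; when cells are separated by only one space, this approach cannot tell cell boundaries from ordinary word gaps.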
Thanks,
Cristian Negulescu
Hi Cristian,
Many thanks for this video.
I’ll check whether it works for my case.
Regards,
Michael