Extract values in PDF

Noor_Shaik · February 15, 2021, 11:48am

I need to extract data from a PDF or text file. I am using regex, but the patterns sometimes change. I need the highlighted value.

SenzoD · February 15, 2021, 12:12pm

Which OCR engine are you using here? And have you tried Document Understanding with the Regex Based Extractor?

If I were you, I would write a while loop that finds the word TAX and then reads each value in that column one by one. You might need an OCR engine for this, though (Microsoft OCR usually preserves the layout of the data, so it still looks the same as it did inside the table).
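
A minimal sketch of that loop in VB.NET, the language UiPath's Invoke Code activity accepts. The PDF layout from the original post is not shown, so the sample text, the two-or-more-spaces column separator, and the helper name `ExtractColumnValues` are all assumptions for illustration:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module TaxColumnSketch
    ' Hypothetical helper: returns the values found under the column whose header is headerName.
    ' Assumes columns are separated by two or more spaces (layout-preserving text output).
    Function ExtractColumnValues(text As String, headerName As String) As List(Of String)
        Dim values As New List(Of String)
        Dim lines As String() = text.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)
        Dim columnIndex As Integer = -1
        Dim i As Integer = 0

        While i < lines.Length
            Dim cells As String() = Regex.Split(lines(i).Trim(), "\s{2,}")
            If columnIndex < 0 Then
                ' Still looking for the header row that contains headerName
                columnIndex = Array.IndexOf(cells, headerName)
            ElseIf columnIndex < cells.Length Then
                ' Header already found: take the cell at the same position
                values.Add(cells(columnIndex))
            End If
            i += 1
        End While
        Return values
    End Function

    Sub Main()
        ' Made-up sample text standing in for the OCR / Read PDF Text output
        Dim sample As String = "Item   Qty   TAX" & vbLf & "Pen   2   0.40" & vbLf & "Book   1   1.25"
        For Each v As String In ExtractColumnValues(sample, "TAX")
            Console.WriteLine(v)
        Next
    End Sub
End Module
```

With the sample text this prints `0.40` and `1.25`. Rows that have fewer cells than the header are skipped, which is one simple way to tolerate lines that do not belong to the table.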

NIVED_NAMBIAR · February 15, 2021, 12:19pm

In addition to what @SenzoD said:

You can try to extract the table directly using the epsilon activity, which helps extract tables from PDFs, and then use DataTable manipulation to get the data you need.
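
A rough VB.NET sketch of the DataTable step, assuming the table-extraction activity has already returned a `DataTable`. The column names, the sample rows, and the helper name `GetColumnValues` are made up for illustration:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Data

Module DataTableSketch
    ' Hypothetical helper: collects the non-empty values of one column from an extracted table.
    Function GetColumnValues(table As DataTable, columnName As String) As List(Of String)
        Dim result As New List(Of String)
        ' Return an empty list when the column was not extracted at all
        If Not table.Columns.Contains(columnName) Then Return result
        For Each row As DataRow In table.Rows
            Dim value As String = row(columnName).ToString().Trim()
            If value.Length > 0 Then result.Add(value)
        Next
        Return result
    End Function

    Sub Main()
        ' Stand-in for the table a PDF table-extraction activity would return (made-up data)
        Dim dt As New DataTable()
        dt.Columns.Add("Item", GetType(String))
        dt.Columns.Add("TAX", GetType(String))
        dt.Rows.Add("Pen", "0.40")
        dt.Rows.Add("Subtotal", "")
        dt.Rows.Add("Book", "1.25")

        Console.WriteLine(String.Join(", ", GetColumnValues(dt, "TAX")))
    End Sub
End Module
```

With the sample rows this prints `0.40, 1.25`; the empty cell on the "Subtotal" row is skipped.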

Regards

Nived N

Happy Automation

GreenTea · February 15, 2021, 1:34pm

Hi @Noor_Shaik

Alternative:
Word 2016, 2019, or Office 365 can open your PDF. You can then extract the table from the Word document.

Cristian_Negulescu · February 28, 2021, 8:17pm

Hello Noor,
In this video, I go through 17 use cases for extracting tables from PDFs and writing the data to Excel:

2:00 GitHub free code for all the files
2:20 Logic of general workflow
4:40 File 1 simple PDF
9:50 File 2 PDF with a column with multiple lines
20:10 File 3 PDF with multiple words in the last column
27:00 File 5 PDF with multiple words in an inside column (2 columns)
31:40 File 6 PDF with a column with multiple lines
39:10 File 8 simple PDF
42:15 File 9 PDF with multiple spaces that need to be corrected
45:50 File 10 PDF with multiple columns that have multiple lines + multiple pages
55:50 File 11 simple PDF with protection empty Cells
58:35 File 12 Big PDF with an empty line and Empty columns and partial total
1:02:25 File 13 PDF with multiple columns that have multiple words and hard to define a rule
1:10:15 File 15 PDF with multiple columns that have multiple lines
1:12:50 File 17 simple PDF remove spaces from headers also remove space from Data
1:16:05 File 18 simple PDF
1:17:10 File 19 PDF with multiple pages and columns with multiple lines
1:22:10 File 20 PDF with multiple columns that have multiple lines
1:25:00 File 21 PDF with empty columns and subtotal

Code:

github.com

cristinegulescu/startUiPathFromSalesforce/blob/master/PDFdecode.txt

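        'FILE1
        Dim strtmp As String
        ' Take the text from the "Number" header up to (not including) "Subtotal" and use "|" as the column separator
        strtmp = strin.Substring(strin.IndexOf("Number"), strin.IndexOf("Subtotal") - strin.IndexOf("Number")).Trim
        strout = strtmp.Replace(" ", "|")

        ' Read the rest of the "Subtotal" line (8 = length of "Subtotal") as the subtotal value
        strtmp = strin.Substring(strin.IndexOf("Subtotal") + 8)
        strpar = strtmp.Substring(0, strtmp.IndexOf(Environment.NewLine)).Trim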


        'FILE2
        Dim strtmp As String
        Dim strout As String
        strout = "Col1|Col2|Col3|Col4"
        strtmp = strin.Substring(strin.IndexOf("Vacancies") + 11).Trim
        For Each line As String In strtmp.Split(New String() {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
            If (line.Length > 3) Then
                If (IsNumeric(line(0))) And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + Environment.NewLine + line.Replace("  ", "").Replace("  ", "|").Trim
                ElseIf (line(0) = "") And (line(1) = " ") And (line(2) = " ") Then
                    strout = strout + line.Replace("  ", "$").Trim()

This file has been truncated.
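
The FILE1 idea (cut the text between two anchor words, then turn the spaces into `|`) can be sketched as a self-contained VB.NET module. This is not taken from the linked file: the helper names `SliceBetween` and `ToPipeDelimited`, the sample text, the guards for missing anchor words, and the use of `Regex.Replace` to collapse runs of spaces are assumptions for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module PdfSliceSketch
    ' Hypothetical helper: returns the text starting at startWord and ending just before endWord,
    ' or an empty string when either anchor word is missing.
    Function SliceBetween(strin As String, startWord As String, endWord As String) As String
        Dim startPos As Integer = strin.IndexOf(startWord)
        If startPos < 0 Then Return String.Empty
        Dim endPos As Integer = strin.IndexOf(endWord, startPos)
        If endPos < 0 Then Return String.Empty
        Return strin.Substring(startPos, endPos - startPos).Trim()
    End Function

    ' Collapses each run of spaces or tabs into a single "|" so the columns can be split later.
    Function ToPipeDelimited(block As String) As String
        Return Regex.Replace(block, "[ \t]+", "|")
    End Function

    Sub Main()
        ' Made-up sample text standing in for the text read from the PDF
        Dim strin As String = "Invoice 42" & vbLf & "Number  Price" & vbLf & "1   9.50" & vbLf & "Subtotal 9.50" & vbLf
        Dim strout As String = ToPipeDelimited(SliceBetween(strin, "Number", "Subtotal"))
        Console.WriteLine(strout)
    End Sub
End Module
```

With the sample text this prints two lines, `Number|Price` and `1|9.50`. Checking the result of `IndexOf` before calling `Substring` avoids an exception when an anchor word is absent, which can happen when the PDF layout changes.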

Thanks,
Cristian Negulescu

GreenTea · March 1, 2021, 6:30am

Thank you very much @Cristian_Negulescu. This technique will greatly speed up the extraction process.

May_Guerrero · December 29, 2022, 8:25pm

OMG!!! After watching your video and reviewing your code, I improved my VB.NET skills, and boom.
Simply a life saver. Thank you, thank you, thank you.

seshu_u · June 16, 2023, 6:16am

Hi @Cristian_Negulescu, I need your support: I am struggling to extract data from a scanned PDF. The extracted data is not accurate. Kindly help me get accurate data from a scanned PDF (a table with columns and sub-columns).

Cristian_Negulescu · June 16, 2023, 6:33am

Hello Seshu,
If the data is not accurate, you need to change the OCR engine. I'm not able to solve this from the code.
The VB.NET code works well when the data is unstructured but accurate. Try Document Understanding, or you can try ChatGPT like this:
UiPath and ChatGPT extract Tables from PDF (use case) (PDF table) (ChatGPT prompts) - YouTube
Thanks,
Cristian
