PDF TABLES EXTRACTION EVEN WHEN ROWS ARE DYNAMIC

I need to extract multiple tables from multiple pdf’s where in each pdf table rows might be changing. Help me to get this done.

Thanks in advance.

Hi @bhanusai_sajja ,

Depending on whether the PDF documents are Digital or Scanned, we could provide further suggestions on what needs to be correct approach.

In Both cases, we would however require a view of the Data format in the PDF. Depending on that if the pattern remains the same we could maybe apply Regex and Extract the data.

If Digital Document, You could maybe also check with the below post :

Let us know if it doesn’t work and provide more info on the Data format of the PDF data.

Hi @supermanPunch,

First of all Thank you for your quick response

Your Workflow working fine for other pdf’s but when i am trying with my PDF


i am getting this pop up(Shown in Screenshot11) and throwing error(Screenshot12)

.
.
Can you please tell me how to resolve this error.

To extract tables from multiple PDFs where the table rows may vary in RPA (Robotic Process Automation) using R, you can use the package “pdftools”. Here’s an example of how you can extract tables from PDFs:

  1. First, you will need to install and load the pdftools package:
    install.packages(“pdftools”)
    library(pdftools)
  2. Next, read the PDF using the pdf_text() function and convert it to a data frame using the read_pdf() function from the “tidyverse” package:
    pdf_data ← pdf_text(“path/to/pdf.pdf”)
    pdf_tables ← read_pdf(pdf_data,pages = 1:5, encoding = “utf-8”)
  3. Now, you can use the functions from the “dplyr” and “tidyr” packages to clean and manipulate the data as needed.
    pdf_tables_clean ← pdf_tables %>%
    filter(!is.na(page)) %>%
    group_by(page) %>%
    mutate(row = row_number()) %>%
    spread(x, text)
  4. Repeat the above steps for all the pdfs in the directory.
    pdf_files ← list.files(“path/to/pdfs/”, full.names = TRUE, pattern = “*.pdf”)

pdf_list ← lapply(pdf_files, function(x){
pdf_data ← pdf_text(x)
pdf_tables ← read_pdf(pdf_data,pages = 1:5, encoding = “utf-8”)
pdf_tables_clean ← pdf_tables %>%
filter(!is.na(page)) %>%
group_by(page) %>%
mutate(row = row_number()) %>%
spread(x, text)
return(pdf_tables_clean)
})
5. Finally, you can use the bind_rows() function from the “dplyr” package to combine all the data frames into a single data frame.
all_pdf_tables ← bind_rows(pdf_list)
Note: This is a basic example and you might need to modify it according to your requirement like handling password protected pdfs, handling missing values, handling different pdf file format etc.
Also make sure that you have necessary dependencies installed and loaded.
Also, it’s better to use try-catch block while reading pdfs to handle any issues during the reading process.

hi @supermanPunch,
Can you please help me out of this.

Hi @bhanusai_sajja ,

I believe the Sample PDF shared is a sample data only and not of confidential data. If it is confidential, Do Let me know. However, I am excluding the PDF sample that was provide to me in private.

Check the below workflow :
PDF_To_Excel_Corrected.zip (4.0 KB)

The Data that you have provided, seemed to have some problem with the column name, and hence it was throwing out error. I did correct that part by implementing a counter/index to the column names if it was already having the same name.

Check the workflow for your PDF data and let us know what are your findings. There are some deviances in the column name part.

1 Like

Hi @supermanPunch,

Thanks for all your efforts on this. Currently we have 4 pdf’s, in future we may get more or less files than we are currently having but all files will have same format. But your workflow is working for only two pdf’s out of 4. Could you make it work for any pdf which will be in the format of i sent to you ?

This is how i am getting error when i run your workflow for one of the pdf files i have.

@bhanusai_sajja ,

Is it possible for you to point out what might be the possible difference between the PDF files ?

Could you also send the PDF which is not working (In private if confidential) ?

There is a possibility that this method might not work with all the PDF’s, but since it did work with the sample PDF’s shared, I did assume the same formatting would be followed and hence it would work. We might need to shift the approach to a different method if that is the case.

Hi @supermanPunch,

Forget about those pdf’s. I need a small help related to excel automation.

Check the “Sheet2” in the attached excel file in this sheet i have apple multiple times with amounts here i want total amount of apples but if i use message box it will display 24 two times as we have apple two times. But i want to display the 24 only once. Can we achieve this? Could you help me out how to do this ?

Below is the Zip folder that contains workflow(Use “Sequence”, ignore that "Main.xaml ") to give you idea of how i am doing, Excel file.

AppleSum.zip (31.0 KB)

Please help me out.

@bhanusai_sajja ,

As this is a Total Diversion from your original Topic, We would like to suggest to create a Separate Topic for this new case related to Excel Automation.

If you have found the solution for the original Topic you could post the Solution and Close this topic. If it is not yet solved, you could mention what are the shortcomings for the suggestions provided and keep the topic open, so that others could contribute to the Topic mentioned earlier.

Do note that we would not want to mix two different topics together.