Data scraping does not read the full table when it spans multiple pages of a PDF

Hi All,

Data scraping does not read the full table details when the data spans multiple pages of a PDF. How can I use a starting name and an ending name as reference points to extract the table details from the PDF? Please suggest.
Regards
Niranjan

@Niranjan_k

Try this workflow:

Read PDF Text activity (Range: "All", output: pdfText)
Assign startKeyword = "Start of Table"
Assign endKeyword = "End of Table"

Assign startIndex = pdfText.IndexOf(startKeyword)
Assign endIndex = If(startIndex >= 0, pdfText.IndexOf(endKeyword, startIndex + startKeyword.Length), -1)

If endIndex >= 0 (both keywords were found; otherwise log a message and stop)
    Assign tableStart = startIndex + startKeyword.Length
    Assign tableData = pdfText.Substring(tableStart, endIndex - tableStart)

Build DataTable activity (output: extractedTable)
Add Data Column activities for each column in the table

Assign rows = tableData.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For Each row In rows
    Assign columns = row.Split(","c)  ' Assuming the data is comma-separated
    Add Data Row activity (ArrayRow: columns) to extractedTable
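
The same slicing and row-splitting logic can be sketched outside UiPath to check it against sample text. This is only a minimal sketch: the helper names `slice_between` and `split_rows` are made up for illustration, and it assumes columns are separated by two or more spaces, which may not match your PDF.

```python
import re

def slice_between(text, start_keyword, end_keyword):
    """Return the text between the two keywords, or None if either is missing."""
    start = text.find(start_keyword)
    if start == -1:
        return None
    content_start = start + len(start_keyword)
    # Look for the end keyword only after the start keyword.
    end = text.find(end_keyword, content_start)
    if end == -1:
        return None
    return text[content_start:end]

def split_rows(block, column_gap=r"\s{2,}"):
    """Split a block into non-empty lines, then split each line on the column gap."""
    rows = []
    for line in block.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(column_gap, line))
    return rows

# Hypothetical sample text standing in for the Read PDF Text output.
sample = "Report\nStart of Table\nName  Qty\nBolt  10\nNut   25\nEnd of Table\nFooter"
block = slice_between(sample, "Start of Table", "End of Table")
rows = split_rows(block) if block is not None else []
print(rows)  # [['Name', 'Qty'], ['Bolt', '10'], ['Nut', '25']]
```

Returning None when a keyword is missing makes the "keyword not found" case explicit instead of failing inside the substring step.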

Cheers…!

@Dilli_Reddy could you please share the workflow? I'm not sure where to apply this logic. I just want to know which tools we need to use to build it.

You need the UiPath.PDF.Activities package, which provides the Read PDF Text activity.

  1. Create a Python Script:

  • Write a Python script that takes a PDF file as input, extracts the text, and returns the content between your starting and ending name references. You can use a Python library such as PyPDF2 or pdfplumber for this task.
import pdfplumber

def extract_table_details(pdf_path, start_name, end_name):
    """Return the text between start_name and end_name, searching across all pages."""
    with pdfplumber.open(pdf_path) as pdf:
        # Join every page first so a table that continues onto later pages is not cut off.
        # extract_text() may give nothing for a page without text, hence the `or ""`.
        full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    start_index = full_text.find(start_name)
    if start_index == -1:
        return ""  # start reference not found
    end_index = full_text.find(end_name, start_index)
    if end_index == -1:
        return ""  # end reference not found after the start
    return full_text[start_index:end_index]

# Example usage (runs only when the script is executed directly)
if __name__ == "__main__":
    pdf_path = "path/to/your/pdf/file.pdf"
    start_name = "Table Start"
    end_name = "Table End"
    result = extract_table_details(pdf_path, start_name, end_name)
    print(result)
  2. Invoke Python Script from UiPath:
  • Install the UiPath.Python.Activities package. Inside a Python Scope, use Load Python Script and then the “Invoke Python Method” activity to call extract_table_details. Pass the PDF file path, start name, and end name as arguments.
  3. Process the Result in UiPath:
  • Use Get Python Object to convert the returned value to a String in UiPath, then further process or save the extracted table details as needed.
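
If the script splits the text into rows itself, handing one JSON string back to UiPath is often easier than handing back a Python list. The following is a minimal sketch under assumptions: `rows_to_json` is a hypothetical helper, and it treats the first row as the column headers, which may not hold for your table.

```python
import json

def rows_to_json(rows):
    """Serialize a list of row lists into a JSON array of records.

    Assumes the first row holds the column names (an assumption, not a given).
    """
    if not rows:
        return "[]"
    header, *body = rows
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records)

# Hypothetical rows, as they might look after splitting the extracted text.
rows = [["Name", "Qty"], ["Bolt", "10"], ["Nut", "25"]]
payload = rows_to_json(rows)
print(payload)
```

On the UiPath side, the string could then be read with Get Python Object and parsed with an activity such as Deserialize JSON Array; treat that as one possible route rather than the required one.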

This logic is not working for me. I keep getting an Excel Application Scope error every time.

Can you please share the sample data?

@copy_writes Sorry, I do not have upload access to share a file; I'm creating this request on a personal system. In the PDF file, the table spans multiple pages, and I want to extract the full table data based on a start date and an end date.
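
Without sample data this can only be a rough sketch of date-based extraction. It assumes the table has already been split into rows, that each data row begins with a date, and that dates use the DD/MM/YYYY format; none of this is confirmed in the thread. `rows_in_date_range` is a hypothetical helper name.

```python
from datetime import datetime

def rows_in_date_range(rows, start_date, end_date, date_format="%d/%m/%Y"):
    """Keep rows whose first cell is a date between start_date and end_date (inclusive)."""
    start = datetime.strptime(start_date, date_format)
    end = datetime.strptime(end_date, date_format)
    selected = []
    for row in rows:
        try:
            row_date = datetime.strptime(row[0], date_format)
        except (ValueError, IndexError):
            # Skip header rows, repeated page headers, or empty rows.
            continue
        if start <= row_date <= end:
            selected.append(row)
    return selected

# Hypothetical rows collected from all pages of the PDF.
rows = [
    ["Date", "Amount"],
    ["01/03/2024", "100"],
    ["15/03/2024", "250"],
    ["02/04/2024", "75"],
]
print(rows_in_date_range(rows, "10/03/2024", "05/04/2024"))
# [['15/03/2024', '250'], ['02/04/2024', '75']]
```

Because rows that do not start with a date are skipped, header lines that repeat at the top of each page drop out without extra handling.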