Data scraping is not reading full table details in multiple pages in PDF

Niranjan_k · December 5, 2023, 11:54am

Hi All,

Data scraping is not reading the full table when the table spans multiple pages of a PDF. How can I use a starting name and an ending name as reference points to extract the table details from the PDF? Please suggest.
Regards
Niranjan

Dilli_Reddy · December 5, 2023, 1:06pm

@Niranjan_k

Try this workflow:

Read PDF Text activity (output: pdfText)
Assign startKeyword = "Start of Table"
Assign endKeyword = "End of Table"

Assign startIndex = pdfText.IndexOf(startKeyword)
Assign endIndex = pdfText.IndexOf(endKeyword)
' IndexOf returns -1 when a keyword is missing, so check both before calling Substring
If startIndex >= 0 AndAlso endIndex > startIndex
    ' Start after the keyword so the keyword itself does not become a row
    Assign tableData = pdfText.Substring(startIndex + startKeyword.Length, endIndex - startIndex - startKeyword.Length)

Build DataTable activity (output: extractedTable)
Add Data Column activities for each column in the table

Assign rows = tableData.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For Each row In rows
    Assign columns = row.Split(","c)  ' Assuming the data is comma-separated
    Add Data Row activity (Array: columns) to extractedTable
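
For anyone who wants to test the slicing logic outside Studio first, here is a minimal Python sketch of the same idea. The function names (slice_between, to_rows) and the sample string are made up for illustration, and the comma delimiter is the same assumption as above; text extracted from a real PDF will often be space-separated instead.

```python
def slice_between(text, start_keyword, end_keyword):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_keyword)
    if start == -1:
        return None
    start += len(start_keyword)          # skip past the start marker itself
    end = text.find(end_keyword, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end]


def to_rows(table_text, delimiter=","):
    """Split the slice into non-empty lines, then each line into trimmed cells."""
    rows = []
    for line in table_text.splitlines():
        if line.strip():
            rows.append([cell.strip() for cell in line.split(delimiter)])
    return rows


# Hypothetical sample standing in for the output of Read PDF Text
sample = "intro\nStart of Table\nA,1\nB,2\nEnd of Table\nfooter"
print(to_rows(slice_between(sample, "Start of Table", "End of Table")))
```

Returning None when a marker is missing makes the failure explicit instead of producing a wrong slice.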

Cheers…!

Niranjan_k · December 5, 2023, 1:55pm

@Dilli_Reddy could you please share the workflow? I'm not sure where to apply this logic. I just want to know which activities we need to use to build it.

copy_writes · December 5, 2023, 2:04pm

UiPath.PDF.Activities package

Create a Python Script:

Write a Python script that takes a PDF file as input, extracts text data, and outputs relevant information based on your starting and ending name references. You can use a Python library like PyPDF2 or pdfplumber for this task.

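import pdfplumber

def extract_table_details(pdf_path, start_name, end_name):
    """Return the text between start_name and end_name, searching across all pages."""
    with pdfplumber.open(pdf_path) as pdf:
        # Join every page first so a table that spans pages is searched as one string.
        # "or ''" guards against pages where no text could be extracted.
        full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    start_index = full_text.find(start_name)
    if start_index == -1:
        return ""
    # Look for the end marker only after the start marker
    end_index = full_text.find(end_name, start_index + len(start_name))
    if end_index == -1:
        return ""
    return full_text[start_index:end_index]

# Example Usage
pdf_path = "path/to/your/pdf/file.pdf"
start_name = "Table Start"
end_name = "Table End"
result = extract_table_details(pdf_path, start_name, end_name)
print(result)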
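
If the table repeats its header row on every page, the rows collected page by page need to be merged with the header kept only once. Below is a rough sketch of that merge step, assuming each page has already been turned into a list of rows (for example with pdfplumber's page.extract_tables(), taking one table per page). The function name merge_page_tables and the sample rows are hypothetical.

```python
def merge_page_tables(pages):
    """Combine per-page row lists into one table, keeping the header row once."""
    merged = []
    header = None
    for rows in pages:
        for row in rows:
            if header is None:
                # The first row seen is treated as the header (an assumption)
                header = row
                merged.append(row)
            elif row == header:
                # Skip the header when it is repeated on a later page
                continue
            else:
                merged.append(row)
    return merged


# Made-up rows standing in for two pages of the same table
page_1 = [["Date", "Amount"], ["2023-12-01", "10"]]
page_2 = [["Date", "Amount"], ["2023-12-02", "20"]]
print(merge_page_tables([page_1, page_2]))
```

This only drops rows that match the header exactly; if the header text differs slightly between pages, the comparison would need to be loosened.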

Invoke Python Script from UiPath:

Use the “Invoke Python Method” activity in UiPath to execute your Python script. Pass the PDF file path, start name, and end name as arguments.

Process the Result in UiPath:

Receive the result from the Python script in UiPath and further process or save the extracted table details as needed.
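
One possible way to make the result easier to handle on the UiPath side is to have the script return a single JSON string rather than a nested Python list. This is only a sketch under that assumption; the function name table_text_to_json is made up, and the comma delimiter depends on how your PDF text actually comes out.

```python
import json


def table_text_to_json(table_text, delimiter=","):
    """Turn extracted table text into a JSON string holding a list of rows."""
    rows = [
        [cell.strip() for cell in line.split(delimiter)]
        for line in table_text.splitlines()
        if line.strip()  # ignore blank lines
    ]
    return json.dumps(rows)


# Hypothetical extracted text
print(table_text_to_json("A,1\nB,2\n"))
```

The workflow can then deserialize the string and add one data row per inner list.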

Niranjan_k · December 5, 2023, 2:56pm

@ this logic is not working for me. I’m getting an Excel Application Scope error every time.

copy_writes · December 6, 2023, 4:32am

Can you please share the sample data?

Niranjan_k · December 6, 2023, 11:10am

@copy_writes Sorry, I don’t have upload access to share a file; I’m creating this request from a personal system. The PDF contains a table that spans multiple pages, and I want to extract the full table data based on the start date and end date.
