Hi Team,
I am trying to extract data from tables in PDF files using Extract Tables and Selectors in UiAutomation. Test Selection shows the correct data, but the final result has missing values, spelling errors, and words that change position.
Overall, the extraction is inconsistent when we process multiple files in a folder.
The tables are somewhat complex with variations.
Help me resolve this.
Thanks,
Sunita
Hi @Ananta_Sunita
Try using CV activities for table extraction.
You may get better results.
Hope this helps.
Hi @Ananta_Sunita
Try this Python script to extract the values from the PDF. I hope it works; let me know.
import re

import pandas as pd
from pdfminer.high_level import extract_text


def extract_table_from_pdf(pdf_path):
    # Extract raw text from the PDF
    raw_text = extract_text(pdf_path)

    # Use a regex to find table patterns (customize based on your table structure)
    table_pattern = re.compile(r"your_table_regex_pattern")
    tables = table_pattern.findall(raw_text)

    # Parse and clean the table data
    table_data = []
    for table in tables:
        rows = table.split('\n')
        for row in rows:
            cells = row.split()  # or use a more sophisticated splitter
            if cells:  # skip blank lines so they do not become empty rows
                table_data.append(cells)

    # Convert to a DataFrame for easier manipulation
    df = pd.DataFrame(table_data)

    # Perform data cleaning and validation
    # Example: drop empty columns, handle missing values, etc.
    df.dropna(how='all', axis=1, inplace=True)
    df.fillna('N/A', inplace=True)
    return df
# Example usage
pdf_path = 'path/to/your/pdf_file.pdf'
df = extract_table_from_pdf(pdf_path)
print(df)
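The script's comment mentions "a more sophisticated splitter". Here is a minimal sketch of one, assuming that columns in the extracted text are separated by two or more whitespace characters while words inside a cell are separated by a single space. Whether that holds depends on the layout of your PDFs, so check it against a sample of the extracted text first. The names `split_cells` and `pad_rows` are hypothetical helpers, not part of pdfminer or pandas.

```python
import re

# Assumption: a gap of 2+ whitespace characters marks a column boundary,
# while a single space separates words inside the same cell.
CELL_GAP = re.compile(r"\s{2,}")


def split_cells(row):
    """Split one line of extracted text into cells on wide gaps."""
    row = row.strip()
    if not row:
        return []
    return CELL_GAP.split(row)


def pad_rows(rows, fill="N/A"):
    """Pad ragged rows so that every row has the same number of cells."""
    width = max((len(r) for r in rows), default=0)
    return [r + [fill] * (width - len(r)) for r in rows]


if __name__ == "__main__":
    lines = ["Item Name   Qty    Unit Price", "Blue Widget   2"]
    print(pad_rows([split_cells(line) for line in lines]))
```

Using `split_cells(row)` in place of `row.split()` keeps multi-word cells such as "Item Name" together, which may help with words shifting position between columns.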
- Use the Invoke Python Method activity in UiAutomation to call your Python script.
- Pass the PDF path as an argument to the Python script.
- Retrieve the DataFrame and process it further within UiAutomation.
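For the last step, if the DataFrame cannot be handed back to the workflow directly (an assumption on my side; it depends on how the Python activities convert return values), one option is to return the table as a JSON string and parse it on the workflow side. A rough sketch using only the standard library follows; `rows_to_json` and the `col_0`, `col_1`, ... column names are hypothetical.

```python
import json


def rows_to_json(rows, header=None):
    """Serialize a list of rows (lists of cell strings) as a JSON array of records.

    If no header is given, generic column names col_0, col_1, ... are used.
    Short rows are padded with None; cells beyond the header are dropped.
    """
    if header is None:
        width = max((len(r) for r in rows), default=0)
        header = [f"col_{i}" for i in range(width)]
    records = []
    for row in rows:
        padded = list(row) + [None] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return json.dumps(records)


if __name__ == "__main__":
    print(rows_to_json([["Widget", "2"], ["Gadget"]], header=["name", "qty"]))
```

A plain string is usually the easiest value to pass between Python and a calling workflow, and the JSON can then be turned into a table there.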
Thanks