I am trying to extract data from tables in PDF files using Extract Tables and Selectors in UiAutomation. The Test Selection shows the correct data, but the final result has missing values, misspelled words, and words in the wrong position.
Overall, the extraction is inconsistent when we process multiple files in a folder.
Try this Python script to extract the values from the PDF. I hope it works; let me know how it goes.
import re

import pandas as pd
from pdfminer.high_level import extract_text


def extract_table_from_pdf(pdf_path):
    # Extract raw text from PDF
    raw_text = extract_text(pdf_path)

    # Use regex to find table patterns (customize based on your table structure)
    table_pattern = re.compile(r"your_table_regex_pattern")
    tables = table_pattern.findall(raw_text)

    # Parse and clean table data
    table_data = []
    for table in tables:
        rows = table.split('\n')
        for row in rows:
            cells = row.split()  # or use a more sophisticated splitter
            table_data.append(cells)

    # Convert to DataFrame for easier manipulation
    df = pd.DataFrame(table_data)

    # Perform data cleaning and validation
    # Example: drop empty columns, handle missing values, etc.
    df.dropna(how='all', axis=1, inplace=True)
    df.fillna('N/A', inplace=True)
    return df
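
Since the question is about a whole folder of files, and the script above only hints at "a more sophisticated splitter", here is a rough sketch of both pieces. The names `split_cells` and `extract_folder` are my own (hypothetical), and splitting on runs of two or more spaces is an assumption about how your columns are separated in the extracted text, so adjust it to your layout.

```python
import re
from pathlib import Path


def split_cells(row):
    """Split one text row into cells on runs of two or more whitespace characters.

    Assumption: columns are separated by at least two spaces, so a cell such as
    "Item A" (single space inside) stays in one piece.
    """
    row = row.strip()
    return re.split(r"\s{2,}", row) if row else []


def extract_folder(folder, extractor):
    """Run `extractor` on every *.pdf file in `folder`.

    Returns two dicts keyed by file name: successful results and error messages,
    so one bad file does not stop the whole batch.
    """
    results, failures = {}, {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = extractor(str(pdf_path))
        except Exception as exc:  # record the failure and continue with the next file
            failures[pdf_path.name] = str(exc)
    return results, failures
```

You could then call something like `results, failures = extract_folder("invoices", extract_table_from_pdf)`, where `"invoices"` is just a placeholder folder name, and check `failures` to see which files need a different pattern.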