Extracting Datatable in a PDF

devasaiprasad_K · July 27, 2024, 5:03pm

Hi Team,

I am trying to extract readable data from a PDF that is in table format, but after extraction the data does not come out in a table format.

I am sure the Table Extraction activity will not work here. Can you please suggest how I can achieve this?

vrdabberu · July 27, 2024, 5:07pm

Hi @devasaiprasad_K

Can you share a screenshot of how the PDF looks?

Regards

ashokkarale · July 28, 2024, 3:35am

@devasaiprasad_K,

I would suggest using Document Understanding for this. It would be more reliable and accurate.

Thanks,
Ashok

sandyarpa767 · July 28, 2024, 6:32am

Hi @devasaiprasad_K ,

Going with Document Understanding (DU) is definitely suggested, as the document type you are trying to extract is semi-structured. If it were a form with predefined fields whose positions do not change, you could use regex, but I believe the table will be dynamic.
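
For the fixed-form case mentioned above, a regex approach could look roughly like the sketch below. It is written in Python only for illustration; the sample text, field labels, and the `extract_fields` helper are all assumptions made up for this example, not anything taken from the actual PDF.

```python
import re

# Hypothetical text, standing in for whatever a PDF text-extraction step returns
sample_text = """Invoice Number: INV-1001
Invoice Date: 27/07/2024
Total Amount: 1,250.00
"""

# One pattern per predefined field; the labels are illustrative assumptions
FIELD_PATTERNS = {
    "invoice_number": r"Invoice Number:\s*(\S+)",
    "invoice_date": r"Invoice Date:\s*(\d{2}/\d{2}/\d{4})",
    "total_amount": r"Total Amount:\s*([\d,]+\.\d{2})",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of field name -> captured value (None when not found)."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(sample_text)
print(fields)
```

This only holds up while the labels and layout stay fixed. A table whose row count changes from document to document is where it stops being practical.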

Anil_G · July 28, 2024, 8:35pm

@devasaiprasad_K

1. Try using form extractors first. If the table structure is constant, that can solve the extraction easily.
2. If 1 does not work, then it is better to use Document Understanding to extract the table.

Cheers

singh_sumit · July 29, 2024, 7:34am

Hi @devasaiprasad_K, for that you can use Document Understanding together with the form extractor. With it you can extract your data in a table format, and it is an easy and very fast way to extract table-format data.
Cheers, happy automation.

rmorgan · July 29, 2024, 8:37am

I have done this with Python in the past, using the tabula-py library:

from tabula import read_pdf
import pandas as pd
import os

# Path to your PDF file
pdf_path = 'INPUT'

# Define your static output folder
output_folder = 'OUTPUTFOLDER'

# Check if the output folder exists, create it if it doesn't
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Use read_pdf function with pages="all" to extract tables from all pages
tables = read_pdf(pdf_path, pages="all", multiple_tables=True)

# Iterate over each table (DataFrame) and process or save it
for i, table in enumerate(tables):
    # Define the path for each output CSV file within the output folder
    output_path = os.path.join(output_folder, f"FILENAME_{i+1}.csv")
    # Save each table to a CSV file in the specified output folder
    table.to_csv(output_path, index=False)
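
A possible follow-up to the script above: a table that runs over several pages can come back as several separate DataFrames. The sketch below is one way to stitch such pieces together before saving. It assumes the pieces carry identical column headers, which depends on how each page was parsed; `merge_tables_by_header` is a hypothetical helper, and the small DataFrames only stand in for the `tables` list returned by `read_pdf`.

```python
import pandas as pd


def merge_tables_by_header(tables):
    """Concatenate DataFrames that share identical column names, keeping order."""
    grouped = {}
    for table in tables:
        # Drop rows and columns that are entirely empty
        cleaned = table.dropna(how="all").dropna(axis=1, how="all")
        if cleaned.empty:
            continue
        # Tables with the same header tuple are treated as one logical table
        key = tuple(str(col) for col in cleaned.columns)
        grouped.setdefault(key, []).append(cleaned)
    return [pd.concat(parts, ignore_index=True) for parts in grouped.values()]


# Stand-in data instead of real read_pdf output
page1 = pd.DataFrame({"Item": ["A", "B"], "Qty": [1, 2]})
page2 = pd.DataFrame({"Item": ["C"], "Qty": [3]})
other = pd.DataFrame({"Name": ["X"]})

merged = merge_tables_by_header([page1, page2, other])
print(len(merged))
print(merged[0])
```

If a continuation page has no header row, its first data row may end up as the column names. In that case the headers will not match and the piece stays a separate table.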

system · August 4, 2024, 5:00pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
