Extracting a DataTable from a PDF

Hi Team,

I am trying to extract readable data from a PDF in which the data is laid out as a table, but after extraction the data does not come out in table format.

I am sure the Table Extraction activity will not work here. Can you please suggest how I can achieve this?

Hi @devasaiprasad_K

Can you share a screenshot of how the PDF looks?

Regards

@devasaiprasad_K,

I would suggest using Document Understanding for this. It would be more reliable and accurate.

Thanks,
Ashok :slightly_smiling_face:

Hi @devasaiprasad_K ,

It is definitely advisable to go with Document Understanding (DU), as the document type you are trying to extract is semi-structured. If it were a form with predefined fields whose positions never change, you could use regex, but I believe the table will be dynamic.
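
For the fixed-form case mentioned above, a minimal regex sketch might look like the following. The sample text, the field labels and the names `FIELD_PATTERNS` and `extract_fields` are all hypothetical and only for illustration; it assumes the PDF text has already been read into a string and that the labels never change.

```python
import re

# Hypothetical text, as it might come back from a plain-text read of a
# fixed-layout form. Replace it with the text extracted from your PDF.
SAMPLE_TEXT = """Invoice No: INV-1001
Invoice Date: 2024-03-15
Total Amount: 1,250.50
"""

# Hypothetical field labels; adjust them to the labels printed on your form.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No:\s*(\S+)"),
    "invoice_date": re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total_amount": re.compile(r"Total Amount:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None if not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
print(fields)
```

This only holds up while the labels and value formats stay fixed, which is why it is unlikely to be enough for a table whose rows vary.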

@devasaiprasad_K

  1. Try the form extractor first. If the table structure is constant, it can solve the extraction easily.
  2. If option 1 does not work, it is better to use Document Understanding to extract the table.

Cheers

Hi @devasaiprasad_K, for that you can use Document Understanding together with the form extractor. With Document Understanding you can extract your data in table format, and it is an easy and fast way to extract tabular data.
Cheers, happy automation!

I have done this using Python in the past.

from tabula import read_pdf
import pandas as pd
import os

# Path to your PDF file
pdf_path = 'INPUT'

# Define your static output folder
output_folder = 'OUTPUTFOLDER'

# Create the output folder if it doesn't already exist
os.makedirs(output_folder, exist_ok=True)

# Use read_pdf function with pages="all" to extract tables from all pages
tables = read_pdf(pdf_path, pages="all", multiple_tables=True)

# Iterate over each table (DataFrame) and process or save it
for i, table in enumerate(tables):
    # Define the path for each output CSV file within the output folder
    output_path = os.path.join(output_folder, f"FILENAME_{i+1}.csv")
    # Save each table to a CSV file in the specified output folder
    table.to_csv(output_path, index=False)
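
If only the plain extracted text is available (the situation described in the original question), a rough fallback is to split each line into columns on runs of two or more spaces. This is just a sketch under that assumption: the sample text and the helper names `text_to_rows` and `rows_to_csv` are hypothetical, and the split will misalign rows when a cell is empty or itself contains double spaces.

```python
import csv
import io
import re

# Hypothetical extracted text in which columns are separated by
# at least two spaces. Replace it with the text read from your PDF.
SAMPLE_TEXT = """Item        Qty    Price
Widget      2      10.00
Gadget      1      25.50
"""


def text_to_rows(text):
    """Split each non-empty line into cells on runs of 2+ whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_to_rows(SAMPLE_TEXT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The resulting CSV text can then be written to a file and loaded into a DataTable in the same way as the CSV files produced by the tabula script.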
