Extraction of table data from PDF

Hello all,
I am trying to extract table data from a PDF and write it to Excel. The “Generate Data Table From Text” activity is not working for me because the text layout is very complex. Can anyone suggest another method, or a direct activity, for extracting the table from the PDF? It is very urgent.

Hi @Amit_Kumar_Charde

If it is a scanned document, use the Read PDF with OCR activity; otherwise, use the Read PDF Text activity.

  1. Drag and drop the “Read PDF with OCR” activity into your workflow.
  2. Configure the activity by specifying the input PDF file path and selecting the OCR engine (e.g., Google OCR, Microsoft OCR, or Abbyy OCR).
  3. Use the output variable of the “Read PDF with OCR” activity, let’s call it pdfText, which contains the extracted text from the PDF.
  4. Apply text manipulation techniques, such as string splitting or regular expressions, to extract the table data from the pdfText variable.
  5. Construct a DataTable to hold the extracted table data.
  6. Iterate through the extracted data and populate the DataTable.
  7. Use the “Write Range” activity to write the DataTable to an Excel file.
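Steps 4 to 7 can be sketched outside UiPath to show the logic. This is only an illustration in Python, not a UiPath workflow: the sample text, the `extract_rows` and `rows_to_csv` helpers, and the assumption that columns are separated by two or more spaces are all hypothetical. It writes CSV text (which Excel can open) in place of the “Write Range” activity.

```python
import csv
import io
import re

# Hypothetical sample of text as it might come back from a PDF read step;
# real output will differ and may need a different pattern.
pdf_text = """Invoice summary
Item        Qty   Price
Widget A    2     10.50
Widget B    12    3.25
Thank you for your business
"""

# Assumption: columns are separated by runs of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def extract_rows(text, expected_columns):
    """Return the lines that split into exactly `expected_columns` cells."""
    rows = []
    for line in text.splitlines():
        cells = COLUMN_GAP.split(line.strip())
        if len(cells) == expected_columns:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


table = extract_rows(pdf_text, expected_columns=3)
csv_text = rows_to_csv(table)
print(csv_text)
```

Filtering on the expected column count is a simple way to drop headings and footers around the table. It will not work if cells themselves contain wide gaps or if the extracted text does not preserve column spacing.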

I hope this helps!

My text has a very complex layout, so string manipulation does not work here. I have tried it many times.

@Amit_Kumar_Charde

Can you provide a sample PDF so we can see what it looks like? Then we can suggest how to handle it.
You could also try Document Understanding.

@Amit_Kumar_Charde

You can try using Form Extractor or Document Understanding for this.

Alternatively, check whether you can open the PDF using the Word activities. If so, the table can be extracted from Word instead.

Cheers

Okay, sure, I will try this.

Hi @Amit_Kumar_Charde ,

We cannot help effectively if the details are vague. Please let us know what you mean by complex, whether the format/template of the PDF will vary, and whether the PDF will always be digital, always scanned, or a mixture of both.

These details would help us provide suggestions that are more specific to your case.

The PDF is digital, but after it is converted to text it appears very jumbled; that is, the data runs together and overlaps.
