Brainstorming Solutions for Editing Data in PDFs

I have a set of PDF files that each contain two or three tables, and I want to extract those tables and store them in Excel. Unfortunately, the available activities, such as OCR and Read PDF Text, don’t produce the desired output.

Is there any way to capture tables inside PDFs (structured or unstructured) and store them in Excel?
Help would be appreciated, and thanks in advance!
P.S. I’ve tried using EpsilonAI.Activities, but that didn’t work either. If there are any other activities that would help with this, please mention them.

@ashwin.ashok - Can you try this?

If it doesn’t work out, I would suggest trying Document Understanding (DU). It will work. If you have a sample PDF, can you please share it (after redacting)? I have a DU workflow, so I can try it here in parallel.

Hi @prasath17, I have tried it with the EpsilonAI package, but the tables aren’t getting recorded. I’ll include the same PDF in this comment. Sample.pdf (65.3 KB)

Hi @ashwin.ashok,

If we assume the tables in your PDF follow a standard pattern once the text is extracted, then there are two possible approaches (the CSV format is the savior in both):

Approach 1: Using only PDF activities

Suggested workflow: Main.xaml (12.4 KB)
Results first saved to temp.csv
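
For anyone who wants to see the idea outside of UiPath, here is a minimal Python sketch of the "extracted text to CSV" step. It is not the attached workflow. It assumes the text pulled from the PDF keeps each table row on its own line with columns separated by two or more spaces; the function names and the sample text are made up for illustration.

```python
import csv
import io
import re


def rows_from_text(text):
    """Split each non-empty line into cells on runs of two or more whitespace characters.

    Assumption: single spaces only occur inside a cell (e.g. "Unit Price"),
    while column gaps are at least two spaces wide.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted text, not taken from Sample.pdf
sample_text = "Item    Qty   Unit Price\nWidget A    2   10.50\n\nWidget B    1   7.00\n"
csv_text = rows_to_csv(rows_from_text(sample_text))
print(csv_text)
```

The resulting string can then be written to temp.csv and read back as a table. If your PDF separates columns differently, the split pattern is the part to adjust.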

Approach 2: Open the PDF in Word and extract the specific tables from Word
Yes, you can open PDF files in Word. Some PDFs won’t convert well and will lose formatting in Word, but most structured ones will work.

  1. Open the PDF in Word (word.exe)
  2. Manipulate / convert the read text to CSV format (Hurdle! Multi-level headers and multiple values in a single row will lose formatting)
  3. Handle the formatting
  4. Write the resulting CSV text string to a temp.csv
  5. Read the CSV and
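
The formatting hurdle in steps 2 and 3 can be sketched in Python as well. This is only one possible way to handle it, not what the attached workflow does. It assumes the table has exactly two header rows, that a blank cell in the top header row belongs to the group on its left, and that short data rows should be padded with empty cells. The function names and sample data are hypothetical.

```python
import csv
import io


def flatten_headers(top, bottom):
    """Merge two header rows into one.

    Assumption: a blank cell in the top row inherits the nearest
    non-blank group name to its left.
    """
    width = max(len(top), len(bottom))
    top = top + [""] * (width - len(top))
    bottom = bottom + [""] * (width - len(bottom))
    merged, group = [], ""
    for upper, lower in zip(top, bottom):
        upper, lower = upper.strip(), lower.strip()
        if upper:
            group = upper
        merged.append(" ".join(part for part in (group, lower) if part))
    return merged


def read_table(csv_text):
    """Read CSV text whose first two rows are headers; pad short data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = flatten_headers(rows[0], rows[1])
    body = [row + [""] * (len(header) - len(row)) for row in rows[2:]]
    return header, body


# Hypothetical CSV text with a two-level header and one short row
sample_csv = "Name,Sales,,Total\n,Q1,Q2,\nAlice,10,12,22\nBob,7,9\n"
header, body = read_table(sample_csv)
print(header)
print(body)
```

Flattening the header first keeps every column name unique, which makes the later "read the CSV" step much less fragile.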

A major part of this solution comes from: How to read table in a Word document, thanks to @vvaidya

Workflow from the above link (slight modifications): wordTables.xaml (9.6 KB)
Results first saved to wordCsv.csv

Hope this helps!