I have a set of PDF files that contain two or three tables each, and I want to extract the tables and store them in Excel. Unfortunately, the available activities, such as OCR and Read PDF Text, don’t produce the desired output.
Is there any way to capture tables inside PDFs (structured or unstructured) and store them in Excel?
Help would be appreciated, and thanks in advance!
P.S. I’ve tried using EpsilonAI.Activities, but that didn’t work either. If there are any other activities that would help with this, please mention them.
If it doesn’t work out, I would suggest trying Document Understanding (DU), which should work for this. If you have a sample PDF, can you please share it (after redacting)? I have a DU workflow, so I can try it here in parallel.
Hi @prasath17, I have tried it with the EpsilonAI package, but the tables aren’t being captured. I’ll include the PDF in this comment.
Sample.pdf (65.3 KB)
If we assume the tables in your PDF follow a standard pattern once the text is extracted, there are two possible approaches (CSV format is the savior in both):
Approach 1: Using only PDF activities
Suggested workflow: Main.xaml (12.4 KB)
Results first saved to temp.csv
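Outside UiPath, the text-to-CSV step in Approach 1 can be sketched in a few lines of Python. This is only an illustration, not what Main.xaml does: the function names `text_to_rows` and `rows_to_csv` are hypothetical, and it assumes each table row comes out of the PDF as one line of text with columns separated by two or more spaces. Real extracted text may need a different pattern.

```python
import csv
import io
import re

# Assumption: each table row arrives as one line of text, with columns
# separated by two or more spaces. Adjust the pattern to your own PDFs.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(text):
    """Split extracted text into rows of cells, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes cells when needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample text standing in for the output of a PDF text extraction.
sample = """
Item        Qty   Unit Price
Widget, A   2     10.50
Gadget      1     99.00
"""
csv_text = rows_to_csv(text_to_rows(sample))
print(csv_text)
```

Going through the csv module rather than joining cells with commas by hand means a cell that itself contains a comma (like "Widget, A") is quoted and still opens correctly in Excel.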
Approach 2: Open the PDF in Word and extract the specific tables from Word
Yes, you can open PDF files in Word. Some PDFs won’t convert well and will lose formatting in Word, but most structured ones will work.
Read PDF in word.exe
Manipulate / convert the read text to CSV format (Hurdle! Multi-level headers and multiple values in a single row will lose formatting)
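One possible way around the multi-level header hurdle is to flatten the header rows into a single row before writing the CSV. The Python sketch below is an assumption about how that could be done, not part of either workflow: `flatten_headers` is a hypothetical helper, and it assumes an empty cell in the top header row belongs to the merged header cell to its left.

```python
def flatten_headers(top, bottom, sep=" / "):
    """Merge two header rows into one.

    Assumption: an empty cell in the top row is the continuation of a
    merged header cell to its left, so it inherits that cell's text.
    """
    flat = []
    current = ""
    for upper, lower in zip(top, bottom):
        if upper.strip():
            current = upper.strip()
        lower = lower.strip()
        if current and lower:
            flat.append(current + sep + lower)
        else:
            flat.append(current or lower)
    return flat


# Made-up two-level header: "Sales" spans the Q1 and Q2 columns.
top = ["Name", "Sales", "", "Region"]
bottom = ["", "Q1", "Q2", ""]
header = flatten_headers(top, bottom)
print(header)
```

The flattened row can then be written as the first CSV line, so each column keeps a unique name even though the original two-level layout is lost.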