I have a PDF file containing data in table form, and I want to extract each line from the PDF table.
How can I train on this PDF using Load Taxonomy?
Please check the thread below for your reference.
I have already checked the given link, but I didn’t find the right solution for what I asked. Could you suggest a solution?
I don’t think you need the Taxonomy Manager for this. You can read the PDF file directly and use string manipulation functions to do this.
Fine, we can do one of two things:
- If we have a licensed version of Adobe, we can export the PDF to a Word document and use the Microsoft.Office.Interop.Word namespace to get the table details very easily.
or
- If we don’t have the Adobe application, we can read the PDF with Read PDF Text or Read PDF With OCR and store the output in a String variable.
Then, using regex or the Split method, we can get the value we want.
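To illustrate the regex/Split idea outside UiPath, here is a minimal Python sketch. It assumes the text has already been read into a string and that the cells in each row are separated by two or more spaces. The sample text, the column names, and the `split_rows` helper are all made up for illustration; real PDF output may look quite different.

```python
import re

# Hypothetical sample of what a PDF text-reading step might return;
# the real output depends on the PDF layout.
sample_text = """Item        Qty    Price
Widget A    2      10.50
Widget B    13     7.25
"""

def split_rows(text):
    """Split text into rows, then split each row into cells on runs of 2+ whitespace."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(re.split(r"\s{2,}", line))
    return rows

rows = split_rows(sample_text)
header, data = rows[0], rows[1:]

# Pair each data row with the header to get one dict per table row.
records = [dict(zip(header, row)) for row in data]
```

Splitting on two or more spaces keeps values such as "Widget A" in one cell, but it only works while the column spacing stays consistent across rows.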
Cheers @abdul0811
Yes, but the PDF table format changes with each record, so string manipulation (a regex pattern) doesn’t give me a proper set of data.
Hello @abdul0811
The taxonomy manager is only a way to define the metadata (fields, information) that you are looking for in a particular file.
To actually extract the information, you need to use some sort of data extractor.
If you need ALL lines from the file, you can try the Read PDF Text activity and just grab the lines.
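As a rough sketch of "just grab the lines", assuming the extracted text is already available as a single string (shown in Python rather than a UiPath workflow; `extract_lines` and the sample string are hypothetical):

```python
def extract_lines(text):
    """Return the non-empty lines of the text, with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Made-up sample with Windows line endings, a blank line, and stray spaces.
sample = "Name  Amount\r\n\r\nAlice  10\r\n  Bob  20  \r\n"
lines = extract_lines(sample)
```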
If you need specific information from those files, think about:
- how similar / different are the files that need to be processed?
- is the information always in the same place?
- is this an invoice or receipt use case?
Based on the answers (and ideally a sample file) we could help in finding some solutions…
Thanks,
Ioana