Our use case requires extracting tables from PDF input files and collating them into a single table in an Excel workbook.
To do this, we export each PDF page that contains a table to a Word document, then copy the table (by table index) from the Word document into Excel.
The problem occurs during the PDF-to-Word export: some of the text is extracted incorrectly, i.e. the actual text is replaced with unwanted text.
Code we used to export the PDF to Word:
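// Requires a reference to Microsoft.Office.Interop.Word and:
// using System; using Microsoft.Office.Interop.Word;
Microsoft.Office.Interop.Word.Application wordApplication = null;
try
{
    wordApplication = new Microsoft.Office.Interop.Word.Application();
    // Opening the PDF makes Word convert it to an editable document
    var myDocument = wordApplication.Documents.Open(in_PdfFilePath);
    myDocument.SaveAs2(in_WordFilePath, WdSaveFormat.wdFormatDocumentDefault);
    myDocument.Close();
}
catch (Exception e)
{
    // Source and TargetSite can be null; concatenation handles that safely
    Console.WriteLine(e.Source + " " + e.Message + " " + e.TargetSite);
}
finally
{
    // Quit Word even if the export fails, so no WINWORD process is left running
    if (wordApplication != null) wordApplication.Quit();
}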
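
The second step described above (copying a table from the Word document into Excel by table index) could be sketched roughly as follows. This is only a minimal illustration under assumptions, not the exact code in use: the class name TableCopy, the helpers CleanCellText and CopyTable, and their parameters are hypothetical, and the sketch writes cell text directly into the worksheet rather than going through the clipboard.

```csharp
using System;
using Excel = Microsoft.Office.Interop.Excel;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper class, for illustration only.
public static class TableCopy
{
    // Word ends the text of every table cell with an end-of-cell marker
    // ("\r" followed by "\a"); strip it before writing the value to Excel.
    public static string CleanCellText(string raw)
    {
        return raw.TrimEnd('\r', '\a');
    }

    // Copies the text of one Word table into a worksheet, placing the
    // table's first row at startRow. tableIndex is 1-based, as in the
    // Word object model. Returns the first free row below the copied table.
    public static int CopyTable(Word.Document document, int tableIndex,
                                Excel.Worksheet sheet, int startRow)
    {
        Word.Table table = document.Tables[tableIndex];
        int lastRow = startRow - 1;

        // Enumerating Range.Cells visits only the cells that actually exist,
        // which avoids errors on tables that contain merged cells.
        foreach (Word.Cell cell in table.Range.Cells)
        {
            int targetRow = startRow + cell.RowIndex - 1;
            var target = (Excel.Range)sheet.Cells[targetRow, cell.ColumnIndex];
            target.Value2 = CleanCellText(cell.Range.Text);
            lastRow = Math.Max(lastRow, targetRow);
        }

        return lastRow + 1;
    }
}
```

Calling CopyTable once per exported document, and passing the returned row as the next startRow, would stack the tables into one sheet. Reading cell.Range.Text in this way also makes it easy to log the converted text and check at which stage the wrong characters first appear.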