Need Help with OCR-Based Medication List Extraction from Variable PDF Formats

I am currently working on a project that involves extracting a medication list from a PDF file using OCR technology.

The challenge is that the number of pages in the PDF varies, ranging from 6 to 14 or more. The medication list may start on any page and often continues onto one or two subsequent pages. The list is typically presented as a table, which adds to the complexity of the extraction.

I would greatly appreciate any guidance or suggestions from the forum on how to effectively handle these challenges. If anyone has experience dealing with similar scenarios or can recommend best practices or tools, please share your insights.

Thank you!

@jai_kumar2

If you are on the latest version, try the DocPath model. It is a generative AI model that can work on dynamic PDFs.

Alternatively, identify the table's column names as keywords, then loop through each page to find which pages contain those keywords and separate out those pages.

Then you can use Form Extractor if the table structure is the same across documents.

Otherwise, train a model to extract those tables, feed it only the pages that contain those keywords, and get the data.
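
To illustrate the keyword-based page filtering, here is a rough Python sketch. It assumes the OCR text of each page is already available as a list of strings. The function name `find_table_pages`, the column names, and the `min_hits` and `trailing_pages` parameters are all made up for this example, so adjust them to your documents. `trailing_pages` is one possible way to pick up continuation pages that do not repeat the table header.

```python
import re

def find_table_pages(page_texts, keywords, min_hits=2, trailing_pages=0):
    """Return sorted 0-based indices of pages likely to hold the table.

    page_texts: OCR text of each page, in page order.
    keywords: expected column names (hypothetical values, adapt as needed).
    min_hits: how many distinct keywords a page must contain to count.
    trailing_pages: extra pages to keep after each match, for tables
        that continue without repeating the header.
    """
    selected = set()
    for index, text in enumerate(page_texts):
        # Collapse whitespace so OCR line breaks do not hide keywords.
        normalized = re.sub(r"\s+", " ", text).lower()
        hits = sum(1 for kw in keywords if kw.lower() in normalized)
        if hits >= min_hits:
            last = min(index + trailing_pages, len(page_texts) - 1)
            selected.update(range(index, last + 1))
    return sorted(selected)

# Toy input standing in for real OCR output.
pages = [
    "Patient summary and demographics",
    "Medication   Dose  Frequency\nAspirin 81 mg daily",
    "Metformin 500 mg twice daily",
    "Discharge instructions",
]
table_pages = find_table_pages(
    pages, ["Medication", "Dose", "Frequency"], trailing_pages=1
)
print(table_pages)
```

Only the pages returned here would then be passed to the extractor or the trained model. Plain substring matching is a simplification; noisy OCR may need fuzzy matching instead.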

Cheers