Hello, I need some ideas on how to extract data from a badly designed PDF. I’ve tried Python, GenAI (image analysis), Power Query in Excel, etc., but all of them failed to place the values in the correct bucket column. Any ideas on how I can get this done would be really appreciated.
Hi @Lynn_Song
This PDF has no real table structure. Table extractors will fail.
Instead, extract the text/words together with their coordinates and map each value to a column based on its X-position (or use OCR with fixed zones per column). This position-based logic is generally the most reliable approach for badly designed PDFs.
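To make the X-position idea concrete, here is a minimal Python sketch. It assumes you already have each word as a dict with `text`, `x0`, `x1` and `top` keys (pdfplumber's `extract_words()` returns dicts of this shape, but any extractor that gives coordinates would do). The column boundaries and column names below are hypothetical placeholders, so you would need to measure the real ones from your own PDF.

```python
from bisect import bisect_right

# Hypothetical layout: left edge (in PDF points) where each column starts.
# Measure these from your own document; they are NOT taken from the PDF in question.
COLUMN_STARTS = [0, 150, 300, 450]
COLUMN_NAMES = ["name", "bucket_1", "bucket_2", "bucket_3"]


def assign_column(x_center, starts=COLUMN_STARTS):
    """Return the index of the column whose X-range contains x_center."""
    return max(bisect_right(starts, x_center) - 1, 0)


def words_to_rows(words, row_tolerance=3.0):
    """Group words into rows by vertical position, then bucket them by X-position."""
    # Step 1: words whose 'top' is within row_tolerance of the row's first word share a row.
    lines = []
    for w in sorted(words, key=lambda w: w["top"]):
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= row_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])

    # Step 2: inside each row, place every word by the horizontal centre of its box.
    table = []
    for line in lines:
        cells = {name: [] for name in COLUMN_NAMES}
        for w in sorted(line, key=lambda w: w["x0"]):
            col = assign_column((w["x0"] + w["x1"]) / 2)
            cells[COLUMN_NAMES[col]].append(w["text"])
        table.append({name: " ".join(parts) for name, parts in cells.items()})
    return table


# Made-up sample words, only to show the shape of the input.
sample_words = [
    {"text": "Acme", "x0": 10, "x1": 40, "top": 100.0},
    {"text": "Ltd", "x0": 45, "x1": 60, "top": 100.5},
    {"text": "120.00", "x0": 320, "x1": 360, "top": 101.0},
    {"text": "Beta", "x0": 10, "x1": 40, "top": 120.0},
    {"text": "75.50", "x0": 170, "x1": 200, "top": 120.0},
]
rows = words_to_rows(sample_words)
```

With the sample data, `120.00` lands in `bucket_2` and `75.50` in `bucket_1`, and the empty buckets stay empty instead of the values shifting left. Using the centre of each word rather than its left edge is a design choice that tends to be more forgiving for right-aligned numbers.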
Hi @Lynn_Song
Try using Document Understanding with anchored templates or the Form Extractor, because they extract fields based on anchors instead of layout. If the layout varies, use the Intelligent Form Extractor with a few samples. OCR plus the Regex Extractor works only for simple, fixed fields.
Happy Automation
