If I’m treating elements in rows as column headers, the data doesn’t consistently align with the same columns across different PDFs. For example, in some templates, ‘mn’ might be in the 3rd column, while in others, it could be in a different column. As a result, the extraction output is incorrect. I’ve trained the system with nearly 50 different PDF templates.
If I’m using ‘Elements’, ‘Maximum’, ‘Minimum’, and ‘Actual’ as column headers, the output appears as shown in the image. However, in some cells, it fails to extract the ‘-’ symbol.
Please provide a solution for accurately extracting data from PDFs.
Improving the scan resolution might help. It looks like the ‘-’ symbols are not being detected because the resolution is too low.
Do you have a static column? If at least the column names are static, that would be a better option, because that way each column will definitely be separated.
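To illustrate the static-column idea, here is a minimal Python sketch, not part of any extraction product. It assumes the raw table has already been read as a list of rows, with the element names in the first row and the labels ‘Maximum’, ‘Minimum’, and ‘Actual’ in the first cell of the other rows. The function name `wide_to_static` and the sample values are hypothetical. It looks up every value by its header name, never by its position, so ‘Mn’ ends up in the same place whichever column it sat in.

```python
# Static column names taken from the question above.
STATIC_COLUMNS = ["Elements", "Maximum", "Minimum", "Actual"]


def wide_to_static(table):
    """Turn a table whose first row lists the elements into one record
    per element, keyed by the static column names.

    `table` is a list of rows (lists of strings). This input shape is an
    assumption made for the sketch.
    """
    header, *body = table
    elements = [cell.strip() for cell in header[1:]]
    # Index the remaining rows by their label, so row order does not matter.
    by_label = {row[0].strip().lower(): row[1:] for row in body}

    records = []
    for i, element in enumerate(elements):
        record = {"Elements": element}
        for column in STATIC_COLUMNS[1:]:
            cells = by_label.get(column.lower(), [])
            # Leave the cell empty if the template has no such row or value.
            record[column] = cells[i].strip() if i < len(cells) else ""
        records.append(record)
    return records


if __name__ == "__main__":
    # Hypothetical template where 'Mn' happens to be the 3rd column.
    sample = [
        ["Element", "C", "Mn", "Si"],
        ["Maximum", "0.20", "1.50", "0.40"],
        ["Minimum", "-", "0.90", "-"],
        ["Actual", "0.17", "1.21", "0.22"],
    ]
    for record in wide_to_static(sample):
        print(record)
```

A template that lists the same elements in a different order gives the same records, which is the point of keying on names.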
Try to train with more documents that contain ‘-’, so that the different variations are covered. Any model gets better the more you use it. Try to include auto-retrain and upload the documents that are not fully extracted, so that it gets better over time.
The columns you define should be common to all the templates, not only one, because the model needs them as identifiers.
Not predict. But from Studio, if you feel any document is not extracted properly, you can auto-upload it to the dataset and then rerun the pipeline. This improves the accuracy over time, but not immediately.
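One way to decide which documents to send back to the dataset is to check the extracted rows for empty cells. The Python sketch below is only an illustration under stated assumptions: each extracted row is a dict keyed by the four column names from the question, and every cell in the source PDF holds either a value or ‘-’, so an empty cell means something was missed. The function names are hypothetical, and the upload itself is left to whatever tooling you already use.

```python
# Column names taken from the question above.
REQUIRED_COLUMNS = ("Elements", "Maximum", "Minimum", "Actual")


def find_incomplete_cells(rows):
    """Return (row_index, column_name) for every empty or missing cell.

    A cell containing '-' counts as filled. An empty cell is treated as a
    missed extraction, which assumes the source never leaves cells blank.
    """
    missing = []
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            if not str(row.get(column, "")).strip():
                missing.append((index, column))
    return missing


def needs_review(rows):
    """True when at least one cell is empty, i.e. the document is a
    candidate for adding to the retraining dataset."""
    return bool(find_incomplete_cells(rows))


if __name__ == "__main__":
    # Hypothetical extraction result where one '-' was not picked up.
    extracted = [
        {"Elements": "C", "Maximum": "0.20", "Minimum": "-", "Actual": "0.17"},
        {"Elements": "Mn", "Maximum": "1.50", "Minimum": "", "Actual": "1.21"},
    ]
    print(find_incomplete_cells(extracted))
    print(needs_review(extracted))
```

This only catches cells that came back empty. A value that was read wrongly still needs a manual check.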