Labelling table data

If I’m treating elements in rows as column headers, the data doesn’t consistently align with the same columns across different PDFs. For example, in some templates, ‘mn’ might be in the 3rd column, while in others, it could be in a different column. As a result, the extraction output is incorrect. I’ve trained the system with nearly 50 different PDF templates.
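One way to make the output independent of column position is to align each extracted row by header name rather than by index, so 'Mn' lands in the right place whichever column a template puts it in. This is only a post-processing sketch, assuming the extractor returns each row as a list together with the detected header row (the function name and canonical header list are illustrative, not part of any product API):

```python
# Sketch: align extracted cells by header name instead of position,
# since column order varies across PDF templates.
def align_row(headers, row, canonical=("Elements", "Maximum", "Minimum", "Actual")):
    # Map each detected header (case-insensitive) to its index in this template
    index = {h.strip().lower(): i for i, h in enumerate(headers)}
    # Rebuild the row in canonical order; None marks a missing column
    return {c: (row[index[c.lower()]] if c.lower() in index else None)
            for c in canonical}

# Template A: columns already in canonical order
print(align_row(["Elements", "Maximum", "Minimum", "Actual"],
                ["Mn", "1.60", "0.60", "1.20"]))
# Template B: same columns, different order - output is identical
print(align_row(["Elements", "Actual", "Minimum", "Maximum"],
                ["Mn", "1.20", "0.60", "1.60"]))
```

Both calls produce the same canonical dictionary, which is the point: downstream code never has to care which column a template used.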

Eg.pdf 1

Eg.pdf 2

If I’m using ‘Elements’, ‘Maximum’, ‘Minimum’, and ‘Actual’ as column headers, the output appears as shown in the image. However, in some cells, it fails to extract the ‘-’ symbol.

Please provide a solution for accurately extracting data from PDFs.

@AbarnaKalaiselvam

Whenever these kinds of variations exist, we need to train with roughly an equal number of each variation; otherwise it might not extract as expected.

cheers


@Anil_G


In some cells I'm not able to label the '-' symbol.
Is there any other method to capture that symbol?

Can you suggest which option is best for taking the header (row or column)?
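As a workaround for the undetected '-' cells, a post-processing step can substitute the placeholder whenever a cell comes back empty. This is a minimal sketch, assuming '-' genuinely means "no value" in these tables and that each extracted row arrives as a dict (the function name is illustrative):

```python
# Sketch: default empty or missing cells to the '-' placeholder the
# model failed to pick up at low resolution.
def fill_missing(row, placeholder="-"):
    # Treat None, empty strings, and whitespace-only cells as missing
    return {col: (val if val and val.strip() else placeholder)
            for col, val in row.items()}

print(fill_missing({"Elements": "Cu", "Maximum": "0.30",
                    "Minimum": "", "Actual": None}))
# → {'Elements': 'Cu', 'Maximum': '0.30', 'Minimum': '-', 'Actual': '-'}
```

This only makes sense if an empty cell can never mean "extraction failed on a real value"; if real values can also go missing, a confidence threshold or manual review step is safer.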

@AbarnaKalaiselvam

Making the resolution better might help. It looks like the '-' characters cannot be detected for the same reason.

Do you have a static column? If at least the column names are static, that would be a better option; that way each column will definitely be separated.

Try to train with more documents that contain '-' so that different variations are considered. Any model improves with use; try to enable auto-retraining and upload documents that are not fully extracted so that over time it gets better.

cheers


Yes, this column is static:
image
But my output should be like the image below.

Did you mean the 'Predict' option when you said auto-retrain?
image

@AbarnaKalaiselvam

All the columns you indicate should be common across documents, not only one, because the model needs identifiers.

Not 'Predict'. From Studio, if you feel any document is not extracted properly, we can auto-upload it to the dataset and then rerun the pipeline. This improves accuracy over time, but not immediately.

cheers