Line item labelling best practices

Hi guys, I am currently labelling a statements skill. One of the columns is description and another one of the columns is invoice no. I have some documents where the invoice number is within the description, like this (image attached).

If I label the invoice number that appears within the description, does that confuse the skill and reduce performance?

Another question: if I tagged a regular field in two places, would that improve or weaken our skill?

Thanks guys.

Kind Regards
Kyle

@Kyle_Gounden

  1. For different types it's better to classify and indicate whether it is line items or fields
  2. What is the need to tag in multiple places?

cheers

Q1 answer → @Kyle_Gounden First, identify the templates that have similar scenarios. From these 2 screenshots I can see that the text after # is your invoice no., so I would suggest handling these in post-processing.
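A minimal sketch of that post-processing step, in Python for illustration only. It assumes the description comes back as a single string and that the invoice number is the run of letters, digits, hyphens or slashes directly after a `#`. The function name and the sample strings are hypothetical, since the screenshots are not reproduced here.

```python
import re
from typing import Optional

# Assumption: the invoice number is the token that follows '#'.
INVOICE_AFTER_HASH = re.compile(r"#\s*([A-Za-z0-9][A-Za-z0-9\-/]*)")

def invoice_no_from_description(description: str) -> Optional[str]:
    """Return the token following '#', or None if the text has no such token."""
    match = INVOICE_AFTER_HASH.search(description)
    return match.group(1) if match else None

# Hypothetical description values
print(invoice_no_from_description("Payment for Invoice #INV-10234"))
print(invoice_no_from_description("Bank charges"))
```

If the real documents use a different marker or number format, only the pattern needs to change.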

Q2 answer → Try to avoid labelling the same data in multiple fields. It will reduce confidence both during ML Skill training and in the trained ML Skill's predictions.
Try to handle such scenarios at code level with string manipulation, keyed on the vendor name, the customer name or something else specific to the documents that are causing the issue.
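One way this could look, sketched in Python under stated assumptions: the vendor name has already been extracted as its own field, and each problematic vendor gets its own pattern. The vendor names, patterns and function name below are all hypothetical placeholders.

```python
import re
from typing import Optional

# Hypothetical vendors and patterns, for illustration only.
VENDOR_PATTERNS = {
    "acme supplies": re.compile(r"#\s*(\d+)"),
    "globex": re.compile(r"\bINV[- ]?(\d+)\b", re.IGNORECASE),
}

def extract_invoice_no(vendor_name: str, description: str) -> Optional[str]:
    """Apply the vendor-specific rule, if one exists, to the description text."""
    pattern = VENDOR_PATTERNS.get(vendor_name.strip().lower())
    if pattern is None:
        return None  # no special rule: keep whatever the skill extracted
    match = pattern.search(description)
    return match.group(1) if match else None
```

Documents from vendors without a rule pass through untouched, so the special handling stays limited to the templates that need it.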

When labeling within tables, especially if there are multiple pieces of information in a cell, it’s generally better to select and label all contents within the relevant column cell. After extracting the whole cell’s data, specific information can be obtained using string manipulation or RegEx. This ensures that the model captures data from the correct location and allows for further processing as needed.
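As a rough sketch of that idea at table level (Python, illustrative only): assume the extracted rows arrive as dictionaries with hypothetical `description` and `invoice_no` keys, and that the invoice number follows a `#` inside the description. Rows whose invoice number column is empty are filled from the description cell.

```python
import re

# Assumption: inside a description cell, the invoice number follows '#'.
INVOICE_IN_TEXT = re.compile(r"#\s*([A-Za-z0-9][A-Za-z0-9\-/]*)")

def fill_invoice_numbers(rows):
    """Return copies of the rows, filling empty invoice numbers from the description."""
    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not modified
        if not row.get("invoice_no"):
            match = INVOICE_IN_TEXT.search(row.get("description") or "")
            if match:
                row["invoice_no"] = match.group(1)
        result.append(row)
    return result
```

Values the model already placed in the invoice number column are left as they are; only the gaps are filled.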

As for the second question, instead of creating two separate regular fields for the same type of data, it might be more efficient to define a single regular field with the “multi-value” option. This way, multiple values can be extracted under one field, avoiding unnecessary complexity. This approach can simplify the labeling process and improve the model’s accuracy.