Minimal number of verbatims in Communications Mining

Hi everybody,

I’m currently working on a topic related to Communications Mining. After I added my data sources, it shows the message “This dataset has insufficient verbatims in it for training”.
Does anyone know the minimum number of verbatims needed to train a model, please?

Have a nice day

@jouneid.guefif

Welcome to the community

I believe this would help: Communications Mining - Model Training FAQs

cheers

Thanks !

I can’t find the exact minimum number that UiPath needs in order to work. I could only find that I need 12 months of data, but I would really like to know the exact number, please.

Best

Hi @jouneid.guefif,

I don’t think that’s a big number, if it exists at all. For me the model started updating with as few as 24 uploaded emails.

There are, however, some suggested thresholds that you might want to reach or exceed for certain things:

  • Recommended amount for a good model performance: 10k
  • Cluster recommendations in the Discover tab: around 2k or more
  • Getting an F1 score for a label: at least 25 pinned examples for that label
  • Getting some first predictions: worked for me with as few as 60 pinned examples.

EDIT (12.10.2024): guided training (Train tab) starts from about 5k+ messages.

Larger volumes will probably boost the model’s overall performance, but the basic mechanics should work even with small numbers.
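To make the numbers above easier to apply, here is a minimal Python sketch that checks a dataset against them. It is not a UiPath API: the function name `dataset_readiness` and the constant names are made up for illustration, and the thresholds are just the rough figures from this post, not official limits. It also assumes the 10k and 2k figures refer to messages and that the 60 pinned examples are counted across all labels.

```python
# Rough thresholds from this thread (one user's observations, NOT official limits).
# Assumption: the 10k and 2k figures are message counts.
RECOMMENDED_MESSAGES = 10_000     # suggested volume for good model performance
GUIDED_TRAINING_MESSAGES = 5_000  # guided training (Train tab) observed from about here
CLUSTER_MESSAGES = 2_000          # cluster recommendations in the Discover tab
F1_PINNED_PER_LABEL = 25          # pinned examples needed per label for an F1 score
FIRST_PREDICTIONS_PINNED = 60     # assumption: total pinned examples across all labels


def dataset_readiness(message_count, pinned_per_label):
    """Report which thresholds a dataset meets.

    message_count: number of messages (verbatims) in the dataset.
    pinned_per_label: dict mapping label name -> number of pinned examples.
    """
    total_pinned = sum(pinned_per_label.values())
    return {
        "recommended_volume": message_count >= RECOMMENDED_MESSAGES,
        "guided_training": message_count >= GUIDED_TRAINING_MESSAGES,
        "discover_clusters": message_count >= CLUSTER_MESSAGES,
        "first_predictions": total_pinned >= FIRST_PREDICTIONS_PINNED,
        # Labels that have enough pinned examples to get an F1 score.
        "labels_with_f1": sorted(
            label
            for label, count in pinned_per_label.items()
            if count >= F1_PINNED_PER_LABEL
        ),
    }


if __name__ == "__main__":
    # Hypothetical dataset: 3,000 messages and three labels.
    report = dataset_readiness(3000, {"Invoice": 30, "Complaint": 10, "Other": 25})
    for check, result in report.items():
        print(f"{check}: {result}")
```

Treat the output as a rough checklist only; the platform itself decides when training and validation actually kick in.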

EDIT (05.02.2025): the general rules for ML datasets are also a good starting point: a varied, balanced, large dataset with as few NaNs/nulls/blanks as possible. Generally, try to get as close to reality as possible: imagine how the model should work in production on a whole mailbox (or several) over a longer period of time, and try to figure out whether your use cases have particular time patterns, e.g. month-end closing. A large volume helps to rule out coincidences.

Cheers
Tom


Hi @jouneid.guefif, maybe this helps. This is what’s on the UiPath slides.