I am encountering an error when creating a pipeline run (train) for an OOTB NER model in AI Center. I converted my dataset to JSON format as per the documentation (AI Center - Custom Named Entity Recognition).
Despite following the format shown in the documentation (and trying various layouts for the JSON), the pipeline run fails with the following error:
ERROR: Training failed for pipeline type: TRAIN_ONLY, error: list indices must be integers or slices, not str
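For reference, the wording of this message matches the TypeError that Python raises when a list is indexed with a string key. The sketch below only reproduces that class of error. The `read_records` function and the `"data"` key are hypothetical stand-ins; I have not seen the AI Center training code, so this is an assumption about where the message could come from, not a confirmed cause.

```python
import json


def read_records(raw: str):
    """Hypothetical loader: assumes the top level of the file is an object."""
    payload = json.loads(raw)
    # Indexing with a string works on a dict but raises TypeError on a list.
    return payload["data"]


def try_read(raw: str):
    """Return the parsed records, or the error text if indexing fails."""
    try:
        return read_records(raw)
    except TypeError as exc:
        return f"TypeError: {exc}"


# Top level is a list, so the string key cannot be applied to it.
top_level_list = '[{"text": "33 COLONIAL CT", "entities": []}]'
# Top level is an object, so the string key lookup succeeds.
top_level_object = '{"data": [{"text": "33 COLONIAL CT", "entities": []}]}'

list_result = try_read(top_level_list)
object_result = try_read(top_level_object)

print(list_result)
print(object_result)
```

The same message would also appear if the code treated a list at any deeper level (for example the `entities` array) as if it were a dictionary, so the reproduction narrows the problem to a shape mismatch rather than to one specific key.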
As mentioned above, I’ve tried various layouts. These are only two of the many I have tried, including the layout in the documentation:
[
  {
    "text": "33 COLONIAL CT WEST MIFFLIN, WA 17340",
    "entities": [
      {
        "entity": "ADDR_NUMBER",
        "value": "33",
        "start_index": 0,
        "end_index": 2
      },
      {
        "entity": "STREET",
        "value": "COLONIAL CT",
        "start_index": 3,
        "end_index": 14
      },
      {
        "entity": "CITY",
        "value": "WEST MIFFLIN",
        "start_index": 15,
        "end_index": 27
      },
      {
        "entity": "STATE",
        "value": "WA",
        "start_index": 29,
        "end_index": 31
      },
      {
        "entity": "ZIPCODE",
        "value": "17340",
        "start_index": 32,
        "end_index": 37
      }
    ]
  }
]
{
  "data": [
    {
      "text": "123 Main Street Springfield, IL 12345",
      "entities": [
        {
          "start": 0,
          "end": 3,
          "entity": "ADDR_NUMBER"
        },...
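To rule out offset mistakes in the data itself, a small check like the one below can compare each entity's value against the slice of the text it points to. It is a minimal sketch using the first record above; `check_offsets` is a helper name I made up, and it assumes the end index is exclusive, as in Python slicing.

```python
def check_offsets(record, start_key="start_index", end_key="end_index"):
    """Return (entity, expected value, actual slice) for every mismatch."""
    problems = []
    for ent in record["entities"]:
        # Assumes an exclusive end index, i.e. text[start:end].
        span = record["text"][ent[start_key]:ent[end_key]]
        if span != ent["value"]:
            problems.append((ent["entity"], ent["value"], span))
    return problems


record = {
    "text": "33 COLONIAL CT WEST MIFFLIN, WA 17340",
    "entities": [
        {"entity": "ADDR_NUMBER", "value": "33", "start_index": 0, "end_index": 2},
        {"entity": "STREET", "value": "COLONIAL CT", "start_index": 3, "end_index": 14},
        {"entity": "CITY", "value": "WEST MIFFLIN", "start_index": 15, "end_index": 27},
        {"entity": "STATE", "value": "WA", "start_index": 29, "end_index": 31},
        {"entity": "ZIPCODE", "value": "17340", "start_index": 32, "end_index": 37},
    ],
}

mismatches = check_offsets(record)
print(mismatches)  # an empty list means every value matches its slice
```

For the layout that uses `start` and `end` instead, the same helper can be called as `check_offsets(record, "start", "end")`, provided each entity also carries a `value` field.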
I’ve tried every combination of environment variables:
- dataset.input_format: json
- dataset.input_column_name: text
- dataset.output_column_name: entities

- dataset.input_format: json
- dataset.input_column_name: data.text
- dataset.output_column_name: data.entities
I’ve tried datasets of 1, 100, and 8000 records. I’ve tried removing all special characters, including commas (even though these aren’t excluded in the documentation’s example). I’ve tried converting to JSON from both CSV and CSV UTF-8 formats. The only way I can get past this error is by having one record like so:
{
  "text": "123 Main Street Springfield, IL 12345",
  "entities": [
    {
      "entity": "ADDR_NUMBER",
      "value": "123",
      "start": 0,
      "end": 3
    },...
but that fails because there is only one record. When I add two records like so…
{
  "text": "123 Main Street Springfield, IL 12345",
  "entities": [
    {
      "entity": "ADDR_NUMBER",
      "value": "123",
      "start": 0,
      "end": 3
    },...
{
  "text": "123 Main Street Springfield, IL 12345",
  "entities": [
    {
      "entity": "ADDR_NUMBER",
      "value": "123",
      "start": 0,
      "end": 3
    },...
it fails with “ERROR: extra data at line xyz”. I think it’s safe to assume it’s not the data itself that is causing the issue. I’ve exhausted all the possibilities I can think of. I even tried using UiPath’s own JSON from the above documentation, and that failed too (improper escape character error). If I resolve that error and then add a second record, it fails with “extra data at line xyz”.
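The "extra data" wording looks like what Python's standard `json` module reports when a file contains more than one top-level value, which is what two objects placed back to back amount to. The sketch below reproduces that, assuming the pipeline parses the whole file in one call with that module (a guess based on the error text, not something I can confirm). The `parse_status` helper is mine.

```python
import json

one = '{"text": "123 Main Street Springfield, IL 12345", "entities": []}'
two_concatenated = one + "\n" + one       # two objects back to back
as_array = "[" + one + "," + one + "]"    # the same two objects in one array


def parse_status(raw: str) -> str:
    """Return 'ok' if the whole string is one JSON value, else the error message."""
    try:
        json.loads(raw)
        return "ok"
    except json.JSONDecodeError as exc:
        return exc.msg


print(parse_status(one))               # ok
print(parse_status(two_concatenated))  # Extra data
print(parse_status(as_array))          # ok

# Reading one object per line (JSON Lines style) also parses cleanly.
per_line = [json.loads(line) for line in two_concatenated.splitlines()]
print(len(per_line))                   # 2
```

This only explains why the two-record file is rejected as a single JSON document. It does not show which multi-record layout the training pipeline actually expects, since the array layout parses as valid JSON but still produced the "list indices" error for me.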
I’ve explored UiPath forums and the closest thing I could find related to my issue can be seen here:
According to the above post, the first step in resolving this issue is to convert my training data to CoNLL format. Although the post is relatively new, that answer seems to be outdated, as it is directly contradicted by the UiPath documentation itself (link below):
Custom Named Entity Recognition:
“This model allows you to bring your own dataset tagged with entities you want to extract. The training and evaluation datasets need to be in either CoNLL or JSON format.”
I’m not sure what I am doing wrong (perhaps I am overlooking something painfully obvious). At this point I am considering moving to an alternate format, but given the difficulty I am having with the JSON format, I’m a little reluctant to try another format only to find out that it fails too. Any feedback or suggestions are greatly appreciated.