Hi,
My full-training pipeline failed with "Pipeline failed due to ML Package Issue", and I can't work out what about the ML package caused it to fail. Could someone please help me figure this out? The full run log is below.
Full training of ae3cb5393fd97a11c 23.4.1.0 launched - Run a2314c71-99a6-4171-b51b-827cd2013f02
Full training of ae3cb5393fd97a11c 23.4.1.0 started - Run a2314c71-99a6-4171-b51b-827cd2013f02
Full training of ae3cb5393fd97a11c 23.4.1.0 scheduled - Run a2314c71-99a6-4171-b51b-827cd2013f02
Full training of ae3cb5393fd97a11c 23.4.1.0 failed - Run a2314c71-99a6-4171-b51b-827cd2013f02
Error Details : Pipeline failed due to ML Package Issue
15:46,729 - root:train:264 - INFO: dataloader workers: 0
2023-06-14 10:15:46,729 - root:train:273 - INFO: Training for 100 epochs…
2023-06-14 10:15:46,730 - root:train:275 - INFO: Training Set Size: 262 samples
2023-06-14 10:15:46,730 - root:train:278 - INFO: Test Set Size: 0 samples
2023-06-14 10:15:46,731 - root:train:298 - INFO: Training for 100 epochs …
2023-06-14 10:28:24,468 - root:_log_to_console:857 - INFO: Epoch 001 [TRAIN][loss_tag/all: 0.3899][acc_tag/all: 0.0729][lr: 0.001000]
2023-06-14 10:28:24,483 - root:_log_to_console:857 - INFO: Epoch 001 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 10:28:24,483 - root:train:308 - INFO: Forward activation cache fully filled for frozen backbone training…
2023-06-14 10:34:44,593 - root:_log_to_console:857 - INFO: Epoch 002 [TRAIN][loss_tag/all: 0.0691][acc_tag/all: 0.3505][lr: 0.001000]
2023-06-14 10:34:44,604 - root:_log_to_console:857 - INFO: Epoch 002 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 10:41:01,250 - root:_log_to_console:857 - INFO: Epoch 003 [TRAIN][loss_tag/all: 0.0148][acc_tag/all: 0.6224][lr: 0.001000]
2023-06-14 10:41:01,262 - root:_log_to_console:857 - INFO: Epoch 003 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 10:47:20,196 - root:_log_to_console:857 - INFO: Epoch 004 [TRAIN][loss_tag/all: 0.0137][acc_tag/all: 0.7665][lr: 0.001000]
2023-06-14 10:47:20,208 - root:_log_to_console:857 - INFO: Epoch 004 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 10:53:39,836 - root:_log_to_console:857 - INFO: Epoch 005 [TRAIN][loss_tag/all: 0.0060][acc_tag/all: 0.8606][lr: 0.001000]
2023-06-14 10:53:39,849 - root:_log_to_console:857 - INFO: Epoch 005 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:00:00,287 - root:_log_to_console:857 - INFO: Epoch 006 [TRAIN][loss_tag/all: 0.0039][acc_tag/all: 0.9160][lr: 0.001000]
2023-06-14 11:00:00,299 - root:_log_to_console:857 - INFO: Epoch 006 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:06:25,083 - root:_log_to_console:857 - INFO: Epoch 007 [TRAIN][loss_tag/all: 0.0025][acc_tag/all: 0.9492][lr: 0.001000]
2023-06-14 11:06:25,095 - root:_log_to_console:857 - INFO: Epoch 007 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:12:49,418 - root:_log_to_console:857 - INFO: Epoch 008 [TRAIN][loss_tag/all: 0.0055][acc_tag/all: 0.9555][lr: 0.001000]
2023-06-14 11:12:49,429 - root:_log_to_console:857 - INFO: Epoch 008 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:19:16,702 - root:_log_to_console:857 - INFO: Epoch 009 [TRAIN][loss_tag/all: 0.0023][acc_tag/all: 0.9695][lr: 0.001000]
2023-06-14 11:19:16,713 - root:_log_to_console:857 - INFO: Epoch 009 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:25:41,900 - root:_log_to_console:857 - INFO: Epoch 010 [TRAIN][loss_tag/all: 0.0021][acc_tag/all: 0.9771][lr: 0.001000]
2023-06-14 11:25:41,912 - root:_log_to_console:857 - INFO: Epoch 010 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:32:11,334 - root:_log_to_console:857 - INFO: Epoch 011 [TRAIN][loss_tag/all: 0.0017][acc_tag/all: 0.9829][lr: 0.001000]
2023-06-14 11:32:11,346 - root:_log_to_console:857 - INFO: Epoch 011 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:38:32,849 - root:_log_to_console:857 - INFO: Epoch 012 [TRAIN][loss_tag/all: 0.0008][acc_tag/all: 0.9883][lr: 0.001000]
2023-06-14 11:38:32,861 - root:_log_to_console:857 - INFO: Epoch 012 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:44:52,550 - root:_log_to_console:857 - INFO: Epoch 013 [TRAIN][loss_tag/all: 0.0013][acc_tag/all: 0.9898][lr: 0.001000]
2023-06-14 11:44:52,562 - root:_log_to_console:857 - INFO: Epoch 013 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:51:12,559 - root:_log_to_console:857 - INFO: Epoch 014 [TRAIN][loss_tag/all: 0.0025][acc_tag/all: 0.9852][lr: 0.001000]
2023-06-14 11:51:12,572 - root:_log_to_console:857 - INFO: Epoch 014 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 11:57:34,132 - root:_log_to_console:857 - INFO: Epoch 015 [TRAIN][loss_tag/all: 0.0013][acc_tag/all: 0.9879][lr: 0.001000]
2023-06-14 11:57:34,143 - root:_log_to_console:857 - INFO: Epoch 015 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:03:56,642 - root:_log_to_console:857 - INFO: Epoch 016 [TRAIN][loss_tag/all: 0.0011][acc_tag/all: 0.9900][lr: 0.001000]
2023-06-14 12:03:56,653 - root:_log_to_console:857 - INFO: Epoch 016 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:10:14,623 - root:_log_to_console:857 - INFO: Epoch 017 [TRAIN][loss_tag/all: 0.0020][acc_tag/all: 0.9871][lr: 0.001000]
2023-06-14 12:10:14,636 - root:_log_to_console:857 - INFO: Epoch 017 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:16:36,067 - root:_log_to_console:857 - INFO: Epoch 018 [TRAIN][loss_tag/all: 0.0501][acc_tag/all: 0.9182][lr: 0.001000]
2023-06-14 12:16:36,080 - root:_log_to_console:857 - INFO: Epoch 018 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:22:57,558 - root:_log_to_console:857 - INFO: Epoch 019 [TRAIN][loss_tag/all: 0.0022][acc_tag/all: 0.9528][lr: 0.001000]
2023-06-14 12:22:57,571 - root:_log_to_console:857 - INFO: Epoch 019 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:29:20,596 - root:_log_to_console:857 - INFO: Epoch 020 [TRAIN][loss_tag/all: 0.0008][acc_tag/all: 0.9735][lr: 0.001000]
2023-06-14 12:29:20,608 - root:_log_to_console:857 - INFO: Epoch 020 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:35:42,223 - root:_log_to_console:857 - INFO: Epoch 021 [TRAIN][loss_tag/all: 0.0006][acc_tag/all: 0.9843][lr: 0.001000]
2023-06-14 12:35:42,234 - root:_log_to_console:857 - INFO: Epoch 021 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:42:01,800 - root:_log_to_console:857 - INFO: Epoch 022 [TRAIN][loss_tag/all: 0.0004][acc_tag/all: 0.9902][lr: 0.001000]
2023-06-14 12:42:01,811 - root:_log_to_console:857 - INFO: Epoch 022 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:48:22,597 - root:_log_to_console:857 - INFO: Epoch 023 [TRAIN][loss_tag/all: 0.0006][acc_tag/all: 0.9929][lr: 0.001000]
2023-06-14 12:48:22,609 - root:_log_to_console:857 - INFO: Epoch 023 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 12:54:40,307 - root:_log_to_console:857 - INFO: Epoch 024 [TRAIN][loss_tag/all: 0.0010][acc_tag/all: 0.9933][lr: 0.001000]
2023-06-14 12:54:40,319 - root:_log_to_console:857 - INFO: Epoch 024 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 13:00:58,053 - root:_log_to_console:857 - INFO: Epoch 025 [TRAIN][loss_tag/all: 0.0004][acc_tag/all: 0.9955][lr: 0.001000]
2023-06-14 13:00:58,065 - root:_log_to_console:857 - INFO: Epoch 025 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 13:07:16,244 - root:_log_to_console:857 - INFO: Epoch 026 [TRAIN][loss_tag/all: 0.0002][acc_tag/all: 0.9967][lr: 0.000316]
2023-06-14 13:07:16,256 - root:_log_to_console:857 - INFO: Epoch 026 [TEST ][loss_tag/all: nan][acc_tag/all: nan]
2023-06-14 13:07:16,257 - root:train:315 - INFO: Stopping training at epoch 26 after 26 epochs without improvement.
2023-06-14 13:07:16,257 - root:train:319 - INFO: Training complete. Score -1000000.0000 Epoch 0
2023-06-14 13:07:16,257 - root:save:227 - INFO: Saving model object…
2023-06-14 13:07:16,270 - root:save_network:265 - INFO: Saving network state_dict to disk…
2023-06-14 13:07:27,276 - root:train:126 - ERROR: No best model saved. Try training for more epochs or add more data to your training set.
NoneType: None
2023-06-14 13:07:27,276 - root:train:172 - ERROR: multi_task_base Extraction Model Training Failed
Traceback (most recent call last):
File "", line 132, in train
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/model/microservice/models/multi_task_base/network.p'
2023-06-14 13:07:27,281 - UiPath_core.training_plugin:trigger_full_training_and_publish_model:599 - ERROR: Failed to trigger full training and publish data, error: multi_task_base Extraction Model Training Failed
2023-06-14 13:07:27,282 - UiPath_core.training_plugin:model_run:189 - ERROR: Training failed for pipeline type: FULL_TRAINING, error: multi_task_base Extraction Model Training Failed
2023-06-14 13:07:27,319 - UiPath_core.trainer_run:main:91 - ERROR: Training Job failed, error: multi_task_base Extraction Model Training Failed
Traceback (most recent call last):
File "", line 132, in train
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/model/microservice/models/multi_task_base/network.p'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/model/bin/UiPath_core/trainer_run.py", line 86, in main
wrapper.run()
File "/workspace/model/microservice/training_wrapper.py", line 64, in run
return self.training_plugin.model_run()
File "/model/bin/UiPath_core/training_plugin.py", line 205, in model_run
raise ex
File "/model/bin/UiPath_core/training_plugin.py", line 177, in model_run
self.run_full_training()
File "/model/bin/UiPath_core/training_plugin.py", line 226, in run_full_training
self.trigger_full_training_and_publish_model()
File "/model/bin/UiPath_core/training_plugin.py", line 600, in trigger_full_training_and_publish_model
raise e
File "/model/bin/UiPath_core/training_plugin.py", line 573, in trigger_full_training_and_publish_model
self.train_model(self.training_data_directory)
File "/model/bin/UiPath_core/training_plugin.py", line 131, in train_model
response = self.model.train(directory)
File "/workspace/model/microservice/train.py", line 53, in train
self.trainer.train(self.dataset)
File "", line 172, in train
Exception: multi_task_base Extraction Model Training Failed
2023-06-14 13:07:27,319 - UiPath_core.trainer_run:main:98 - INFO: Job run stopped.
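In case it helps connect the dots: the log shows "Test Set Size: 0 samples", every [TEST] metric is nan, and the job ends with "Score -1000000.0000 Epoch 0" followed by the missing network.p file. The sketch below is my own illustration of how those could be related (this is not the actual UiPath training code; `evaluate` and `select_best` are hypothetical names): averaging a metric over an empty test set yields nan, and since nan never compares greater than the sentinel score, no epoch is ever recorded as "best", so no best checkpoint would be written to disk.

```python
def evaluate(losses):
    """Average test loss over a batch of per-sample losses.
    With an empty test set there is nothing to average, giving nan,
    which mirrors the [TEST] loss_tag/all: nan lines in the log."""
    if not losses:
        return float("nan")
    return sum(losses) / len(losses)

def select_best(epoch_test_losses, sentinel=-1000000.0):
    """Track the best-scoring epoch across training.
    Any comparison involving nan is False, so when every test loss is
    nan, no epoch ever beats the sentinel: the result stays at
    (sentinel, epoch 0), and no best checkpoint would be saved --
    consistent with the later FileNotFoundError for network.p."""
    best_score, best_epoch = sentinel, 0
    for epoch, loss in enumerate(epoch_test_losses, start=1):
        score = -loss  # higher is better (lower loss)
        if score > best_score:  # always False when loss is nan
            best_score, best_epoch = score, epoch
    return best_score, best_epoch

# 26 epochs with an empty test set: every score is nan, nothing improves.
print(select_best([float("nan")] * 26))  # -> (-1000000.0, 0)
```

If that reading is right, the fix would be on the data side: make sure the dataset actually yields a non-empty evaluation split, not something in the ML package itself.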