Model Training Failure With CPUs and Quite Large Dataset

system · December 29, 2022, 1:42pm

Machine learning model training on CPUs with a fairly large dataset (3000+ pages) fails after a long training time on an on-premises, air-gapped AI Center installation.

Issue Description:

Machine learning model training on CPUs with a fairly large dataset (3000+ pages) fails after a long training time in Automation Suite 23.4 AI Center.

In Automation Suite AI Center, the training run takes so long that it exceeds the default pipeline timeout of 7 days.

Resolution:

Pipelines are automatically killed if they stay in a running state for longer than the default timeout period.
To overcome this, when creating the pipeline in AI Center, add the environment variable JOB_TIMEOUT_AFTER_DAYS on the pipeline details page and set its value to the required number of days.

For example, to raise the timeout to 14 days, set JOB_TIMEOUT_AFTER_DAYS=14.
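
To choose a value for JOB_TIMEOUT_AFTER_DAYS, it can help to estimate the total run time from a partial run. The following is only a rough sketch in Python, not part of AI Center: the function name suggest_timeout_days, the fraction_done input (for example, read off the training logs) and the 1.5x headroom factor are all assumptions made for illustration.

```python
import math

# Default pipeline timeout in days, as stated in this article.
DEFAULT_TIMEOUT_DAYS = 7


def suggest_timeout_days(elapsed_hours: float, fraction_done: float,
                         headroom: float = 1.5) -> int:
    """Suggest a JOB_TIMEOUT_AFTER_DAYS value from a partially completed run.

    Hypothetical helper: extrapolates total training time linearly from the
    elapsed time and the fraction of training completed, then adds headroom.
    """
    if elapsed_hours <= 0 or not 0 < fraction_done <= 1:
        raise ValueError("elapsed_hours must be > 0 and fraction_done in (0, 1]")
    if headroom < 1:
        raise ValueError("headroom must be >= 1")

    # Linear extrapolation of the total run time, converted to days.
    estimated_total_days = elapsed_hours / fraction_done / 24

    # Round up, and never suggest less than the default timeout.
    return max(DEFAULT_TIMEOUT_DAYS, math.ceil(estimated_total_days * headroom))


if __name__ == "__main__":
    # A run that is half done after 4 days (96 hours).
    days = suggest_timeout_days(elapsed_hours=96, fraction_done=0.5)
    print(f"JOB_TIMEOUT_AFTER_DAYS={days}")
```

Linear extrapolation is a simplification; training speed on CPUs can vary over a run, so treat the result as a lower bound and round up generously.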
