Model Training Failure With CPUs and Quite Large Dataset

Machine Learning model training on CPUs with a fairly large dataset (3000+ pages) fails after a long training time on an on-premises, air-gapped AI Center installation.

Issue Description:

Machine Learning model training on CPUs with a fairly large dataset (3000+ pages) fails after a long training time in Automation Suite 23.4 AI Center.

In Automation Suite AI Center, the training run takes so long that it exceeds the default timeout of 7 days.

Resolution:

Pipelines are automatically killed if they stay in a running state for longer than the default timeout period.
To work around this, when creating the pipeline in AI Center, add the environment variable JOB_TIMEOUT_AFTER_DAYS on the pipeline details page and set its value to the number of days the pipeline is allowed to run.

See the picture below for an example (JOB_TIMEOUT_AFTER_DAYS=14):

image.png
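
The article does not say how to choose the number of days. As a rough sketch only, the value could be derived from an estimate of the training duration plus a safety margin. The helper name suggest_job_timeout_days, the 1.5 safety factor, and the 220-hour estimate are all assumptions made for illustration; they are not part of AI Center. The sketch also assumes the variable takes a whole number of days.

```python
import math

# Default pipeline timeout stated in this article (in days).
DEFAULT_TIMEOUT_DAYS = 7


def suggest_job_timeout_days(estimated_hours, safety_factor=1.5):
    """Hypothetical helper: suggest a value for JOB_TIMEOUT_AFTER_DAYS.

    estimated_hours: your own estimate of the training run time in hours.
    safety_factor:   assumed margin (>= 1) to absorb slower-than-expected runs.
    """
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")

    # Convert padded hours to days and round up to a whole day.
    needed_days = math.ceil(estimated_hours * safety_factor / 24)

    # Never suggest less than the default; in that case no override is needed.
    return max(needed_days, DEFAULT_TIMEOUT_DAYS)


if __name__ == "__main__":
    # Example: a run estimated at about 220 hours (an illustrative figure).
    days = suggest_job_timeout_days(estimated_hours=220)
    print(f"JOB_TIMEOUT_AFTER_DAYS={days}")
```

With the illustrative 220-hour estimate, the sketch suggests 14 days, which matches the value used in the example above. The resulting number is then entered as the value of JOB_TIMEOUT_AFTER_DAYS when creating the pipeline.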