Machine Learning Model training with CPUs and a quite large dataset (3000+ pages) fails after long training time on on-prem Airgapped installed AI Center.
Issue Description:
Machine Learning Model training with CPUs and a quite large dataset (3000+ pages) fails after a long training time in Automation Suite 23.4 AI Center.
In Automation Suite AI Center The training run takes a long time to run which exceeds the default timeout of 7 days.
Resolution:
Pipelines will be automatically killed if staying in a running state for more than the default period of days.
To overcome this, while creating a pipeline in AI Center, add an environment variable JOB_TIMEOUT_AFTER_DAYS and give a value of the day number on the pipeline property detailed page.
Please refer to the below picture for example (JOB_TIMEOUT_AFTER_DAYS=14):