Model Training Failure With CPUs and a Large Dataset

Machine learning model training on CPUs with a fairly large dataset (3000+ pages) fails after a long training time on an on-premises, air-gapped AI Center installation.

Root Cause: The training run takes longer than the default pipeline timeout of 7 days.

Resolution: A pipeline is automatically killed if it stays in the running state for more than 7 days.

To overcome this, set the environment variable JOB_TIMEOUT_AFTER_DAYS to the required number of days when creating the pipeline (for example, JOB_TIMEOUT_AFTER_DAYS=14).
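
Choosing the number of days is simple arithmetic, and a small helper can make it repeatable. The sketch below is not part of AI Center: the function names, the 1.5x safety factor, and the idea of estimating training hours up front are all assumptions made for illustration. It only computes a value and formats the assignment to enter when creating the pipeline.

```python
import math

# Default pipeline timeout in days, as stated in this article.
DEFAULT_TIMEOUT_DAYS = 7


def suggest_job_timeout_days(estimated_training_hours: float,
                             safety_factor: float = 1.5) -> int:
    """Hypothetical helper: pad the estimated run time by a safety factor,
    round up to whole days, and never go below the 7-day default."""
    if estimated_training_hours <= 0:
        raise ValueError("estimated_training_hours must be positive")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    padded_days = estimated_training_hours * safety_factor / 24
    return max(DEFAULT_TIMEOUT_DAYS, math.ceil(padded_days))


def as_env_assignment(days: int) -> str:
    """Format the value as the environment variable assignment for the pipeline."""
    return f"JOB_TIMEOUT_AFTER_DAYS={days}"


if __name__ == "__main__":
    # Assumed estimate: about 220 hours of CPU training.
    days = suggest_job_timeout_days(220)
    print(as_env_assignment(days))
```

The estimate itself has to come from your own environment, for example by extrapolating from a shorter run on a subset of the dataset. The helper never returns fewer than 7 days, since lowering the timeout below the default would not help with this issue.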