I am building a model to extract data from documents using the Document Understanding package. Whenever I create a pipeline to train the model, it fails with a dataset preprocessing error. I have double-checked my data labeling, so please let me know if I am doing anything wrong. These are the steps I verified:
- Import the documents and uncheck “Make this an evaluation set”
- When selecting the dataset during pipeline creation, select the exported dataset
Even after double-checking those steps, I keep getting the same error:
Train only of Form24Extractor_Preview 22.6.1-preview.0 scheduled - Run 3841daf8-58ab-4a38-b696-fea7ed433b7b
Train only of Form24Extractor_Preview 22.6.1-preview.0 launched - Run 3841daf8-58ab-4a38-b696-fea7ed433b7b
Train only of Form24Extractor_Preview 22.6.1-preview.0 started - Run 3841daf8-58ab-4a38-b696-fea7ed433b7b
Train only of Form24Extractor_Preview 22.6.1-preview.0 failed - Run 3841daf8-58ab-4a38-b696-fea7ed433b7b
Error Details : Pipeline failed due to ML Package Issue
2022-11-02 02:40:24,848 - uipath_core.trainer_run:main:73 - INFO: Starting training job…
2022-11-02 02:40:28,737 - matplotlib:_get_config_or_cache_dir:484 - WARNING: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-t2u3kjwl because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2022-11-02 02:40:29,508 - matplotlib.font_manager:_load_fontmanager:1443 - INFO: generated new fontManager
2022-11-02 02:40:32,614 - uipath_core.storage.azure_storage_client:download:115 - INFO: Dataset from bucket folder training-d4872c60-f69f-4e21-8a48-55a30feba0e9/f8516607-490b-4ee1-be17-c2fcc3aa4ed6/e957ea1c-3191-4aca-b049-961dac2c5dda/export/Form24SSMExport_22-11-01T072540 with size 38 downloaded successfully
2022-11-02 02:40:32,614 - uipath_core.training_plugin:train_model:116 - INFO: Start model training…
2022-11-02 02:40:32,614 - uipath_core.training_plugin:initialize_model:110 - INFO: Start model initialization…
2022-11-02 02:40:32,615 - root:initialize_package:145 - INFO: Using package type provided by runtime argument with value: du
2022-11-02 02:40:32,615 - root:initialize_package:154 - INFO: Initializing du package options …
2022-11-02 02:40:32,618 - root:configure_options:107 - INFO: Training with random slices: False
2022-11-02 02:40:32,618 - root:configure_options:108 - INFO: Sample by size: False
2022-11-02 02:40:32,618 - root:configure_options:141 - INFO: Determining dataset language for document type du…
2022-11-02 02:40:32,647 - root:configure_options:144 - INFO: Document type du language: en
2022-11-02 02:40:32,647 - root:initialize_package:159 - INFO: System-Level Configuration:
2022-11-02 02:40:32,647 - root:initialize_package:160 - INFO: ATen/Parallel:
at::get_num_threads() : 3
at::get_num_interop_threads() : 2
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 3
Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 3
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
std::hardware_concurrency() : 4
Environment variables:
OMP_NUM_THREADS : 3
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
2022-11-02 02:40:32,648 - root:configure_options:107 - INFO: Training with random slices: False
2022-11-02 02:40:32,648 - root:configure_options:108 - INFO: Sample by size: False
2022-11-02 02:40:32,648 - root:configure_options:144 - INFO: Document type du language: en
2022-11-02 02:40:32,648 - uipath_core.training_plugin:initialize_model:113 - INFO: Model initialized successfully
2022-11-02 02:40:32,648 - root:log_data_version_info:13 - INFO: =========Data version information=========
2022-11-02 02:40:32,664 - root:log_data_version_info:17 - WARNING: Unknown data version:
2022-11-02 02:40:32,664 - root:log_data_version_info:17 - INFO: ==========================================
2022-11-02 02:40:32,664 - root:preprocess_data:575 - INFO: Creating dataset for document type du…
2022-11-02 02:40:32,697 - root:preprocess_data:577 - INFO: Doctype du Statistics:
2022-11-02 02:40:32,697 - root:preprocess_data:580 - INFO:
Extraction fields:
tag = 5287
tag[companyname] = 28
tag[brn] = 11
Subsets:
subset[TEST] = 6
2022-11-02 02:40:32,698 - root:create_processor:43 - INFO: Loading LayoutLMV2 processor from HuggingFace …
2022-11-02 02:40:38,266 - root:preprocess_data:649 - INFO: train: (0, 16) pages
2022-11-02 02:40:38,266 - root:preprocess_data:650 - INFO: test: (0, 16) pages
2022-11-02 02:40:38,266 - root:preprocess_dataset:50 - ERROR: Dataset preprocess Failed
Traceback (most recent call last):
File “”, line 49, in preprocess_dataset
File “”, line 147, in __init__
File “”, line 35, in __init__
File “”, line 651, in preprocess_data
AssertionError: Training and / or validation set is empty, verify that training / validation split is correctly set
2022-11-02 02:40:38,269 - uipath_core.training_plugin:model_run:152 - ERROR: Training failed for pipeline type: TRAIN_ONLY, error: Dataset preprocess Failed
2022-11-02 02:40:38,274 - uipath_core.trainer_run:main:90 - ERROR: Training Job failed, error: Dataset preprocess Failed
Traceback (most recent call last):
File “”, line 49, in preprocess_dataset
File “”, line 147, in __init__
File “”, line 35, in __init__
File “”, line 651, in preprocess_data
AssertionError: Training and / or validation set is empty, verify that training / validation split is correctly set
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/model/bin/uipath_core/trainer_run.py”, line 85, in main
wrapper.run()
File “/workspace/model/microservice/training_wrapper.py”, line 64, in run
return self.training_plugin.model_run()
File “/model/bin/uipath_core/training_plugin.py”, line 153, in model_run
raise e
File “/model/bin/uipath_core/training_plugin.py”, line 145, in model_run
self.run_train_only()
File “/model/bin/uipath_core/training_plugin.py”, line 214, in run_train_only
self.train_model(self.local_dataset_directory)
File “/model/bin/uipath_core/training_plugin.py”, line 118, in train_model
self.model.train(directory)
File “/workspace/model/microservice/train.py”, line 36, in train
self.process_data()
File “/workspace/model/microservice/train.py”, line 69, in process_data
self.trainer.preprocess_dataset()
File “”, line 50, in preprocess_dataset
Exception: Dataset preprocess Failed
Is there any way to overcome this?
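To make the relevant part of the log easier to see, here is a minimal sketch that pulls the subset counts and the train/test page lines out of the log text. It only parses the lines pasted above and does not call any UiPath API. The `TRAIN` subset name is my assumption (the log only shows `subset[TEST]`), and I am not sure how the `(0, 16)` tuples should be read, so the script just reports them as they appear.

```python
import re

# Excerpt copied from the training log above; the full log can be pasted here instead.
LOG_EXCERPT = """\
tag[companyname] = 28
tag[brn] = 11
Subsets:
subset[TEST] = 6
2022-11-02 02:40:38,266 - root:preprocess_data:649 - INFO: train: (0, 16) pages
2022-11-02 02:40:38,266 - root:preprocess_data:650 - INFO: test: (0, 16) pages
"""

# Matches lines such as "subset[TEST] = 6".
SUBSET_RE = re.compile(r"subset\[(\w+)\]\s*=\s*(\d+)")
# Matches lines such as "INFO: train: (0, 16) pages".
PAGES_RE = re.compile(r"INFO: (train|test): \((\d+), (\d+)\) pages")


def summarize_log(log_text):
    """Return the subset counts and the per-split tuples found in the log text."""
    subsets = {name: int(count) for name, count in SUBSET_RE.findall(log_text)}
    pages = {split: (int(a), int(b)) for split, a, b in PAGES_RE.findall(log_text)}
    return subsets, pages


def missing_subsets(subsets, expected=("TRAIN", "TEST")):
    """List expected subset names that have no documents in the log statistics.

    The name "TRAIN" is an assumption; only "TEST" appears in the log.
    """
    return [name for name in expected if subsets.get(name, 0) == 0]


subsets, pages = summarize_log(LOG_EXCERPT)
print("subsets:", subsets)
print("pages:", pages)
print("subsets with no documents:", missing_subsets(subsets))
```

On my log this reports only a `TEST` subset with 6 documents and no training subset at all, which seems to match the assertion message about an empty training or validation set.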