Is there a limit to the number of documents or pages that Document Manager can hold?
Issue Description: When attempting to import new documents into a data labeling session, the error API_ERROR_IMPORTMAXPAGECOUNTREACHEDERROR is displayed.
Resolution:
Document Manager currently has a 25,000-page limit per data labeling session. The page count must be reduced before new documents can be added to the dataset.
Note: It is recommended to back up the dataset before removing any documents.
Possible ways to reduce the dataset size while retaining the training data:
Method #1: Permanently delete any soft-deleted documents that are no longer needed from the dataset.
- See the example below of how documents are soft-deleted and where they are listed.
- See the example below of how to "hard-delete" (permanently delete) documents:
Method #2: Split the evaluation data out of the training dataset into a separate dataset.
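For illustration, if the original source files are still available locally, the evaluation documents can be separated before import so that each group is uploaded to its own dataset. This is only a minimal sketch under assumptions: the folder names and the "eval_" filename prefix used to identify evaluation documents are hypothetical.

```python
# Move evaluation source files into their own folder so they can be imported
# into a separate dataset. Sketch only: assumes the original documents are
# still available locally and that evaluation files carry a hypothetical
# "eval_" filename prefix; folder names are also hypothetical.
import shutil
from pathlib import Path

source = Path("all_documents")         # hypothetical folder of source docs
eval_dir = Path("evaluation_dataset")  # destination for evaluation docs
eval_dir.mkdir(exist_ok=True)

for pdf_path in source.glob("eval_*.pdf"):
    shutil.move(str(pdf_path), str(eval_dir / pdf_path.name))
```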
Additional Notes:
- Pages are not the same as documents: a single imported document may contain multiple pages (see the page-counting sketch after these notes).
- A future improvement is planned to clearly display the dataset's page count so that there is visibility into how many pages are being consumed.
- For now, to get a general idea of how many pages are currently in the data labeling session, open the Dataset Diagnostic Tab in newer versions of Document Manager (as shown below).
Note, however, that the page count shown there does not include documents in the soft-deleted subset (batch) or the evaluation subset (batch).
See the example below, which shows 17,709 pages in the current dataset. There were also ~4,000 pages in the evaluation subset and ~3,000 in the soft-deleted subset, so the true total was roughly 17,709 + 4,000 + 3,000 ≈ 24,700 pages, already close to the 25,000-page limit.
- If the choice is made to download the dataset and then create a new dataset to split up the data and reduce the number of documents/pages, note that manually manipulating the files in the downloaded folder is not officially supported and will often corrupt the dataset. To break a large dataset into smaller file sizes for uploading to a new dataset, see Split Large Datasets For Importing; a rough batch-splitting sketch also follows below.
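To estimate page consumption before importing, the pages in the source documents can be counted locally. The following is a minimal sketch, assuming the source documents are PDFs and that the third-party pypdf package is installed; the folder name is hypothetical.

```python
# Estimate how many pages a folder of PDFs will consume before importing.
# Minimal sketch: assumes local PDF source documents and the third-party
# "pypdf" package (pip install pypdf). The folder name is hypothetical.
from pathlib import Path

from pypdf import PdfReader

PAGE_LIMIT = 25_000  # Document Manager's per-session page limit

def count_pages(folder: str) -> int:
    """Sum the page counts of every PDF directly under `folder`."""
    return sum(len(PdfReader(p).pages) for p in Path(folder).glob("*.pdf"))

pages = count_pages("documents_to_import")  # hypothetical folder name
print(f"{pages} pages counted; {PAGE_LIMIT - pages} remaining under the limit")
```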
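Similarly, here is a rough sketch of grouping source documents into import batches that each stay under a chosen page budget, so that each batch can go into its own dataset. It only partitions the original source files before import; it does not modify the files inside a downloaded dataset export, which, as noted above, is not officially supported. The folder name and the 20,000-page budget are assumptions.

```python
# Greedily pack source PDFs into batches whose page totals stay under a
# budget. Sketch only: this partitions original source files before import;
# it does not touch the contents of a downloaded dataset export.
from pathlib import Path

from pypdf import PdfReader

def split_into_batches(folder: str, page_budget: int = 20_000):
    """Return lists of PDF paths whose combined page counts fit the budget."""
    batches, current, used = [], [], 0
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        pages = len(PdfReader(pdf_path).pages)
        if current and used + pages > page_budget:
            batches.append(current)
            current, used = [], 0
        current.append(pdf_path)
        used += pages
    if current:
        batches.append(current)
    return batches

for i, batch in enumerate(split_into_batches("all_documents"), start=1):
    print(f"Batch {i}: {len(batch)} documents")
```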