Extracting data from MS Word documents with Document Understanding


Can Document Understanding extract data from MS Word documents (e.g. resumes)?

Hi @Prasaanth_S1

The document understanding process is a feature of UiPath that allows you to extract data from unstructured documents such as PDFs, images, and Word documents.

You can use the IntelligentOCR package to extract data from Word documents. You can also use Python to extract text from Word documents.
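As a rough illustration of the Python route, here is a minimal sketch that uses only the standard library. It assumes the file is a .docx (a ZIP archive with the body in word/document.xml), so it will not work on legacy .doc files. The function name `extract_docx_text` is made up for this example and is not part of any UiPath package.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the non-empty paragraphs of a .docx file as a list of strings.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper for illustration only.)
    """
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling `extract_docx_text("resume.docx")` (the file name is just a placeholder) would give you the paragraph strings, which you could then pass back to your workflow. Headers, footers, and text boxes are not handled here, so treat it as a starting point only.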

If you want to extract metadata from a Word document, you can use the GroupDocs.Parser library.

Check out this

Hi @Prasaanth_S1

Yes, document understanding in UiPath can extract data from MS Word documents, such as resumes. You can use the UiPath Document Understanding framework to automatically extract data from structured and unstructured documents, including Word documents.

The first step is to create a machine learning model that can recognize the data you want to extract. This can be done using the Data Manager and Document Understanding ML Trainer in UiPath Studio.

Once you have trained your model, you can use the Document Understanding activities in UiPath to extract data from your Word documents. For example, you can use the “Read Document” activity to read the contents of a Word document, and then use the “Extract Structured Data” activity to extract specific data fields.

You can also use the “Screen Scraping” and “OCR” activities to extract data from unstructured documents, such as scanned resumes in PDF or image format.


Hi @Prasaanth_S1 ,

We might not really need Document Understanding for Word documents. If we do use it, we might need to convert the document to a PDF first so that it can go through digitization. There are also other workarounds that start from reading the Word document into text, but they may involve too many steps.
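To give an idea of what the text-based workaround could look like once the Word document has been read into a string, here is a small Python sketch that pulls an email address and a phone number out of resume text with regular expressions. The patterns and the function name `extract_contact_fields` are assumptions made for illustration. They are deliberately loose and would need tuning for real resumes.

```python
import re

# Loose illustrative patterns (assumptions, not a complete validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contact_fields(text):
    """Return the first email and phone number found in `text`, or None for each."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }
```

This only works for fields with a recognizable pattern. Fields such as skills or work history vary too much between resume layouts, which is where a trained model becomes the better fit.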

So before looking into the DU approach, we would want to understand your overall goal.

Do you have resumes in different formats that you would like to extract data from? Is the number of formats limited, or can they come in any (unpredictable) format?

If they are in different formats, then we would need to label the documents using the Document Manager, covering the different types/templates of resume that you would potentially get, and then train the model with the exported labelled dataset.

There also seems to be a post with the same concern, check below:
