Extract data from PDF and Word documents and store it in an Excel sheet

I need to extract the names and email IDs from 14 employees’ resumes. Ten of the resumes are in PDF format and the remaining four are Word documents. After extraction, the names and email IDs should be stored in an Excel sheet. Can anyone please explain how to create a bot that extracts employee names and email IDs and stores them in an Excel sheet? I need a solution as early as possible.

Here is a step-by-step process for extracting candidate names and email addresses from resumes and storing them in an Excel file.

  1. Use the Load Taxonomy activity and configure the taxonomy in Taxonomy Manager.
  2. Use the Digitize Document activity and give it the document/resume path, preferably held in a variable.
  3. Use the Classify Document activity and configure it with the name of the document type you created in Taxonomy Manager.
  4. Use the Data Extraction Scope activity.
  5. Use the Present Validation Station activity.
  6. Use the Export Extraction Results activity and assign its output to a DataSet variable.
  7. Then use the Write Range Workbook activity inside a For Each loop.

Your data will be extracted from each resume and written to an Excel file.
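To make the shape of the final output concrete, here is a minimal sketch in Python rather than in UiPath activities. It assumes extraction has already produced one name and one email per resume; the column names, the `write_rows` helper, and the sample record are hypothetical, and it writes CSV (which Excel can open) instead of a native workbook.

```python
import csv
import io

# Hypothetical column layout: one row per resume.
FIELDNAMES = ["File", "Name", "Email"]

def write_rows(records, out):
    """Write a header row followed by one row per extracted record."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Made-up sample data standing in for the extraction results.
records = [
    {"File": "resume_01.pdf", "Name": "Jane Doe", "Email": "jane.doe@example.com"},
]

buffer = io.StringIO()  # swap for open("resumes.csv", "w", newline="") to write a file
write_rows(records, buffer)
print(buffer.getvalue())
```

In the workflow itself, the Write Range Workbook step plays the role of `write_rows` here.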

If you need clarity on the activities’ configuration, kindly share a sample document and I can create a sample application for it.

Hope you find this helpful

Thank you for your reply. As all the resumes are confidential, I can’t share them. Can you please show me how to create the bot? You’ve sent the steps, but I still couldn’t understand some of them and need more clarity. Please help me with this.

I understand your concern. I will share my process so that you can see how to configure the activities.

I am unable to upload my project here. I am sharing a video link so that you can get a clear idea about it.

For a better and more detailed understanding of DU, refer to the YouTube video below.

Main.xaml (27.5 KB)
Please refer to the steps for the DU process.

Hi @AMAK,

We could provide better help if you could share the format of the resumes, along with whether scanned documents are a possibility. We would also want to know whether the same format is followed throughout or whether it differs from resume to resume.

Understanding the above would help us suggest whether to approach the extraction intelligently or to derive rules and perform regex-, string-, or word-based extraction.

Let us know if you can provide additional details or sample files to work on (not confidential files).
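As a rough illustration of the rule-based (regex) option, here is a minimal Python sketch, not a UiPath workflow. It assumes the resume text has already been read into a string; the pattern is deliberately simple and will not cover every valid address, and `find_emails` and the sample text are hypothetical.

```python
import re

# Simple email-like pattern; an assumption for illustration, not a full RFC 5322 matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return unique email-like strings in order of first appearance (case-insensitive)."""
    seen = set()
    result = []
    for match in EMAIL_PATTERN.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result

# Made-up resume text.
sample = "Jane Doe\nEmail: jane.doe@example.com | Phone: 555-0100\nAlt: JANE.DOE@example.com"
print(find_emails(sample))
```

Emails are regular enough for this kind of rule. Names usually are not, which is why knowing whether the layout is consistent matters.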

Hi @AMAK

I believe the resumes you have follow different layouts.
In addition, for the Word documents, I think we can easily convert those into PDF files using the UiPath.PDF activities. This way we only process PDF files and do not need to worry about different file formats.

We can check the file extension and, if it is a Word document (doc, docx), do the conversion.
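The extension check could look roughly like this. It is a Python sketch of the logic only, and the conversion itself is left out; `split_by_format` and the file names are hypothetical.

```python
from pathlib import Path

WORD_EXTENSIONS = {".doc", ".docx"}

def split_by_format(paths):
    """Partition paths into (pdf_files, word_files, other_files) by extension."""
    pdf_files, word_files, other_files = [], [], []
    for path in map(Path, paths):
        ext = path.suffix.lower()  # compare case-insensitively
        if ext == ".pdf":
            pdf_files.append(path)
        elif ext in WORD_EXTENSIONS:
            word_files.append(path)  # these would be converted to PDF first
        else:
            other_files.append(path)
    return pdf_files, word_files, other_files

# Made-up file names.
pdfs, words, others = split_by_format(["a.pdf", "b.DOCX", "c.doc", "notes.txt"])
print(len(pdfs), len(words), len(others))
```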

If your resumes follow different layouts, I would suggest going with UiPath Document Understanding. You will need to use the “Document Understanding” generic model available in AI Center under the out-of-the-box packages for DU. This model is not trained, so use Document Manager to upload some resumes, define the fields you want to extract, and label them. This is how you generate the training data.

Once done, you can run the training pipeline to train the model and use it in your workflow.

You can also refer to this video list in case you want to know more…

Hope this helps

Can you please explain this to me in detail? I’m new to UiPath, and I have multiple PDFs and Word documents. I need it by tomorrow. If anyone knows the procedure, please explain. I’ve tried to extract the names and email IDs in many ways but still couldn’t solve the problem.