Scraping specific data from multiple scanned PDF documents and populating it in an Excel sheet

I have a workflow that needs to:

  1. Save all the PDF attachments from automatic emails and read each attachment (this part is already completed).
  2. Scrape the report name, submission date, and form number from each one and populate them in an Excel sheet.
  3. Print the forms out.

I’ve been doing this manually every day and am now thinking of creating a bot that handles it for me.

I’ve tried all the OCR engines and CV activities to scrape the data. The workflow works for a single PDF but does not recognize the 2nd and 3rd correctly (for example, the number 8 is read as a 3 because the scans are pretty bad). The PDFs I have are not identical: they are different types of forms, scanned and sent by different organizations.

I need to scrape the organization name from each form, but my issue is that on form A the organization name field is in box 2a, while on form B it is in box 1a. So the anchors are not fixed.

Another issue I’m having is that the report submission date is included in the attachment file name, for example:

ammended F44 incident 121519.pdf
January 2020 F55.pdf
SCC untitled_02122020.pdf

I need to populate the submission date in an Excel sheet, but the dates are formatted differently, as shown above. Can my bot still recognize them? I’m not an expert in UiPath, so please help.

Did you try the Matches activity?
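
As a rough sketch of the idea, here is how the three file names from the question could be handled with regular expressions. It is written in Python only to illustrate the patterns; similar expressions should carry over to the Matches activity, but I have not tested them there. The function name `extract_submission_date` is made up for this example. It also assumes that the digit runs are month-day-year (`MMDDYY` or `MMDDYYYY`), that month names are in English, and that a file name with only a month and year maps to the first day of that month.

```python
import re
from datetime import date, datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Each entry pairs a regex with the strptime format used to parse its match.
# Order matters: the 8-digit pattern is tried before the 6-digit one.
# Lookarounds are used instead of \b because "_" counts as a word character.
PATTERNS = [
    # Assumed MMDDYYYY, e.g. "SCC untitled_02122020.pdf"
    (re.compile(r"(?<!\d)(\d{8})(?!\d)"), "%m%d%Y"),
    # Assumed MMDDYY, e.g. "ammended F44 incident 121519.pdf"
    (re.compile(r"(?<!\d)(\d{6})(?!\d)"), "%m%d%y"),
    # Month name and year, e.g. "January 2020 F55.pdf"
    (re.compile(rf"\b((?:{MONTHS}) \d{{4}})\b", re.IGNORECASE), "%B %Y"),
]


def extract_submission_date(file_name):
    """Return the first date found in file_name, or None if nothing parses."""
    for pattern, fmt in PATTERNS:
        match = pattern.search(file_name)
        if match is None:
            continue
        try:
            return datetime.strptime(match.group(1), fmt).date()
        except ValueError:
            # The digits matched but are not a valid date (e.g. month 13).
            continue
    return None


if __name__ == "__main__":
    for name in ("ammended F44 incident 121519.pdf",
                 "January 2020 F55.pdf",
                 "SCC untitled_02122020.pdf"):
        print(name, "->", extract_submission_date(name))
```

Since the date comes from the file name and not from the scanned page, this part should not be affected by the scan quality. File names that match none of the patterns return `None`, so they can be flagged for manual review.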

If the PDFs are not all in the same format, you can't achieve 100% accuracy for data scraping using UiPath alone. You can integrate ABBYY FlexiCapture with UiPath, and then you can expect results of up to about 90%. But keep in mind that the results always depend on the quality of the PDF file.


Yes, I did. Sadly, it did not work because of the quality of the PDFs.
