Scanned document converted into PDF vs Original PDF document

Hi team,

I need your help!

I have around 100 original PDF documents and a few scanned PDF documents in one folder.

Is there any activity that can identify the scanned PDF documents among the others in that folder, so that I can move them into a separate folder using the Move activity?

The original PDF documents in the same folder will be read, and the relevant text will be extracted to Excel (this part is already implemented).

Thank you!

Regards,
Matt

What will the file names be for the scanned and the original documents?

Thanks
@mc00476004

Hi @hasib08

The name can be anything, so we cannot differentiate the documents by file name.

Thank you!

I was able to figure out a workaround for this.

I used a GetFulltext activity to read each document. For a scanned PDF the result is null or empty, so I check every document for this null/empty condition and move the ones that match to a different folder for further processing.
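
The check described above can be sketched outside UiPath as follows. This is only an illustration in Python, not the actual workflow: the function name `move_textless_pdfs` and the `extract_text` callback are hypothetical, and the callback stands in for whatever reads the text of a PDF (the GetFulltext activity in the workflow).

```python
import shutil
from pathlib import Path
from typing import Callable


def move_textless_pdfs(
    source_dir: Path,
    scanned_dir: Path,
    extract_text: Callable[[Path], str],
) -> list[Path]:
    """Move every PDF whose extracted text is null, empty or whitespace-only.

    extract_text is a hypothetical callback supplied by the caller; this
    sketch does not read PDF content itself.
    """
    scanned_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materialises the file list before anything is moved.
    # Note: "*.pdf" is case-sensitive on some platforms.
    for pdf in sorted(source_dir.glob("*.pdf")):
        text = extract_text(pdf)
        if not (text or "").strip():
            target = scanned_dir / pdf.name
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

Whitespace-only output is treated as empty here, which is an assumption on top of the null/empty check described in the post.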

Not sure if this is a viable approach to this problem.

@mc00476004

Please be aware that this is not a reliable solution, as you might have mixed PDFs: one page is native PDF and contains some text, while the following pages are scanned. That method will not detect such a document, because its text output is not empty.

Depending on what you want to do with them, and if you need this specific information (whether a particular page is scanned or not), I would rather use the Digitize Document activity with a free OCR engine (like Microsoft). The DocumentObjectModel output of the Digitize Document activity contains an array of Pages, and each page has a ProcessingSource which can be either Ocr or Pdf. If it is Ocr - that page was passed through an OCR engine for processing. If it is Pdf - that page was native and read directly.
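
The per-page logic described above can be sketched as follows. This is an illustration in Python only, and it does not touch the UiPath object model: it assumes the ProcessingSource of each page has already been collected as a string ("Ocr" or "Pdf", as named above), and the function name `classify_document` and the labels it returns are hypothetical.

```python
from collections import Counter
from typing import Iterable


def classify_document(page_sources: Iterable[str]) -> str:
    """Label a document from the ProcessingSource value of each page.

    Returns "mixed" if both Ocr and Pdf pages are present, "scanned" if
    only Ocr pages are, "native" if only Pdf pages are, and "unknown"
    for an empty list or any other values.
    """
    counts = Counter(source.strip().lower() for source in page_sources)
    has_ocr = counts["ocr"] > 0
    has_pdf = counts["pdf"] > 0
    if has_ocr and has_pdf:
        return "mixed"
    if has_ocr:
        return "scanned"
    if has_pdf:
        return "native"
    return "unknown"


print(classify_document(["Pdf", "Ocr", "Ocr"]))
```

A document labelled "mixed" is exactly the case that a single empty-text check on the whole file would miss.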

You can also use the text output of the Digitize Document activity, if you need it. Keep in mind that the activity performs OCR only when it cannot read text directly from a given page; otherwise it does not call the OCR engine you drag and drop inside it.

What is the exact use case you need to cover?

True! I have mixed PDFs, both scanned and native, and also tagged and untagged. I tried Tesseract OCR before and it gave me inaccurate results (spelling mistakes, incorrect numbers, etc.), so I decided not to use OCR.

I will try the Digitize Document activity and will ask for your help if I get stuck.

Thank you for your help!

You might want to try out other OCR engines as well (the Digitize Document activity requires one, in case it needs to use it). Free options besides Tesseract OCR are:

  • Microsoft OCR (available by default in the UiAutomation package)
  • OmniPage OCR (available in the UiPath.OmniPage.Activities package)

Depending on the quality of the documents, I recommend you play around with them (do try out Profile.None or Profile.Scan for OCR settings, as one of these might work better in your case) and based on your use case decide which works best for you.

Hi Ioana,

Suppose there are multiple PDFs in a location (native as well as scanned, but not mixed), and we want to select only the scanned files, read them with OCR, and make them searchable PDFs.

Is there a way to read the metadata of each file (without actually opening the file) and check whether it is blank or filled, to decide whether a given document is native or scanned?
And after the data is extracted from a scanned file, can it be written back to the PDF's metadata to make the file searchable?

What solution would you suggest?

I have the same question. Can anyone please help?

Hi @Sarah_Jamaal @mc00476004,

We have figured out a solution for this:

  1. Read the PDF with OCR.
  2. Save the data extracted by this activity.
  3. Use an Invoke Code activity.
  4. Write the C# code below to place the data extracted from the scanned PDF into the PDF's "Keywords" section. Once done, the PDF can be found by searching for the keywords stored in that section.

// Document comes from iTextSharp.text; FileStream and FileMode come from System.IO.
var doc = new Document();
string path = "";
PdfReader reader = new PdfReader(path + "");
PdfStamper stamper = new PdfStamper(reader, new FileStream(path + "", FileMode.Create));
var info = reader.Info;
// pdfText is the variable that holds the data extracted in step 1.
info["Keywords"] = pdfText;
stamper.MoreInfo = info;
stamper.FormFlattening = true;
stamper.Close();
reader.Close();
// Despite its name, insertedWordCount receives the length of the Keywords string in characters.
insertedWordCount = info["Keywords"].Length;

You will also need to import the namespaces iTextSharp.text.pdf and iTextSharp.text.xml.xmp.

Hope this helps.

Regards
Sonali