Scanned document converted into PDF vs Original PDF document

mc00476004 · January 17, 2020, 1:01am

Hi team,

Need your help!

I have around 100 original PDF docs and a few scanned PDF docs placed in a folder.

I wanted to know if there is any way to identify the scanned PDF documents among the other documents in the folder using any activity, and to move them into a separate folder using the Move activity.

The original PDF documents from the same folder will be read and the relevant text extracted to Excel (this is already implemented).

Thank you!

Regards,
Matt

hasib08 · January 17, 2020, 5:08am

What will the file names be for the scanned and original documents?

Thanks
@mc00476004

mc00476004 · January 17, 2020, 5:48am

Hi @hasib08

The name can be anything; we cannot differentiate them by file name.

Thank you!

mc00476004 · January 17, 2020, 7:21am

I was able to figure out a workaround for this.

I used a GetFulltext activity to read each document. For a scanned PDF the result is null or empty, so I check every document for this null/empty condition and move the ones that match to a different folder for further processing.

Not sure if this is a viable approach to this problem.
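
The workaround above can be sketched roughly as follows, for anyone who wants to do the same check in code. This is only a sketch: ScannedPdfSorter, MoveFilesWithoutText and the extractText delegate are hypothetical names, and extractText stands in for whatever returns the text layer of a PDF (in the workflow, the GetFulltext step).

```csharp
using System;
using System.IO;

static class ScannedPdfSorter
{
    // Moves every PDF whose extracted text is null, empty or whitespace
    // into scannedDir and returns the number of files moved.
    // extractText is a hypothetical stand-in for the text-reading step.
    public static int MoveFilesWithoutText(
        string sourceDir, string scannedDir, Func<string, string> extractText)
    {
        Directory.CreateDirectory(scannedDir);
        int moved = 0;

        // GetFiles returns a snapshot array, so moving files while looping is safe.
        foreach (string file in Directory.GetFiles(sourceDir, "*.pdf"))
        {
            string text = extractText(file);
            if (string.IsNullOrWhiteSpace(text))
            {
                // File.Move throws if a file with the same name already exists.
                File.Move(file, Path.Combine(scannedDir, Path.GetFileName(file)));
                moved++;
            }
        }
        return moved;
    }
}
```

Note that this treats a document as scanned only when no text at all comes back, so a file with even one native page would stay in the original folder.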

Ioana_Gligan · January 21, 2020, 2:20pm

@mc00476004

Please be aware that this is not necessarily a good solution, as you might have mixed PDFs: one page is native PDF and contains some text, while the following pages are scanned. You won’t detect that with this method.

Depending on what you want to do with them, and if you need this specific information (whether a particular page is scanned or not), I would rather use the Digitize Document activity with a free OCR engine (like Microsoft). The DocumentObjectModel output of the Digitize Document activity contains an array of Pages, and each page has a ProcessingSource which can be either Ocr or Pdf. If it is Ocr - that page was passed through an OCR engine for processing. If it is Pdf - that page was native and read directly.
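
To illustrate the per-page idea, here is a minimal sketch of how a document could be labelled once the source of each page is known. The PageSource and DocumentKind enums and the PdfClassifier class are hypothetical stand-ins written for this example, not UiPath types; in a workflow you would first map each page's ProcessingSource value onto something like PageSource.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mirror of the two per-page values described above.
enum PageSource { Pdf, Ocr }

enum DocumentKind { Native, Scanned, Mixed }

static class PdfClassifier
{
    // Native: every page was read directly.
    // Scanned: every page went through OCR.
    // Mixed: at least one page of each kind.
    public static DocumentKind Classify(IReadOnlyCollection<PageSource> pages)
    {
        if (pages.Count == 0)
            throw new ArgumentException("Document has no pages.", nameof(pages));

        bool anyOcr = pages.Contains(PageSource.Ocr);
        bool anyPdf = pages.Contains(PageSource.Pdf);

        if (anyOcr && anyPdf) return DocumentKind.Mixed;
        return anyOcr ? DocumentKind.Scanned : DocumentKind.Native;
    }
}
```

For example, a three-page file whose first page is native and whose other two pages needed OCR comes out as Mixed, which is exactly the case a whole-document empty-text check would miss.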

You might also want to use the text output of the Digitize Document activity, if you need it. Note that the activity only performs OCR when it cannot read text directly from a given page; otherwise it does not call the OCR engine you drag and drop inside it.

What is the exact use case you need to cover?

mc00476004 · January 22, 2020, 12:59am

True! I have mixed PDFs, both scanned and native, and also tagged and untagged. I tried Tesseract OCR before and it gave me inaccurate results (spelling mistakes, incorrect numbers, etc.), so I decided not to use OCR.

I will try the Digitize Document activity and will ask for your help if I get stuck.

Thank you for your help!

Ioana_Gligan · January 22, 2020, 7:08am

You might want to try out other OCR engines as well (the Digitize Document activity requires one, in case it needs to use it). Free options besides Tesseract OCR are:

Microsoft OCR (available by default in the UiAutomation package)
OmniPage OCR (available in the UiPath.OmniPage.Activities package)

Depending on the quality of the documents, I recommend you play around with them (do try out Profile.None or Profile.Scan for OCR settings, as one of these might work better in your case) and based on your use case decide which works best for you.

sonaliaggarwal47 · February 25, 2021, 7:41pm

Hi Ioana,

Suppose there are multiple PDFs (native as well as scanned, but not mixed) in a location, and we want to select only the scanned files, read them with OCR, and make them searchable PDFs.

Is there a way to read the metadata of each file (without actually opening it) and check whether it is blank or filled, to decide whether a document is native or scanned?
And after the data is extracted from a scanned file, can it be written back to the PDF’s metadata to make it searchable?

What solution would you suggest?

Sarah_Jamaal · March 31, 2021, 2:47pm

I have the same question. Can anyone please help?

sonaliaggarwal47 · April 19, 2021, 6:18pm

Hi @Sarah_Jamaal @mc00476004 ,

We have figured out a solution for this:

1. Read the PDF with OCR.
2. Save the data extracted by this activity.
3. Use an Invoke Code activity.
4. Write the C# code below to place the text extracted from the scanned PDF into the PDF’s “Keywords” field. Once done, the PDF becomes searchable by the keywords stored in that field.

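// pdfText (input) holds the text extracted in step 1;
// insertedWordCount (output) receives the character length of the stored text.
string path = "";
PdfReader reader = new PdfReader(path + "");   // opens the source PDF
PdfStamper stamper = new PdfStamper(reader, new FileStream(path + "", FileMode.Create));   // writes the output PDF
var info = reader.Info;
info["Keywords"] = pdfText;   // store the extracted text in the Keywords metadata entry
stamper.MoreInfo = info;
stamper.FormFlattening = true;
stamper.Close();   // finishes writing the output file
reader.Close();
insertedWordCount = info["Keywords"].Length;

Also, you will need to import the namespaces iTextSharp.text.pdf and iTextSharp.text.xml.xmp (FileStream and FileMode come from System.IO).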

Hope this helps.

Regards
Sonali
