I have around 100 original (native) PDF documents and a few scanned PDF documents in one folder.
I want to know whether there is an activity that can identify the scanned PDF documents among the others in the folder, so that I can move them into a separate folder using the Move activity.
The original PDF documents in the same folder will be read and the relevant text will be extracted to Excel (this is already implemented).
I have used a GetFulltext activity to read each PDF document; for a scanned document the result is null or empty. I then check every document for this null/empty condition and, based on the outcome, move the scanned ones to a different folder for further processing.
I am not sure whether this is a viable approach to the problem.
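For readers who prefer to see the null/empty check as code, here is a minimal C# sketch of the idea, not the actual workflow. The class name ScannedPdfSorter, the method MoveTextlessPdfs and the extractText delegate are all hypothetical; extractText stands in for whatever reads the text of a PDF (for example the activity used above).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ScannedPdfSorter
{
    // Moves every PDF in sourceDir whose extracted text is null, empty or
    // whitespace into scannedDir and returns the new paths of the moved files.
    // extractText is a hypothetical stand-in for the step that reads PDF text.
    public static List<string> MoveTextlessPdfs(
        string sourceDir, string scannedDir, Func<string, string> extractText)
    {
        Directory.CreateDirectory(scannedDir);
        var moved = new List<string>();

        foreach (string file in Directory.GetFiles(sourceDir, "*.pdf"))
        {
            string text = extractText(file);
            if (!string.IsNullOrWhiteSpace(text))
                continue; // some text was found, so treat the file as native

            string target = Path.Combine(scannedDir, Path.GetFileName(file));
            File.Move(file, target); // throws if the target already exists
            moved.Add(target);
        }

        return moved;
    }
}
```

This only looks at whether any text comes back for the whole file, so it shares the limitation of the original approach: a file with even one native page counts as native.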
Please be aware that this is not necessarily a good solution, as you might have mixed PDFs: one page is native PDF and does contain some text, while the next pages are scanned. You won’t detect that with this method.
Depending on what you want to do with them, and if you need this specific information (whether a particular page is scanned or not), I would rather use the Digitize Document activity with a free OCR engine (like Microsoft). The DocumentObjectModel output of the Digitize Document activity contains an array of Pages, and each page has a ProcessingSource which can be either Ocr or Pdf. If it is Ocr - that page was passed through an OCR engine for processing. If it is Pdf - that page was native and read directly.
You might want to use the text output of the Digitize Document activity as well, if you need it. Keep in mind that the Digitize Document activity only performs OCR when it cannot read text directly from a given page; otherwise it doesn’t call the OCR engine you drag and drop inside it.
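The per-page check described above could be turned into a document-level decision roughly like this. It is only a sketch: PageSource is a local stand-in enum for the ProcessingSource value of each page (it is not the UiPath type), and DocumentKind and PageSourceClassifier are hypothetical names. You would first collect the ProcessingSource of every page from the DocumentObjectModel and map it onto this enum.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the per-page ProcessingSource value (Ocr or Pdf).
public enum PageSource { Pdf, Ocr }

// Hypothetical result type for the whole document.
public enum DocumentKind { Native, Scanned, Mixed, Empty }

public static class PageSourceClassifier
{
    // Classifies a document from the processing source of each of its pages.
    public static DocumentKind Classify(IEnumerable<PageSource> pageSources)
    {
        var sources = pageSources.ToList();
        if (sources.Count == 0)
            return DocumentKind.Empty; // no pages were reported at all

        bool anyOcr = sources.Contains(PageSource.Ocr);
        bool anyPdf = sources.Contains(PageSource.Pdf);

        if (anyOcr && anyPdf)
            return DocumentKind.Mixed; // some pages native, some scanned

        return anyOcr ? DocumentKind.Scanned : DocumentKind.Native;
    }
}
```

With this, a document whose pages are Pdf, Ocr, Ocr comes out as Mixed, which is exactly the case the null/empty text check cannot see.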
True! I have mixed PDFs, both scanned and native, and also tagged and untagged. I tried Tesseract OCR before and it gave me inaccurate results (spelling mistakes, incorrect numbers, etc.), so I decided not to use OCR.
I will try the Digitize Document activity and will ask for your help if I get stuck.
You might want to try out other OCR engines as well (the Digitize Document activity requires one, in case it needs to use it). Free options besides Tesseract OCR are:
Microsoft OCR (available by default in the UiAutomation package)
OmniPage OCR (available in the UiPath.OmniPage.Activities package)
Depending on the quality of the documents, I recommend you play around with them (do try out Profile.None or Profile.Scan for OCR settings, as one of these might work better in your case) and based on your use case decide which works best for you.
Suppose there are multiple PDFs (native as well as scanned, but not mixed) in one location, and we want to select only the scanned files, read them with OCR, and turn them into searchable PDFs.
Is there a way to read the metadata of each file (without actually opening the file) and check whether it is blank or filled, in order to decide whether a document is native or scanned?
And, after the data has been extracted from a scanned file, can it be written back into the PDF’s metadata to make the file searchable?
Use the C# code below to place the text extracted from the scanned PDF into the PDF’s “Keywords” metadata field. Once done, the PDF can be found by searching for the keywords stored in that field.
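// pdfText (input) and insertedWordCount (output) are variables declared outside this snippet.
string path = "";
// The input and output file names should point to different files.
PdfReader reader = new PdfReader(path + "");
PdfStamper stamper = new PdfStamper(reader, new FileStream(path + "", FileMode.Create));
var info = reader.Info; // existing document metadata
info["Keywords"] = pdfText; // pdfText holds the text extracted in step 1
stamper.MoreInfo = info;
stamper.FormFlattening = true;
stamper.Close(); // writes the output file
reader.Close();
// Note: Length counts characters, not words.
insertedWordCount = info["Keywords"].Length;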
Also, you will need to import the namespaces iTextSharp.text.pdf and iTextSharp.text.xml.xmp, as well as System.IO for FileStream.