I am working on a project in which I need to read a series of documents scanned in PDF format. The problem is that a file can sometimes arrive in the wrong orientation (either sideways or upside down).
Is there a way to determine its orientation so that the robot can read it correctly with OCR?
Hi @beesheep,
I read the file with the Read PDF with OCR activity.
Certain words, such as "name:", can be identified, but the file is not always the same: there are two types of file, or the file may be neither of the two, in which case I discard it. All of them are scanned.
I have to verify whether the file is one of the two types and extract certain data such as name, company, etc.
If there is any other way to identify the file (type 1 or 2), determine the orientation, and extract the data according to the file type, it would be very helpful.
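One possible way to tell the two file types apart is to look for marker words in the OCR text, as suggested by the "name:" example above. The following is only a minimal Python sketch of that idea, not a tested solution for these files: the marker lists in `DOC_TYPE_MARKERS`, the `min_hits` threshold, and the function names are all hypothetical and would need to be replaced with labels that really appear on each document type.

```python
import re

# Hypothetical marker phrases for each layout; replace them with labels
# that actually appear on the two document types.
DOC_TYPE_MARKERS = {
    1: ["name:", "company:"],
    2: ["full name", "employer"],
}


def normalize(text):
    """Lowercase and collapse whitespace so OCR spacing noise matters less."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def classify_document(ocr_text, min_hits=2):
    """Return the type with the most marker hits, or None to discard the file."""
    text = normalize(ocr_text)
    best_type, best_hits = None, 0
    for doc_type, markers in DOC_TYPE_MARKERS.items():
        hits = sum(1 for marker in markers if marker in text)
        if hits > best_hits:
            best_type, best_hits = doc_type, hits
    return best_type if best_hits >= min_hits else None


def extract_field(ocr_text, label):
    """Return the text that follows 'label:' on the same line, or None."""
    pattern = rf"{re.escape(label)}[ \t]*:[ \t]*(.+)"
    match = re.search(pattern, ocr_text, re.IGNORECASE)
    return match.group(1).strip() if match else None
```

A file for which `classify_document` returns `None` would be discarded. Text read from a rotated page usually comes out garbled, so a low hit count can also be a hint that the page orientation is wrong.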
Hi @askPWC, you can use the PDFSharp library to accomplish that. I used the code below to convert a .TIF to .PDF. You can easily modify it to determine the page orientation.
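' Namespaces required (in UiPath, add these through the workflow's Imports panel)
Imports System.Drawing.Imaging   ' FrameDimension
Imports PdfSharp                 ' PageOrientation
Imports PdfSharp.Pdf             ' PdfDocument, PdfPage
Imports PdfSharp.Drawing         ' XImage, XGraphics

Dim destination As String = in_PDF_Destination
Dim MyImage As System.Drawing.Image = System.Drawing.Image.FromFile(in_TIF_Location)
Dim doc As PdfDocument = New PdfDocument()
' A multi-page TIF stores each page as a separate frame
For PageIndex As Integer = 0 To MyImage.GetFrameCount(FrameDimension.Page) - 1
    MyImage.SelectActiveFrame(FrameDimension.Page, PageIndex)
    Dim img As PdfSharp.Drawing.XImage = PdfSharp.Drawing.XImage.FromGdiPlusImage(MyImage)
    Dim page As PdfPage = New PdfPage()
    ' A frame wider than it is tall is treated as landscape
    If img.Width > img.Height Then
        page.Orientation = PageOrientation.Landscape
    Else
        page.Orientation = PageOrientation.Portrait
    End If
    doc.Pages.Add(page)
    ' Draw the current frame at the top-left corner of the new page
    Dim xgr As XGraphics = XGraphics.FromPdfPage(doc.Pages(PageIndex))
    xgr.DrawImage(img, 0, 0)
    xgr.Dispose()
Next
doc.Save(destination)
doc.Close()
MyImage.Dispose()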
Hi @beesheep
The file is downloaded directly from the company's website; it is scanned by a third-party provider.
The robot must access the site, download the file, and perform the corresponding validations and extractions depending on the type of file. Because the file is scanned by a third party, it can arrive inverted or sideways, and that is the problem.
@askPWC The code will read a .TIF file (you can replace with .pdf) and get the number of pages in the file. Then, for each page, it will determine the page orientation. Once the orientation is determined, it will take the .TIF page and draw the image onto a .PDF page. This is helpful for you because you are dealing with scanned images, not readable .pdf's. After it saves all the pages to the destination variable, it ends.
Is it possible to upload examples of what you need done?
Because it is a company project, I do not have permission to publish the files, but it is simply a scanned PDF file, which can contain several pages.
The issue is that there are cases where, for example, a page comes in portrait format but is inverted, that is, the text is upside down. I just need to detect whether a page is inverted, sideways, or correctly oriented.
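A width/height comparison can only separate portrait from landscape, so it cannot spot an upside-down page. One option worth trying is an OCR engine with orientation and script detection (OSD). As an assumption not confirmed anywhere in this thread, the Python sketch below supposes that Tesseract's OSD report is available as text, for example from `tesseract page.png stdout --psm 0`, and parses its `Rotate:` and `Orientation confidence:` lines. The function names, the labels, and the `min_confidence` value are hypothetical.

```python
import re


def parse_osd(osd_text):
    """Read the suggested rotation (degrees) and confidence from OSD text."""
    rotate = re.search(r"Rotate:\s*(\d+)", osd_text)
    if rotate is None:
        raise ValueError("no 'Rotate:' line found in OSD output")
    conf = re.search(r"Orientation confidence:\s*([\d.]+)", osd_text)
    confidence = float(conf.group(1)) if conf else 0.0
    return int(rotate.group(1)) % 360, confidence


def describe_orientation(rotate_degrees):
    """Map the suggested rotation to the three cases asked about."""
    if rotate_degrees == 0:
        return "correct"
    if rotate_degrees == 180:
        return "inverted"
    if rotate_degrees in (90, 270):
        return "sideways"
    return "unknown"


def check_page(osd_text, min_confidence=2.0):
    """Return (label, rotation); label is 'uncertain' below the threshold."""
    rotation, confidence = parse_osd(osd_text)
    if confidence < min_confidence:  # hypothetical cut-off, tune on real scans
        return "uncertain", rotation
    return describe_orientation(rotation), rotation
```

A page reported as "inverted" or "sideways" would then be rotated by the returned angle before running the normal OCR read. Pages with very little text tend to give low confidence, so it seems safer to route "uncertain" results to manual review.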
Hi Bradsterling,
I came across your code and tried using it, but I am getting errors such as PDFDocument not being defined. Can you please let me know which libraries I need to import?