Handling landscape view in a PDF and extracting data from a PDF

Hi,

It would be great if you could help me out with the scenarios below.

  1. Is there a way to check whether a PDF page is in portrait or landscape view? In my PDF, some pages are portrait and some are landscape, and I need to read the text using OCR. Any suggestions?
  2. I tried to extract text from a structured PDF document. I need the text from all the pages, both tabular and non-tabular. Below are the options I tried, but none of them helped. Let me know if this can be achieved some other way.
    2a) “Read PDF with OCR” (tried both with and without the inverted option): returns an empty result.
    2b) Read PDF Text: the output is empty.
    2c) Scraping helps, but how do we know the number of pages, and how do we extract text from all the pages?
  3. I am trying to extract text from a PDF and then move the file to another folder, but I get “The process cannot access the file because it is being used by another process”. How do we resolve this? The document is not open anywhere else.

Hi,

I am looking for answers to this as well. Were you able to figure something out?
Thanks

Below are the solutions that I used:

  1. UiPath does not have the intelligence to check whether a page is in portrait or landscape mode. However, some of the OCR engines auto-rotate the pages to extract the data.
  2. Read PDF with OCR finally worked.
  3. I copied the file to the destination folder and then deleted the original file after processing.
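
As a rough sketch of the workaround in point 3, here is the same copy-first, delete-later idea written in Python purely for illustration (it is not a UiPath workflow). The function name `copy_process_delete`, the `process` callback, and the retry settings are all hypothetical; the only assumption is that whatever reads the file closes its handle before the delete is attempted.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def copy_process_delete(
    source: Path,
    destination_dir: Path,
    process: Callable[[Path], None],
    retries: int = 3,
    delay_seconds: float = 0.5,
) -> Path:
    """Copy the file to the destination, process the original, then delete it."""
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name

    # Copy instead of move, so the destination does not depend on the lock.
    shutil.copy2(source, target)

    # The callback is expected to close any handle it opens (e.g. via "with").
    process(source)

    # Delete the original; retry briefly if something still holds the file.
    for attempt in range(retries):
        try:
            source.unlink()
            break
        except PermissionError:
            if attempt == retries - 1:
                raise
            time.sleep(delay_seconds)
    return target
```

Copying before processing means the "file in use" error can only affect the clean-up step, which can be retried or postponed without losing the extracted data.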

Can you please share which OCR engine supports auto-rotation?

Hello,
Can you please explain which OCR you used to extract data from the landscape view?
I am stuck in the same situation. Can you please help me out, @lissynikkytha?
I also used Read PDF with OCR. It gives the correct result for all the pages except the rotated one.

Try ABBYY OCR.

I assume you extract the PDF with OCR page by page. I suggest adding more logic: rotate the page automatically until the extracted data is readable.

Hope it works.
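
To make the idea concrete, here is a minimal Python sketch of "rotate until readable". It is only an illustration, not a UiPath activity: the `ocr` callable (which takes a page and a rotation angle and returns text) is a placeholder for whatever engine you use, and `readability_score` with its 0.6 threshold is a crude heuristic I made up that only suits Latin-script text.

```python
import re
from typing import Callable, Tuple

# A token counts as "word-like" if it is a plain word or a number.
_WORDLIKE = re.compile(r"[A-Za-z]{2,}[.,;:]?|\d+(?:[.,]\d+)*")


def readability_score(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for token in tokens if _WORDLIKE.fullmatch(token))
    return wordlike / len(tokens)


def ocr_with_rotation(
    page: object,
    ocr: Callable[[object, int], str],
    threshold: float = 0.6,
) -> Tuple[int, str]:
    """Try each rotation and return (angle, text) for the most readable result."""
    best_angle, best_text, best_score = 0, "", -1.0
    for angle in (0, 90, 180, 270):
        text = ocr(page, angle)  # hypothetical: OCR the page rotated by `angle`
        score = readability_score(text)
        if score > best_score:
            best_angle, best_text, best_score = angle, text, score
        if score >= threshold:
            break  # readable enough, no need to try further rotations
    return best_angle, best_text
```

If no rotation reaches the threshold, the best-scoring attempt is still returned, so the caller can decide whether to accept it or flag the page for manual review.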

Hello @lissynikkytha, @Disha_Jain, @arvind8pandey, @Priya_Pandey, and @irahmat,

  1. To get the page rotation and skew angle, please use the Digitize Document activity from the IntelligentOCR 3 activity package. It exposes this information on a page-by-page basis in the DocumentObjectModel output. Feel free to navigate through the output (you can do this directly in Studio using the newest debug features) to see where to get that information.
  2. For data extraction, I recommend building your own custom activity or trying the newly released Regex Based Extractor. It applies whatever regular expressions you configure for certain fields to the text version of the document fed into the Data Extraction Scope.
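
For anyone unfamiliar with the per-field regex approach from point 2, the general idea can be sketched in a few lines of Python. This is not the Regex Based Extractor's implementation, just an illustration: the field names, the patterns in `FIELD_PATTERNS`, and the `extract_fields` helper are all made up for the example.

```python
import re
from typing import Dict, Optional

# Hypothetical field configuration: one regex per field, value in group 1.
FIELD_PATTERNS: Dict[str, str] = {
    "invoice_number": r"Invoice\s*(?:Number|No\.?)\s*[:#]?\s*(\w+)",
    "total": r"Total\s*:?\s*([\d.,]+)",
}


def extract_fields(text: str, patterns: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply each field's regex to the document text; None means no match."""
    results: Dict[str, Optional[str]] = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        results[field] = match.group(1) if match else None
    return results
```

Calling `extract_fields(document_text, FIELD_PATTERNS)` on the plain text of a digitized document returns a dictionary with one entry per configured field, which makes missing fields easy to spot.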

To get you started, you might want to check out this: How to use the IntelligentOCR Package

Your answer was very helpful. Thank you.
I found some information about ABBYY FineReader and FlexiCapture, but I have not yet worked out which product would suit me best.
I have hundreds of PDF files whose pages may be portrait or landscape (in some PDFs the first page is portrait and the other pages are landscape). If you know ABBYY, which product can be used successfully with UiPath?

I found a connector plugin for connecting ABBYY and UiPath, but I think it works only with ABBYY FlexiCapture, not with FineReader. I need to read each entire PDF file and take all the text data into the DOM. FlexiCapture works with fields, whereas FineReader works with the entire PDF. Do you know which one I should use?

Thanks.

It is not working. I tested a few page rotation examples and it shows the following:

Rotation: None
SkewAngle=0

Even though the page is obviously rotated.
Is it possible to change these values in the object model and adjust the page values?