I have a folder named uipath at C:\uipath. I used string[] pdfList = Directory.GetFiles("C:\uipath") to get the PDF files, then a For Each activity to loop over them.
The PDF files contain text as images, so I used the Tesseract OCR activity to extract the text. Now I want to extract specific data from that text.
Later I also need to extract specific data (such as name, address, designation) from another PDF.
All the data in the PDF files is in image format.
Finally, I need to write this data to a CSV file. I’m stuck in the middle of this. How should I do this process?
You can use string manipulation operations (for example Substring, Split, IndexOf) on the OCR output to extract specific values.
Or
To get values from an image, you can also use the AI.ComputerVision activities.
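For what it's worth, here is a rough sketch of the string manipulation route, written in Python only to show the logic. It assumes the OCR text contains lines shaped like "Name: John Smith", which the thread does not confirm; the sample text, the field labels, and the helper names extract_fields and write_csv are all made up for illustration. In UiPath the same idea would go into Assign activities or an Invoke Code activity using the equivalent .NET string and regex methods.

```python
import csv
import io
import re

# Hypothetical OCR output; real Tesseract text will differ and may contain noise.
SAMPLE_TEXT = """Name: John Smith
Address: 12 High Street, Springfield
Designation: Senior Analyst
"""

# Labels assumed to appear in the OCR text (an assumption, not from the thread).
FIELDS = ["Name", "Address", "Designation"]


def extract_fields(ocr_text, fields=FIELDS):
    """Return {label: value} for each 'Label: value' line found in the text."""
    result = {}
    for label in fields:
        # Match the label at the start of a line, ignore case, and allow
        # spaces or tabs around the colon. Missing labels give an empty string.
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*(.+)$"
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE | re.MULTILINE)
        result[label] = match.group(1).strip() if match else ""
    return result


def write_csv(rows, out, fields=FIELDS):
    """Write one CSV row per extracted dictionary, with a header line."""
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # One row per PDF; here a single sample stands in for the OCR loop.
    rows = [extract_fields(SAMPLE_TEXT)]
    buffer = io.StringIO()  # swap for open("output.csv", "w", newline="") to write a file
    write_csv(rows, buffer)
    print(buffer.getvalue())
```

Collecting one dictionary per PDF inside the loop and writing them all at the end keeps the CSV header in one place. OCR noise (a misread colon, a label split across lines) will break a pattern like this, so the extracted values are worth checking before they are trusted.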
Did I say that he wouldn't be using OCR for scraping, though? I just said that the Read PDF with OCR activity is not always the best choice if you only need small pieces of structured PDF files. Computer Vision, for example, does use OCR behind the scenes for things that can't use selectors, but it is a lot easier to use…
@bcorrea but how do you know where to scrape on the screen? Unless you know the page number, I’m not sure how that would help. And if you know the page number, then you can specify just that page in the properties of the Read PDF activity. As a general rule, at my company we still aren’t using OCR due to accuracy issues in critical applications, so I don’t have a ton of practice with it and could definitely be wrong.
Usually people dealing with image PDFs have single pages or structured documents, so they know where things are. But yes, the process would have to navigate the PDF so the content is visible before scraping… Computer Vision is quite new; give it a go when you can, it is fun to see what it does.