My goal is to convert PDFs to text using Google OCR. I would like each page to be a single long string and also its own item in an array, for scraping purposes.
Current setup:
I have a Read PDF With OCR activity. It reads my multi-page PDF and writes the result to a string variable called strTrial.
Then I have arrTextFile, a string array of undefined size.
The something part is what I am stuck on. I don’t know how to break up a PDF. In theory it’s very simple: each page gets converted to text, then added as an item in the array.
I hope I have explained my problem well and that someone can help.
Hi @Thomas_Marzol,
In that something step you can split the text on a delimiter that is already present in the PDF, such as : , ; or @.
1. First check whether there is any delimiter.
2. If there is no delimiter, add one to your text using string operations, then use the split operation to get the required output.
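The two steps above could look roughly like this. This is only a sketch in Python rather than a UiPath workflow, and the marker string, the sample text, and the helper name `split_on_delimiter` are made up for illustration; the OCR engine does not emit such a marker on its own.

```python
def split_on_delimiter(text: str, delimiter: str) -> list[str]:
    """Split text on delimiter, trimming whitespace and dropping empty pieces."""
    if delimiter not in text:
        # Step 1 found no delimiter, so the whole text stays as a single item.
        return [text.strip()]
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]


# Hypothetical OCR output in which a marker was added between pages (step 2).
PAGE_MARKER = "@@PAGE@@"  # assumed marker, chosen so it cannot appear in real text
sample = "Invoice 001 total 40" + PAGE_MARKER + "Invoice 002 total 75"
pages = split_on_delimiter(sample, PAGE_MARKER)
```

Each element of `pages` is then one chunk of text that can be accessed by index.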
If you find this useful, mark it as the solution and close the thread.
Cheers
Vashisht.
If you want a whole page to be one item, you can read that particular page and add it to a list variable, then read the next page, and so on until the end, and finally convert the list to an array. In this case you won’t need to split anything to get a particular page, because the data is read page by page; you can just use the index to get that page.
Check this workflow for a better understanding: PDF Array.xaml (8.5 KB)
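For anyone who cannot open the workflow, the page-by-page idea can be sketched in Python as below. This is not the content of the attached file; `read_pages` and the `ocr_page` callback are hypothetical stand-ins for the OCR activity running on one page at a time, and the sample text is invented.

```python
from typing import Callable


def read_pages(page_count: int, ocr_page: Callable[[int], str]) -> list[str]:
    """Read pages 1..page_count one at a time and collect one string per page."""
    pages = []
    for page_number in range(1, page_count + 1):
        raw = ocr_page(page_number)
        # Collapse line breaks and repeated whitespace so each page is one long string.
        pages.append(" ".join(raw.split()))
    return pages


# Stand-in for the OCR step: canned text per page instead of a real engine call.
fake_pdf = {1: "First page\nline two", 2: "Second   page\n\ntext"}
all_pages = read_pages(len(fake_pdf), lambda n: fake_pdf[n])
```

Because the pages are collected in order, `all_pages[0]` is page 1, `all_pages[1]` is page 2, and so on, with no splitting needed afterwards.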
Hello @vickydas, thank you! This is hugely helpful. Just one thing: what packages do you have installed to run this? For me, the first brick of code, within the sequence, within the while statement, says it is missing a package. I have all the common UiPath activities installed and have tried unsuccessfully to troubleshoot it myself.
@vickydas Apologies, still running into an error. Very confused! I have the appropriate packages and I suspect this to be some type of bug. Still would deeply appreciate your help if you have time. See attached screenshots of the error compared to my packages.
@vickydas Thank you! Very helpful. Hopefully this is the last question, but how were you able to declare ListofAll with the variable type List? I have browsed around the variable types for a while now and cannot find that as a choice, or anything similar to it. Sorry, I am a huge newbie at this.
@vickydas Also, out of curiosity, how did you configure the settings of the OCR engine? Is there an optimal choice? I have the result written to the pdfData variable, but it is throwing an error.