Convert PDF to Text, send to array

Hello, I hope this question is simple.

My goal is to convert PDFs to text using Google OCR. I would like each page to be one long string and also its own item in an array, for scraping purposes.

Current setup:
I have a Read PDF With OCR activity. It reads my multi-page PDF and writes the result to a String variable called strTrial.

Then I have arrTextFile, a string array of undefined size.

arrTextFile = strTrial.Split(something something something)

The "something" part is what I am stuck on. I don’t know how to break up a PDF. In theory it’s very simple: each page gets converted to text, then added as an item in the array.

I hope I have explained my problem well and that someone can help.

Thank you in advance.

Hi @Thomas_Marzol,
For the "something" part, you can split on a delimiter that is present in the PDF text, such as : , ; or @.
1. First check whether the text contains any delimiter.
2. If there is no delimiter, add one to the text using string operations, then use the Split operation to get the required output.
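
The two steps above can be sketched as follows. This is written in Python purely to illustrate the logic; in UiPath the same idea would be a String.Split expression. The marker `<<PAGE_BREAK>>` and the helper names `join_pages` and `split_pages` are hypothetical, and the OCR output may not contain any usable page delimiter on its own, which is why the marker is added explicitly here.

```python
# Hypothetical page marker: pick something that never occurs in the real text.
PAGE_MARKER = "<<PAGE_BREAK>>"


def join_pages(pages):
    """Step 2: add a delimiter between the per-page strings."""
    return PAGE_MARKER.join(pages)


def split_pages(text, marker=PAGE_MARKER):
    """Split the combined string back into one item per page."""
    # Collapse line breaks and repeated spaces so each page is one long string.
    return [" ".join(part.split()) for part in text.split(marker)]


combined = join_pages(["Page one\nline two", "Page two text"])
pages = split_pages(combined)
print(pages)
```

If the text already contains a reliable delimiter, only `split_pages` is needed, with that delimiter passed as `marker`.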
If you find this useful, mark it as the solution and close the thread.
Cheers
Vashisht.

Hello @Thomas_Marzol

If you want each whole page to be an item, you can read one page, add it to a list variable, then read the next page, and so on until the end. After that, convert the list to an array. This way you won’t need to split anything to get a particular page, because the data is collected page by page; you can just use the index of the variable to get the page you want.
Check this workflow for a better understanding:
PDF Array.xaml (8.5 KB)
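
As a rough illustration of the page-by-page idea (not the contents of the attached workflow), here is a minimal Python sketch. `read_page` is a hypothetical stand-in for an OCR read of a single page, and `fake_pdf` is made-up sample data; in UiPath the read would be done by the OCR activity inside the While loop.

```python
def read_page(pdf_pages, page_number):
    """Hypothetical stand-in for an OCR read of one page (1-based numbering)."""
    return pdf_pages[page_number - 1]


def collect_pages(pdf_pages, page_count):
    """Read every page in turn and collect one string per page."""
    list_of_all = []            # grows by one item per page
    page = 1
    while page <= page_count:   # loop until the last page has been read
        list_of_all.append(read_page(pdf_pages, page))
        page += 1
    return list_of_all


fake_pdf = ["text of page 1", "text of page 2", "text of page 3"]
pages = collect_pages(fake_pdf, len(fake_pdf))
print(pages[1])  # index 1 is the second page
```

Because each page is stored as its own item, no Split call is needed; indexing the collection returns the page directly.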

Hello @vickydas, thank you! This is hugely helpful. Just one thing: what packages do you have installed to run this? For me, the first activity inside the Sequence within the While loop says it is missing a package. I have all the common UiPath activity packages installed and have tried troubleshooting it myself without success.

Hello @Thomas_Marzol
The missing package is PDF.
Because you downloaded this workflow, you’ll have to add the PDF package to it again.

@vickydas Apologies, still running into an error. Very confused! I have the appropriate packages and I suspect this to be some type of bug. Still would deeply appreciate your help if you have time. See attached screenshots of the error compared to my packages.


[screenshot: error]

Hello @Thomas_Marzol
The missing activity is the OCR engine used with the Read PDF With OCR activity.
[screenshot: Engine]
In the workflow I used Microsoft OCR. The activity should look like this:
[screenshot: Engine]

@vickydas Thank you! Very helpful. Hopefully this is the last question, but how were you able to set ListofAll to the variable type List? I have browsed the variable types for a while now and cannot find that as a choice, or anything similar to it. Sorry, I am a huge newbie at this.

@vickydas Also, out of curiosity, how did you configure the settings of the engine? Is there an optimal choice? I have the result written to the pdfData variable, but it is throwing an error.

Hello @Thomas_Marzol
Click on Browse for Types in the variable type dropdown:
[screenshot]
In the search bar, type System.Collections.Generic.List
Also select the type you want the list to hold; I have indicated that place with a red arrow.
[screenshot]
I chose the default settings for that engine. To learn more about OCR and engines, watch this link for a basic idea of how OCR is used in Citrix.