Convert PDF to Text, send to array

Thomas_Marzol · July 10, 2019, 9:31pm

Hello, I hope this question is simple.

My goal is to convert PDFs to text using Google OCR. I would like each page to be one long string and also its own item in an array, for scraping purposes.

Current setup:
I have a Read PDF With OCR activity. It reads my multi-page PDF and writes the result to a string variable called strTrial.

Then I have arrTextFile, a string array of undefined size.

arrTextFile = strTrial.Split(something something something)

The something part is what I am stuck on. I don’t know how to break up a PDF. In theory it’s very simple: each page gets converted to text, then added as an item in the array.

I hope I have explained my problem well and that someone can help.

Thank you in advance.

Vashisht · July 11, 2019, 4:34am

Hi @Thomas_Marzol,
For the "something" part, you can split on a delimiter that is present in the PDF text, such as : , ; or @.
1. First check whether the text contains any delimiter.
2. If there is no delimiter, you can add one to your text using string operations, and then use the split operation to get the required output.
If you find this useful, mark it as the solution and close the thread.
Cheers
Vashisht.
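
The two steps above can be sketched roughly as follows. This is only an illustration in Python, not UiPath code. It assumes the OCR text contains a page marker; the form-feed character used here is a hypothetical choice, and real OCR output may contain no such marker at all. The names `PAGE_MARKER` and `split_pages` are made up for this example.

```python
# Assumption: pages in the OCR text are separated by a known marker.
# A form feed is used here purely for illustration.
PAGE_MARKER = "\f"


def split_pages(text, marker=PAGE_MARKER):
    """Split OCR text into pages, each flattened into one long string."""
    pages = []
    for chunk in text.split(marker):
        # Collapse line breaks and repeated spaces into single spaces.
        one_line = " ".join(chunk.split())
        if one_line:  # skip empty chunks, e.g. after a trailing marker
            pages.append(one_line)
    return pages


sample = "Invoice 001\nTotal: 10\fInvoice 002\nTotal: 20"
pages = split_pages(sample)
print(pages)
```

If the text has no reliable marker, splitting will not recover the page boundaries, which is why reading the PDF one page at a time is often the safer route.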

vickydas · July 11, 2019, 7:39am

Hello @Thomas_Marzol

If you want each whole page to be one item, you can read one page at a time and add its text to a list variable, then read the next page, and so on until the end, and finally convert the list to an array. In this case you won't need to split anything to get a particular page, because the data is collected page by page. You can just use the index of the variable to get that particular page.
Check this workflow for better understanding
PDF Array.xaml (8.5 KB)
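
For readers who cannot open the attachment, the page-by-page idea described above might look like the sketch below. It is written in Python only to show the logic and is not a copy of the attached workflow. The `read_page` callable is a hypothetical stand-in for whatever reads a single page (for example, an OCR read limited to one page), and the page count is assumed to be known in advance.

```python
from typing import Callable, List


def collect_pages(read_page: Callable[[int], str], page_count: int) -> List[str]:
    """Read pages 1..page_count one at a time and return them as a list."""
    pages = []
    page_number = 1
    while page_number <= page_count:
        text = read_page(page_number)
        # Flatten each page into one long string before storing it.
        pages.append(" ".join(text.split()))
        page_number += 1
    return pages


# Hypothetical stand-in for a real PDF: page number -> OCR text.
fake_pdf = {1: "First page\ntext", 2: "Second page\ntext", 3: "Third page"}
all_pages = collect_pages(lambda n: fake_pdf[n], len(fake_pdf))

# Index 1 holds the second page, because indexes start at 0.
print(all_pages[1])
```

Because every page is stored as its own item, no delimiter is needed, and a particular page is retrieved by its index alone.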

Thomas_Marzol · July 11, 2019, 12:48pm

Hello @vickydas, thank you! This is hugely helpful. Just one thing: what packages do you have installed to run this? The first activity inside the sequence, within the while loop, is reported as missing a package. I have all the common UiPath activities installed and have tried to troubleshoot it myself without success.

vickydas · July 11, 2019, 12:51pm

Hello @Thomas_Marzol
The missing package is the PDF activities package.

Because you downloaded this workflow, you will have to add the PDF package to that project again.

Thomas_Marzol · July 11, 2019, 1:33pm

@vickydas Apologies, still running into an error. Very confused! I have the appropriate packages and I suspect this to be some type of bug. Still would deeply appreciate your help if you have time. See attached screenshots of the error compared to my packages.

vickydas · July 12, 2019, 5:02am

Hello @Thomas_Marzol
That missing activity is the OCR engine that goes inside the Read PDF With OCR activity.
In the workflow I used Microsoft OCR, so the Read PDF With OCR activity should contain the Microsoft OCR engine.

Thomas_Marzol · July 12, 2019, 11:52am

@vickydas Thank you! Very helpful. Hopefully this is the last question, but how were you able to declare ListofAll as a variable of type List? I have browsed through the variable types for a while now and cannot find that as a choice, or anything similar to it. Sorry, I am a huge newbie at this.

Thomas_Marzol · July 12, 2019, 11:58am

@vickydas Also, out of curiosity, how did you configure the settings of the engine? Is there an optimal choice? I have the result written to the pdfData variable, but it is throwing an error.

vickydas · July 12, 2019, 1:01pm

Hello @Thomas_Marzol
Click on Browse for Types in the variable type dropdown.

In the search bar, type System.Collections.Generic.List
Also select the type you want the list to hold. I have indicated that place with a red arrow in the screenshot.
I chose the default settings for that engine, but to learn more about OCR and the engines, watch this link to get a basic idea of how OCR is used in Citrix.
