How to Read Multiple PDF Files and Output Text to Multiple Files

Hello, thank you in advance for the assistance. I am very new to this, so I am still trying to learn and understand as much as I can. I currently have a workflow with a few sequences. One retrieves email attachments from Outlook and saves them into a directory. That is working.
The other sequence is what I am struggling with. It is supposed to access the path where all the PDF files are stored, loop through each file, and OCR it, then save the result into a text file with the same name as the original attachment. (Ideally I would like to store the OCRed data in a table, with columns filled by matching regexes, but right now I am not even sure how to properly loop through each file, OCR it, and save it.) Please assist.

This looks OK. The PDF must contain text for this to extract the contents; if the pages are images, then your activity must be Read PDF With OCR instead.
And if you only need help saving the text file under the same name, just change "output.txt" to item.Replace("pdf","txt")

@hatakora

Here are some steps which will help you

  1. Read all the files within the folder using Directory.GetFiles("Path") and store the result in a variable of type string array
  2. Use a For Each loop to go through all the files; each item is the full path of one file
  3. Then use the Read PDF Text activity if the file has text, or Read PDF With OCR when it has image contents (as you want it as text, I hope you have only text in your files)
  4. When you split each item of the For Each loop using string manipulations, you will get the file name, so you can use that name to save the text file

String manipulations like item.Split("\"c).Last().Split("."c).First() (Windows paths use a backslash as the separator)
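
To make the naming step concrete, here is a small sketch of the same logic in Python (not UiPath code; the folder and file names are invented for illustration). It swaps the extension with a path library instead of splitting on "." by hand, because taking the first piece before a dot cuts the name short when the file name itself contains dots.

```python
from pathlib import PureWindowsPath


def text_path_for(pdf_path: str) -> str:
    """Return the path of a .txt file sitting next to the given PDF."""
    p = PureWindowsPath(pdf_path)
    # with_suffix replaces only the final extension, so other dots in the name survive
    return str(p.with_suffix(".txt"))


def naive_name(pdf_path: str) -> str:
    """Mimic the split approach: last path segment, then the first piece before a dot."""
    return pdf_path.split("\\")[-1].split(".")[0]


# Hypothetical file name with an extra dot in it
sample = r"C:\invoices\PO.00563460.pdf"
print(text_path_for(sample))  # keeps the full name, changes only the extension
print(naive_name(sample))     # loses everything after the first dot
```

In a UiPath expression, the .NET methods Path.GetFileNameWithoutExtension(item) or Path.ChangeExtension(item, ".txt") should give the same kind of result, assuming System.IO is imported in the project.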

@HareeshMR Thanks for the reply. I understand everything you said up until number 4. The way my workflow is set up is pretty much the same as what you described in steps 1 to 3. My problem is: when I loop through the files (see attached screenshot), how do I then output each PDF file that was read into a new .txt file?

@bcorrea Thanks for the reply. I understand everything you said up until number 4. The way my workflow is set up is pretty much the same as what you described in steps 1 to 3. My problem is: when I loop through the files (see attached screenshot), how do I then output each PDF file that was read into a new .txt file?

Read PDF has a text output that you should pass to a variable you created. I assumed that you also called it 'output' and used it to write the text file…

@bcorrea Yes Sir, I did. Ideally the goal here is to output the text read from each PDF file into a data table. I have made some progress and this is what I have now, but I am running into an error. Please see attached, Sir. My workflow and error

Why are you using Add Data Row here, @hatakora?

As in the first screenshot, you are doing everything correctly there. You just need to get the file name from the variable item and then pass it to the Write Text File activity. That's the only issue I see.

That error may occur if you are not using the same DataTable you created in Build Data Table, or maybe because of something we can't see inside the Add Data Row activity.

@HareeshMR I think I am following what you're saying, but my loop is failing. If I test it with one PDF file, it works, but the second I use a "For Each" to check every PDF file in the path, I get the attached error.
If it is not too much, can you show me how you'd design the same workflow, Sir? Thank you.

Maybe you have files in that folder that are not PDF files?

You were very much right, Sir. @bcorrea I happened to have another folder in there. This is very good to know. So the loop is working now. I actually added a "Start Process" activity to see if I can open each PDF file, but somehow it fails to open the file and then succeeds. I have 4 PDF files in the folder, and what happens is that it fails to open all four files at first and then succeeds after all 4 failures. One thing I am noticing about the first failed attempts is that the file names are different: a character is being added in front of the name for some reason.
Example:

File name: invoice.pdf
Failed file: _invoice.pdf, then it succeeds in opening invoice.pdf

Any idea why this may be happening? I have tried a few things, like specifying the item name and the string: item.ToString, """"+item.ToString+"""", "Item.ToString", and plain item.

First thing: you can change your GetFiles call to be like this: Directory.GetFiles("c:\folder", "*.pdf") so it won't pick up anything that is not a PDF.
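
As a rough illustration of the whole loop with that filter in place, here is a Python sketch (not UiPath code). The names list_pdfs and convert_folder are hypothetical, and extract_text is only a stand-in for whichever Read PDF Text or OCR step you use; it is passed in as a function so the sketch does not assume any particular OCR library.

```python
from pathlib import Path
from typing import Callable, List


def list_pdfs(folder: Path) -> List[Path]:
    """Return only regular files ending in .pdf, skipping subfolders and other file types."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_folder(folder: Path, extract_text: Callable[[Path], str]) -> List[Path]:
    """Write one .txt file per PDF, named after the PDF, and return the paths written."""
    written = []
    for pdf in list_pdfs(folder):
        # Same name as the PDF, with only the extension changed
        out = pdf.with_suffix(".txt")
        out.write_text(extract_text(pdf), encoding="utf-8")
        written.append(out)
    return written
```

Filtering before the loop means a stray subfolder or non-PDF file never reaches the read step, which is the kind of item that made the loop fail earlier in this thread.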

In your Start Process you can put just item and not ""+item.ToString+""

@bcorrea When I put just item, the same issue occurs. That is what led me to try ""+item.ToString+"" in the first place.

Oh, you are saying the file names and folder have " " spaces, so you need a " before and after? So you need to use three """ before and after? Well, your code cannot be putting that _ in the name. It is hard to tell where it is coming from if it is really not in the folder… Do you have something hidden there?

The file names have special characters in them, though. Could that be the issue? One of my file names is "PO#00563460", and other ones have a dot "." in them or an underscore "_" at the end.

ok, try this:

Chr(34) & item & Chr(34)

@bcorrea Something very weird is happening: it is attempting to open some files that don't even exist in the directory, files that I have deleted. It attempts to open them with the wrong names and fails, then properly opens the intended files. I have restarted UiPath and created an entirely new project with new variables.
It is still occurring.

Here is an example. There are only two files in the directory, but when run it attempts to open a file that doesn't even exist anymore.

My guess is that you may have some hidden/system files in there. Can you enable this in here:

@bcorrea I am so stupid. I figured out what the issue is. I am working on a Mac and running UiPath on a Windows machine in Parallels. The path I was using was the Windows one, and that is what was causing the issue. Even though it is in Windows, the files are still saved on the Mac. So once I changed the path to reflect the macOS folder structure, it worked like a charm.
