Extracting data from PDF to create a filename

I have tried everything. … I need to extract a name from several PDF files and use the extracted name as the filename for the document.

I have seen the pdf extraction video several times (10 times to be exact) and tried all the options… Get Text, Anchor Base, Read OCR … and still not getting the information to name the file.

So simply put, I have over 100 documents (that are the exact same) and I want to pull a name from each document and save the document with that extracted name. Is this possible?

@Renia,

Did you try reading the PDF file with the Read PDF Text activity and extracting the name with regex?

Is it possible to provide a sample PDF file?

best,
Sid

I am new…brand new to UiPath! I tried reading the PDF but got no results in my message box. I am not sure what regex is, but I have attached a sample PDF. I want the SAMPLE COMPANY 101 to be extracted and used to rename the document.

I tried to upload the pdf but it says new users cannot upload documents so I can provide a screenshot of the pdf.

@Renia,

No worries. Regex means regular expression, which can be used to extract particular values from a string.

Can you please describe the steps you followed while working on this solution? Also, can you check whether the PDF file is an image or not?

Sid

Hi Renia,

Try as below:

  1. Use Read PDF Text activity to get all contents
  2. Use Generate Data Table activity to move all PDF text to the data table.
  3. If your PDF is in a fixed format, the name should be in the same cell in the data table.

Can you provide a sample PDF of the same?

Is “company” the default one?

@Renia if possible send sample pdf so that we’ll try.

If you cannot share the PDF with us, do this.
Install PDF Activities from the Package Manager, after that use Read PDF activity, and paste the extracted value in a file.

Then post the part of the text around the information that you want.
After that we can use Regex and String manipulation to extract that information.

If the files are the same, this can be easily automated.

So I have been successful with Read PDF with OCR and sending the text to a message box that shows the full extraction of the PDF.

So now…how do I scrape the data I want for the file name and name the file…for several of these? I am no longer a newbie so I think I can attach the file now.

sample PDF FOR EXTRACTION.pdf (110.4 KB)

The steps that I have done so far are:

Created a Sequence
Added “Read PDF with OCR”
Selected file to read (the attached)
Used the Tesseract OCR
Added Message Box
Created the variable pdftext

I got it to read the PDF, but I am still not sure what you mean by Regex. Pulling the name from the PDF is exactly what I want. I can upload the PDF now … I am no longer a "newbie".

sample PDF FOR EXTRACTION.pdf (110.4 KB)

@Renia,

Once you read the data from the PDF file, you can use the split function or the substring function (which requires a little more manipulation) and extract the name as shown below.

Attaching the xaml for your reference.

All the best,
Sid

Main.xaml (6.2 KB)
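
For readers who cannot open the attached xaml, the split idea can be sketched outside UiPath. This is a Python illustration of the logic only, not the attached workflow. The sample text and the "Company Name:" marker are hypothetical; only SAMPLE COMPANY 101 comes from this thread, and the real markers depend on what the extracted text actually looks like.

```python
def text_between(text, before, after):
    """Return the trimmed text between two markers, or None if `before` is missing."""
    parts = text.split(before, 1)
    if len(parts) < 2:
        return None  # the leading marker never appeared
    # Keep everything up to the first occurrence of the trailing marker.
    return parts[1].split(after, 1)[0].strip()

# Hypothetical extracted text; the marker and surrounding lines are made up.
sample = "Header line\nCompany Name: SAMPLE COMPANY 101\nFooter line\n"
print(text_between(sample, "Company Name:", "\n"))
```

Splitting on fixed markers only works while every document uses the same wording around the value, which is why the regex route discussed later in the thread tends to be more forgiving.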

Hello

You responded to my question back in August and I am revisiting this process again after finishing the classes and am still not able to make this work. I have attached a copy of the PDF and I can extract all the text from the PDF, but the only thing I want out of the file is at the very bottom where it says CLIENTID.002 (on the very last line).

I am not sure how to use Regex to get the string. I have over 100 files I need to rename based on whatever the ClientID.00X is

Can you help me?

sample PDF FOR EXTRACTION.pdf (110.4 KB)

PDFExtraction.zip (119.7 KB)

This one is rather easy.
You should take time and learn Regex since it will help you so much to be a better programmer in general (not just for RPA).

Cheers :slight_smile:

Actually, the pattern should be “CLIENTID.\d+”, not the one that I’ve put; it is safer.
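
To try the pattern outside Studio, here is a minimal Python sketch of the same match; it is not a UiPath workflow. The sample text is made up, and only the CLIENTID.002 value comes from this thread. One assumption of mine: an unescaped `.` matches any character, so the sketch escapes it as `\.`, which is slightly stricter than the pattern posted above.

```python
import re

# Stricter variant of the posted pattern: the dot is escaped so it only
# matches a literal "." (an assumption, not the original pattern).
CLIENT_ID = re.compile(r"CLIENTID\.\d+")

def extract_client_id(text):
    """Return the first CLIENTID.<digits> match in text, or None if absent."""
    match = CLIENT_ID.search(text)
    return match.group(0) if match else None

# Made-up extracted text; only the last line mirrors the thread's example.
sample = "Some header\nMore extracted text\nCLIENTID.002\n"
print(extract_client_id(sample))
```

Returning None when nothing matches makes it easy to skip or log files where the OCR did not pick up the ID.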

Thank you, this is great. How can I now store this variable to rename the PDF using this extracted text? And how can I do this for a folder that includes 100 PDF files?

First of all, you need a For Each activity that will loop through all files in the folder.
You can read all the names of the files and store them in a String Array with an Assign activity:

StrArr = Directory.GetFiles("yourFolderPath", "*.PDF")

(Don’t forget to change the object in For Each to String)

After that, you read the Item that you are iterating (let’s call this item File).
Use Read PDF Activity and read the File, store that file in a variable, let’s call it WholeText

After that, use Matches activity like I’ve shown you (the input being WholeText, pattern being the one above, and output is RegexResults Enumerable).

You can use Assign activity to store the value of the extracted text :

NewFileName = RegexResults(0).Value

You now have the NewFileName.

Do this part, show me your XAML file, then we will discuss actually changing the file name :slight_smile:
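
The loop described in this post can be sketched as follows, in Python rather than as a UiPath workflow, so treat it as an illustration of the logic only. `plan_renames` and `read_text` are hypothetical names: `read_text` stands in for the Read PDF step and is passed in by the caller. The sketch only works out the new names; it does not rename anything.

```python
import re
from pathlib import Path

CLIENT_ID = re.compile(r"CLIENTID.\d+")  # pattern as posted in the thread

def plan_renames(folder, read_text):
    """Pair each PDF in folder with the path it would be renamed to.

    read_text is a hypothetical callable (path -> extracted text) standing
    in for the Read PDF step. Files without a match are skipped.
    """
    plan = []
    for pdf in sorted(Path(folder).iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        match = CLIENT_ID.search(read_text(pdf))
        if match is None:
            continue  # nothing to name this file after
        plan.append((pdf, pdf.with_name(match.group(0) + ".pdf")))
    return plan
```

One design note: the sketch checks for a missing match before using it. In the workflow, RegexResults(0) would fail on a file where the pattern is not found, so a similar check before the Assign may be worth adding.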

Thank you for your assistance. I have completed the steps above. Don’t really see anything when I run it.

I have attached my XAML file and am ready for the next step.

Main.xaml (5.9 KB)

Sorry, I was away; going to look at the XAML file now.

Can you attach the whole ZIP file? I am having trouble opening this project version.

Edit:
Nvm, here you go:

Donna.zip (118.6 KB)

I got the zip file…

In the Assign, I added the file path where the new files should be saved…

In the variable, I added the S drive location of the files I want to rename.

I am still getting an error in the Assign spot. What am I doing wrong? I get the attached Runtime Execution error.

Main.xaml (6.6 KB)