Extract Text from a PDF document

Hello everyone,

In my robot, I want to open a PDF document, get part of its text, and copy it so that I can paste it somewhere else. The problem is that I don’t know how to get at the text. When I try to indicate it on screen, the whole document is selected rather than the specific text, even if I change the reading options. I know that the “Read PDF Text” activity exists, but I only need to copy and paste one word. I can’t share a screenshot of the actual PDF file for privacy reasons, but it looks similar to the screenshot I’ve posted here (the circle marks the text I want to extract). Do you have any ideas?

Best regards!

@RandomGuy - try a CV Screen Scope and extract the text inside it; that should do it.

For CV Screen Scope, you have to provide the API key from cloud.uipath.com.

@prasath17 Thanks for answering!
Unfortunately, the CV activities are not allowed at my workplace.

@RandomGuy - another option would be Regex, since you just want to extract the company name (for example) from the PDF.

Hi @RandomGuy

Try the Document Understanding feature in UiPath too.

Hi @NIVED_NAMBIAR … The reason I didn’t suggest DU is that the user only needs to extract one single element from the PDF, and DU is very heavy for that. Do you agree?

Yes, but I think it is a scanned PDF, since @RandomGuy is only able to select the whole PDF instead of individual elements.

So for a scanned PDF, DU may be useful, I think.

@NIVED_NAMBIAR Actually it is not a scanned PDF. It is taken from an online web tool.

@RandomGuy - if it’s not scanned and the text is readable, then please try converting the PDF to text format and share the text file.

Using Regex we can extract the value you are looking for.

Is there a way to convert the PDF to text without some online tool or extra software?

@RandomGuy - Yes. First install the package shown below:

[image: screenshot of the package to install]

Then use the Read PDF Text activity to read your PDF and write the output to a text file.

@prasath17
That’s how it is. Unfortunately, I can’t manage to output it to a text file.
I thought of using the Create File activity but it doesn’t allow an input.
Sorry for asking so many questions, but I am new to these activities :sweat_smile:

Okay, never mind. I gave the Read PDF Text activity an output variable and used it in the Write Text File activity.
The question is how I can extract a specific word from the .txt file :thinking:

@RandomGuy - Is it possible to share the .txt file, or a clear screenshot of the text you want to extract (showing all the text above it)?

@prasath17
It contains data that I am not allowed to share, so I have to mask some of it, but it should be fine, I guess. I need the text highlighted in blue:

@RandomGuy - Please check if this is working…

In an Assign activity: StrOrgID = System.Text.RegularExpressions.Regex.Match(YourOutputvariablefromReadPDFtxt, "(?<=Org-ID AG\s*)\w+").Value

Declare StrOrgID as a String variable.
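For anyone who wants to try the idea outside Studio, here is a rough Python sketch of the same extraction. Python's `re` module does not accept the variable-width lookbehind used in the expression above, so this version uses a capture group instead. The function name and the sample text are made up for illustration; they are not taken from the real PDF.

```python
import re

# Hypothetical sample standing in for the text produced by Read PDF Text.
SAMPLE_TEXT = """Company: Example GmbH
Org-ID AG   AB12345
Date: 01.01.2021"""

# "Org-ID AG" is the anchor; \s* skips any spaces, tabs, or line breaks after it,
# and (\w+) captures the first run of letters, digits, or underscores.
ORG_ID_PATTERN = re.compile(r"Org-ID AG\s*(\w+)")


def extract_org_id(text: str) -> str:
    """Return the word following 'Org-ID AG', or an empty string if not found."""
    match = ORG_ID_PATTERN.search(text)
    return match.group(1) if match else ""


if __name__ == "__main__":
    print(extract_org_id(SAMPLE_TEXT))
```

Returning an empty string when nothing matches mirrors what `.Value` gives on a failed match in .NET, so a check for an empty result works the same way in both places.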

What I have tried…

@prasath17
Yes it does! I displayed a message box and it is showing the ID.
Will it work if the PDF file changes a little bit? The PDF is generated once a day and some data changes :sweat_smile:

@RandomGuy - As long as the "Org-ID AG" text remains the same, it should work, because that text is the anchor used to extract the actual value. Watch it closely and let me know.

If it changes, we have to identify the variations and write a pattern that covers all of them.
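To give an idea of what such a pattern could look like, here is a small Python sketch that tolerates a few variations of the label. The variations themselves (different letter case, an optional colon, a space instead of the hyphen) are only assumptions for illustration; the real ones have to come from comparing the actual daily PDFs.

```python
import re

# Assumed label variations, for illustration only:
#   "Org-ID AG 123", "ORG-ID AG: 123", "Org ID AG" followed by a line break
FLEXIBLE_PATTERN = re.compile(
    r"Org[-\s]?ID\s+AG\s*:?\s*(\w+)",  # optional hyphen/space, optional colon
    re.IGNORECASE,                     # accept any letter case in the label
)


def extract_org_id_flexible(text: str) -> str:
    """Return the value after the label, or an empty string if no variant matches."""
    match = FLEXIBLE_PATTERN.search(text)
    return match.group(1) if match else ""


if __name__ == "__main__":
    for sample in ("Org-ID AG 123", "ORG-ID AG: 456", "Org ID AG\n789"):
        print(extract_org_id_flexible(sample))
```

In UiPath the same kind of loosening should be possible by adjusting the pattern string and passing RegexOptions.IgnoreCase as a third argument to Regex.Match, but test it against a few real files before relying on it.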
