Get specific text from a PDF

Hi all,

I’m having a problem getting text from a PDF.
I want to get specific text from the PDF, for example the name and the IC number (the text in the text boxes).
Which method should I use?

Regards,
Lean

Can you send a sample PDF file, @lyjun550?

@lyjun550 If the format is always the same, try the Read PDF Text activity to get the text of the PDF as a String, and output the String in a Message Box. We might then be able to apply Regex to it to extract the details needed.

This is the output from Read PDF Text.
The name and IC No. do not appear in the same location as in the form.

Sorry, I don’t have an example…

The payee name and purpose have also moved, and the tick can’t be read.

How does Regex work?

Click the icon below in the Output panel.
image
Save to desktop and upload.
De-identify the request if needed.

Copy contents and paste them here :slight_smile:

We can then do our best to assist.

Here it is: output.txt (1.1 KB)

Can you please bold what you are trying to obtain? :slight_smile:

07/14/2020 14:44:39 => [Debug] Execution started for file: test
07/14/2020 14:44:44 => [Info] Extract PDF execution started
07/14/2020 14:44:48 => [Info] Authorization For Salary Deduction

Date:

The Human Resource Officer

Tokio Marine Life Insurance Malaysia Bhd. (457556-X)

Menara Tokio Marine Life, Ground Floor,

189, Jalan Tun Razak,

50400 Kuala Lumpur.

Dear Sir,
Abu Bakar 010101000011
I , (I/C No ) ,

hereby authorise TOKIO MARINE LIFE INSURANCE MALAYSIA BHD. to deduct the sum of
RM 9999 from my salary and remit on my behalf to the following:

Payee’s Name
Chin Chan

Purpose
1231414

Salary deduction

4 October
Once; Salary deduction only for the month of __________________

Please Select
Recurring; Salary deduction effective from the month of ________________ and this authorization will
remain in force until and unless revoked by me in writing.

Thank you

Yours truly,

Employee No : 1

Dept / Branch : Information Technology

@Steven_McKeering
Date:

The Human Resource Officer

Tokio Marine Life Insurance Malaysia Bhd. (457556-X)

Menara Tokio Marine Life, Ground Floor,

189, Jalan Tun Razak,

50400 Kuala Lumpur.

Dear Sir,
Abu Bakar 010101000011
I , (I/C No ) ,

hereby authorise TOKIO MARINE LIFE INSURANCE MALAYSIA BHD. to deduct the sum of
RM 9999 from my salary and remit on my behalf to the following:

Payee’s Name
Chin Chan

Purpose
1231414

Salary deduction

4 October
Once; Salary deduction only for the month of __________________

Please Select
Recurring; Salary deduction effective from the month of ________________ and this authorization will
remain in force until and unless revoked by me in writing.

Thank you

Yours truly,

Employee No : 1

Dept / Branch : Information Technology

Date : 12 12 12


I also need the tick box, but it is not text; it is read as “4”. What I mean is that I need to get the month from the line of the ticked box.

The name (Abu Bakar) and the number have to be separated.

Hello

Please provide the text result for the tick box and the month sample.

The patterns for the other results are below. Use each one in a Matches activity.

To get the digits after RM:
(?<=RM)\s+(\d+)
The value must be digits only.
Then use an Assign activity with the following expression to get group 1 (replace INSERTVARIABLE with the output variable of the Matches activity).
INSERTVARIABLE(0).Groups(1).ToString
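
To try the pattern outside UiPath, here is a minimal Python sketch. It is only an illustration, not part of the workflow: the sample string is copied from the Read PDF Text output in this thread, and the names `text`, `matches` and `amount` are made up for the example. It shows why group 1 is needed: the whole match still contains the whitespace after "RM".

```python
import re

# Sample line copied from the Read PDF Text output quoted in this thread.
text = (
    "hereby authorise TOKIO MARINE LIFE INSURANCE MALAYSIA BHD. to deduct the sum of\r\n"
    "RM 9999 from my salary and remit on my behalf to the following:"
)

# Same pattern as above: whitespace after "RM", then the digits as group 1.
pattern = re.compile(r"(?<=RM)\s+(\d+)")

# The list of matches plays the role of the Matches activity output.
matches = list(pattern.finditer(text))

# Equivalent in spirit to INSERTVARIABLE(0).Groups(1).ToString
amount = matches[0].group(1) if matches else None

print(repr(matches[0].group(0)))  # whole match, including the leading space
print(repr(amount))               # group 1, digits only
```

Checking for an empty match list before indexing avoids an error when the PDF text does not contain "RM" followed by digits.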

Payee’s Name:
(?<=Payee.s Name)\s+\n(.*)
You will need to use group 1 to get the result.

Purpose
(?<=Purpose)\s+\n(.*)
Get group 1 again.

4 October - Double check this one.
\d+\s\w+\n(?=Once;)
This will work as long as the next line starts with “Once;”.

Dept / Branch
(?<=Dept / Branch)\s+:\s+(.*)
Get Group 1 for this one.

Date
(?<=Date)\s+:\s+(\d+\s\d+\s\d+)
Get group 1 for the result.

Employee No
(?<=Employee No)\s+:\s+(\d+)
Get group 1 for the result.
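
The three "label : value" patterns (Dept / Branch, Date, Employee No) can be checked together with a small Python sketch. This is an assumption-laden illustration: the sample text is a shortened copy of the output in this thread with Windows-style `\r\n` line endings assumed, and `PATTERNS` and `extract_fields` are hypothetical names. Note that `(.*)` stops at the newline but can still capture a trailing `\r`, so the sketch strips each value.

```python
import re

# Shortened sample of the Read PDF Text output; "\r\n" line endings are an assumption.
text = (
    "Date:\r\n\r\n"
    "The Human Resource Officer\r\n\r\n"
    "Employee No : 1\r\n\r\n"
    "Dept / Branch : Information Technology\r\n\r\n"
    "Date : 12 12 12\r\n"
)

# The patterns from the posts above, unchanged.
PATTERNS = {
    "dept": r"(?<=Dept / Branch)\s+:\s+(.*)",
    "date": r"(?<=Date)\s+:\s+(\d+\s\d+\s\d+)",
    "employee_no": r"(?<=Employee No)\s+:\s+(\d+)",
}

def extract_fields(source):
    """Return group 1 of the first match for each pattern, or None if absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        m = re.search(pattern, source)
        # strip() drops a trailing "\r" that (.*) may capture
        fields[name] = m.group(1).strip() if m else None
    return fields

fields = extract_fields(text)
print(fields)
```

The empty "Date:" at the top of the form is skipped because the pattern requires whitespace before the colon, so only the filled-in "Date : 12 12 12" line matches.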

Abu Bakar AND 010101000011
(?<=Dear Sir.)\s+\n\s*([\D]+)\s(\d+)
Get group 1 for Abu Bakar

To get the number 010101000011
Regex pattern: (?<=Dear Sir.)\s+\n\s*([\D]+)\s(\d+)
Use group 2 for 010101000011
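
As a rough illustration of how one match yields both values, here is a Python sketch of the same pattern. The sample text assumes Windows-style `\r\n` line endings after "Dear Sir,": `\s+\n` needs at least one whitespace character before the newline, so with a single bare `\n` the pattern would not match. The variable names are made up for the example.

```python
import re

# Sample copied from the Read PDF Text output; "\r\n" line endings are an assumption.
text = "Dear Sir,\r\nAbu Bakar 010101000011\r\nI , (I/C No ) ,\r\n"

# Same pattern as above: group 1 is the non-digit run (the name),
# group 2 is the digit run that follows it (the IC number).
pattern = re.compile(r"(?<=Dear Sir.)\s+\n\s*([\D]+)\s(\d+)")

m = pattern.search(text)
name = m.group(1) if m else None       # like INSERTVARIABLE(0).Groups(1).ToString
ic_number = m.group(2) if m else None  # like INSERTVARIABLE(0).Groups(2).ToString

print(name)
print(ic_number)
```

In UiPath this corresponds to two Assign activities on the same Matches output, one reading Groups(1) and one reading Groups(2).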

If this helped, please mark as solved :smiley:

Thanks! But how do I separate group 1 and group 2, like Abu Bakar and 01010101?


I’m getting this error. What’s the problem?

You will need to clean the string first.

UiPath thinks there are invisible characters…

System.Text.RegularExpressions.Regex.Replace(INSERTVARIABLE, "[^a-z A-Z 0-9]", "")

So I had to clean the invisible/illegal characters out of the string before UiPath would accept it :slight_smile:
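
For reference, this is roughly what that replace does, sketched in Python with the same character class. The function name `clean` and the sample inputs are hypothetical. A trailing `\r` captured by `(.*)` is one plausible source of such invisible characters, but that is a guess and not something confirmed in this thread.

```python
import re

def clean(value):
    """Keep only ASCII letters, digits and spaces; drop everything else."""
    return re.sub(r"[^a-z A-Z 0-9]", "", value)

print(repr(clean("Chin Chan\r")))        # carriage return removed
print(repr(clean("Abu Bakar\u200b")))    # zero-width space removed
print(repr(clean("Dept / Branch")))      # punctuation is removed too
```

Because the class also strips punctuation and accented letters, it suits plain names and numbers; for values that legitimately contain other characters, a narrower class may be the safer choice.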

Workflow attached.
Main.xaml (15.4 KB)

If this helped, please mark as solved.

It works, thanks!
