Need help with regex to extract from pdf

Ather_Alam_Khan · June 25, 2020, 7:38am

Dear All,

I am actually trying to extract data from a pdf bank statement to move into a excel sheet. I used “Read PDF Text” activity and “Write Text” activity to extract text into a text file.

Now the output is like:

BRANCH NAME
XXXXXXXXX Branch City

A/C TYPE A/C NUMBER
XXXXX XXXXX XXXXXXX Demand Deposits X-X-XX-XXXXX-XXX-XXXXX
HOUSE#X-XXX XXXXX-4-X XXXXXXXX XXXXX XXXXX
IQBAL, Karachi, Pakistan XX97XXXX0123456789012345

CURRENCY FROM TO PRINTED ON
PKR 01-Apr-2020 30-Apr-2020 07-May-2020

Date Particulars Debit Credit Balance

Opening Balance 00,500.00
07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:07PM 00,000.00 00,500.00
#1269297
07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:08PM 58,000.00 28,536.00
#1269297
07-Apr-2020 Web Fund Transfer to Bank (SCB), Acc: XXXXX 0,500.00 28,036.00
XXXX XXXXX (XXXXXXXXXXXXX),DT:Apr 7 2020 ,003621
07-Apr-2020 Web Fund Transfer to Bank (XXXXXXXX), Acc: XXXXX 0,500.00 00,500.00
XXXX XXXX (XXXXXXXXXXXX),DT:Apr 7 2020 ,003033
EFT Rec from Bank (XXXXX) Account
07-Apr-2020 (XXXXXXXXXXXXX),Sender (- 00,000.00 00,500.00
),DT:0407,123204Remitter msg :Miscellane

Now I need to extract the credit and debit transaction details into an excel sheet into different columns and rows. Since I am very new to regex and have tried creating patterns on https://regexr.com/ but no luck so far. Any help would be highly appreciated.

Thank you
Ather

schwarzp · June 25, 2020, 7:55am

can you please tell me what are the specific details you want from the credit and debit transaction details?

Ather_Alam_Khan · June 25, 2020, 8:06am

so this text is full of many things, like the branch name on top, account type, account number, name address etc

the line " 07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:07PM 00,000.00 00,500.00
#1269297" states [date of transaction] [narration or details of transaction with date and time] [debited amount] [credit] [balance]

The lines above tell two things
[Date] [Particulars] [Debit] [Credit] [Balance]
then [opening balance]

schwarzp · June 25, 2020, 8:45am

at first i would try to split the whole string into substrings. e.g. that your regex only searches in the string " 07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:07PM 00,000.00 00,500.00"

for splitting it i would try splitting by line breaks.

Afterwards you can just loop through the array and grab the information via regex. e.g. (\d{2}-\w{3}-\d{4}) would grab the first date of the string (07-Apr-2020).

Steven_McKeering · June 25, 2020, 10:54am

I am still not 100% sure what you are looking for?
Can you bold the output you want in your original post?
Can you tell me more about the patterns of text also.

Ather_Alam_Khan · June 25, 2020, 2:19pm

I am actually trying convert a pdf bank statement into excel and you understand a bank statement is always full several details like customer name, address, account number, account type, IBAN, statement from (date), statement to (date), printed on (date) and then it comes to transaction details (debit and credit)

Steven_McKeering · June 26, 2020, 12:14am

Hello

I am not 100% certain what you need to capture but I think I understand your requirement.
Have a look at the below solutions:
Regex Solution #1:
(\d\d-\D\D\D-20\d\d)\s(.*)\s(\w+\s\d+)(?=, ).\s(\D\D\D\s\d+\s20\d+)\s(\d+:\d+PM|\d+:\d+AM)\s([0-9,.]+)\s([0-9,.]+)\s(#\d+)

Check out the Regex 101 link

Screenshots:

To write out each variable use an Assign activity with the following:
Match 1:

INSERTVARIABLE(0).Groups(1).ToString

Match 2:

INSERTVARIABLE(0).Groups(2).ToString

Match 3:

INSERTVARIABLE(0).Groups(3).ToString

Just update the capital letters with your output variable from the Matches activity.

Regex Solution #2:
Comments: If this wording (ATM Withdrawal FROM 060137) should be all one group (see red highlights) then try the below Regex Solution. You could also add group 2 + 3 together from the above solution:
(\d\d-\D\D\D-20\d\d)\s(.*)(?=, ).\s(\D\D\D\s\d+\s20\d+)\s(\d+:\d+PM|\d+:\d+AM)\s([0-9,.]+)\s([0-9,.]+)\s(#\d+)

Check out the Regex101 Link

SamanGuruge · June 26, 2020, 1:58am

Hi @Ather_Alam_Khan,

I’m also working on a Bank reconciliation project and I got several bank pdf like you mention in the post. you need to extract the table form your pdf right. Try to build CSV from your pdf file and parse that into Generate Data table activity with your delimiter. then it will give you the table and you can write that data into excel as the way you want

Remove all the additional spaces and identify columns as below
let say Date
(?<=\d{2}-[A-Za-z]{3}-\d{4})(\s{1}|\b \b)
this regex will out you a selected spaces and use Regex.Replace method to replace that space as your delimiter
Regex.Replace(pdf_text,“(?<=\d{2}-[A-Za-z]{3}-\d{4})(\s{1}|\b \b)”,“!”)

Ather_Alam_Khan · June 26, 2020, 6:34am

@SamanGuruge Yes you got it right and thank you for getting into it.

Any example of extracting data from pdf and then to write / build CSV. I guess it will require data table to create one.

Thank you

SamanGuruge · June 26, 2020, 7:03am

hi you can create your own data table based on your details as an eg.you can create table as

Date, Particulars ,Debit ,Credit ,Balance
07-Apr-2020| ATM Withdrawal FROM 060137, Apr 7 2020|1:07PM |00,000.00| 00,500.00

in my project what I’m doing the same thing

I’ll read it using pdf to text reader and it will give me string output
I’ll remove all the unnecessary Spaces by looping each char
I’ll use regex and remove all the unnecessary things in that string-like headers, page numbers, summery, etc.
remaining details are the details which I needed
now I’m using regex lookups to identify each column
using lookup values I’m looking for the space that next to the identified value and replace that value using my column separator(Delimiter)
Note: for delimiter use non-existing char in your context
in here u can clearly identify 4 columns
Date, Credit, Debit, Balance
in your regex Date will identify in one regex and rest of the column are identified in one regex

Note: Use Lookbehind for Date and Lookahead for Credit Debit and Balance by doing that you can automatically separate Particular column that is the most hard column to identify

Happy Automation

Ather_Alam_Khan · June 26, 2020, 7:44pm

hi Saman,
I tried the steps you asked me to follow but failed.

Is it possible for us to connect and get this done.

Thank you

SamanGuruge · June 27, 2020, 3:02am

sure

Khushboo_Chandra · August 19, 2021, 5:00am

I also need to extract first name last name from the PDF and it is in format

To,

Khushboo Chandra

by using regex , I tried with multiple methods but not getting exact
please help me

SamanGuruge · August 19, 2021, 5:12am

hi @Khushboo_Chandra

Please send me a sample PDF.

Topic		Replies	Views
Need regex to extract the data Activities activities , question , document_processing	13	1303	February 22, 2021
Can help on regex activity Activities uiautomation , pdf-extraction , pdf-to-excel	9	44	December 21, 2024
Regex for extracting data from a Table Activities pdf , activities , question	6	992	August 13, 2021
Regex issue, text group extract Activities pdf , activities , regex , question , regex-extractor	6	1026	February 2, 2022
Regex assistance Help pdf , data_scraping , regex , question	8	1267	November 11, 2019

Most Active Users - Yesterday
ashokkarale
singh_sumit
alonica
SorenB
manasrlenka25
Firat_SECKIN_BULUT
Lahiru.Fernando
ruchirmahajan
RyanTSG
Nitesh
More details...

Need help with regex to extract from pdf

Related topics