Need help with regex to extract from pdf

Dear All,

I am actually trying to extract data from a pdf bank statement to move into a excel sheet. I used “Read PDF Text” activity and “Write Text” activity to extract text into a text file.

Now the output is like:

BRANCH NAME
XXXXXXXXX Branch City

A/C TYPE A/C NUMBER
XXXXX XXXXX XXXXXXX Demand Deposits X-X-XX-XXXXX-XXX-XXXXX
HOUSE#X-XXX XXXXX-4-X XXXXXXXX XXXXX XXXXX
IQBAL, Karachi, Pakistan XX97XXXX0123456789012345

CURRENCY FROM TO PRINTED ON
PKR 01-Apr-2020 30-Apr-2020 07-May-2020

Date Particulars Debit Credit Balance

Opening Balance 00,500.00
07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:07PM 00,000.00 00,500.00
#1269297
07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:08PM 58,000.00 28,536.00
#1269297
07-Apr-2020 Web Fund Transfer to Bank (SCB), Acc: XXXXX 0,500.00 28,036.00
XXXX XXXXX (XXXXXXXXXXXXX),DT:Apr 7 2020 ,003621
07-Apr-2020 Web Fund Transfer to Bank (XXXXXXXX), Acc: XXXXX 0,500.00 00,500.00
XXXX XXXX (XXXXXXXXXXXX),DT:Apr 7 2020 ,003033
EFT Rec from Bank (XXXXX) Account
07-Apr-2020 (XXXXXXXXXXXXX),Sender (- 00,000.00 00,500.00
),DT:0407,123204Remitter msg :Miscellane

Now I need to extract the credit and debit transaction details into an excel sheet into different columns and rows. Since I am very new to regex and have tried creating patterns on https://regexr.com/ but no luck so far. Any help would be highly appreciated.

Thank you
Ather

can you please tell me what are the specific details you want from the credit and debit transaction details?

so this text is full of many things, like the branch name on top, account type, account number, name address etc

the line " 07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:07PM 00,000.00 00,500.00
#1269297" states [date of transaction] [narration or details of transaction with date and time] [debited amount] [credit] [balance]

The lines above tell two things
[Date] [Particulars] [Debit] [Credit] [Balance]
then [opening balance]

at first i would try to split the whole string into substrings. e.g. that your regex only searches in the string " 07-Apr-2020 ATM Withdrawal FROM 060137, Apr 7 2020 1:07PM 00,000.00 00,500.00"

for splitting it i would try splitting by line breaks.

Afterwards you can just loop through the array and grab the information via regex. e.g. (\d{2}-\w{3}-\d{4}) would grab the first date of the string (07-Apr-2020).

I am still not 100% sure what you are looking for?
Can you bold the output you want in your original post?
Can you tell me more about the patterns of text also.

1 Like

I am actually trying convert a pdf bank statement into excel and you understand a bank statement is always full several details like customer name, address, account number, account type, IBAN, statement from (date), statement to (date), printed on (date) and then it comes to transaction details (debit and credit)

Hello

I am not 100% certain what you need to capture but I think I understand your requirement.
Have a look at the below solutions:
Regex Solution #1:
(\d\d-\D\D\D-20\d\d)\s(.*)\s(\w+\s\d+)(?=, ).\s(\D\D\D\s\d+\s20\d+)\s(\d+:\d+PM|\d+:\d+AM)\s([0-9,.]+)\s([0-9,.]+)\s(#\d+)

Check out the Regex 101 link

Screenshots:


To write out each variable use an Assign activity with the following:
Match 1:

INSERTVARIABLE(0).Groups(1).ToString

Match 2:

INSERTVARIABLE(0).Groups(2).ToString

Match 3:

INSERTVARIABLE(0).Groups(3).ToString

Just update the capital letters with your output variable from the Matches activity.

Regex Solution #2:
Comments: If this wording (ATM Withdrawal FROM 060137) should be all one group (see red highlights) then try the below Regex Solution. You could also add group 2 + 3 together from the above solution:
(\d\d-\D\D\D-20\d\d)\s(.*)(?=, ).\s(\D\D\D\s\d+\s20\d+)\s(\d+:\d+PM|\d+:\d+AM)\s([0-9,.]+)\s([0-9,.]+)\s(#\d+)

Check out the Regex101 Link

Hi @Ather_Alam_Khan,

I’m also working on a Bank reconciliation project and I got several bank pdf like you mention in the post. you need to extract the table form your pdf right. Try to build CSV from your pdf file and parse that into Generate Data table activity with your delimiter. then it will give you the table and you can write that data into excel as the way you want

Remove all the additional spaces and identify columns as below
let say Date
(?<=\d{2}-[A-Za-z]{3}-\d{4})(\s{1}|\b \b)
this regex will out you a selected spaces and use Regex.Replace method to replace that space as your delimiter
Regex.Replace(pdf_text,"(?<=\d{2}-[A-Za-z]{3}-\d{4})(\s{1}|\b \b)","!")

1 Like

@SamanGuruge Yes you got it right and thank you for getting into it.

Any example of extracting data from pdf and then to write / build CSV. I guess it will require data table to create one.

Thank you

hi you can create your own data table based on your details as an eg.you can create table as

Date, Particulars ,Debit ,Credit ,Balance
07-Apr-2020| ATM Withdrawal FROM 060137, Apr 7 2020|1:07PM |00,000.00| 00,500.00

in my project what I’m doing the same thing

  1. I’ll read it using pdf to text reader and it will give me string output
  2. I’ll remove all the unnecessary Spaces by looping each char
  3. I’ll use regex and remove all the unnecessary things in that string-like headers, page numbers, summery, etc.
  4. remaining details are the details which I needed
  5. now I’m using regex lookups to identify each column
  6. using lookup values I’m looking for the space that next to the identified value and replace that value using my column separator(Delimiter)
    Note: for delimiter use non-existing char in your context
  7. in here u can clearly identify 4 columns
    Date, Credit, Debit, Balance
  8. in your regex Date will identify in one regex and rest of the column are identified in one regex

Note: Use Lookbehind for Date and Lookahead for Credit Debit and Balance by doing that you can automatically separate Particular column that is the most hard column to identify

Happy Automation :smiley:

hi Saman,
I tried the steps you asked me to follow but failed.

Is it possible for us to connect and get this done.

Thank you

sure