Hello. I am trying to extract a series of transaction descriptions from the sample bank statement PDF below. The descriptions vary a lot in length, so the format is very irregular. I tried extracting the whole description column as one long text and then splitting it on a delimiter, but there doesn't seem to be any clear delimiter to split on.
As shown in the attachment below, each highlighted box is one transaction. Because the description length varies, I could not use Document Understanding to extract them one by one.
I did see that I could probably separate them using a “NewLine” delimiter with no space. But the text I extracted with various OCR engines never included the two extra spaces.
What is a better way to extract irregularly formatted data like this? I would appreciate some help.
Use the Read PDF Text activity (try it both with and without preserving the format) to convert the PDF data to text, and save the result to a text file using the Write Text File activity. Analyse the text and see whether you can use regex to extract the required data. If possible, please share the text file.
I extracted the text both with and without preserving the format. Unfortunately I can't attach the files here, as UiPath says new users cannot upload attachments. So here is the extracted text without formatting.
ABC Bank Berhad
1 st 100 Jalan Tun Perak, 50050 Kuala @ABCBank Lumpur, Malaysia
SECTION 14, RF
MUKN PAGE 1
TARIKH PENYATA
ABCDE SDN. BHD. 30/04/21
LEVEL 12, ABCD CENTRE STATEMENT DATE
46200 PETALING JAYA SELANGOR NOMBOR AKAUN
123456789
ACCOUNT
NUMBER
PROTECTED BY PIDM UP TO RM250,OOO FOR EACH DEPOSITOR CORPORATE CURRENT ACCOUNT
TARIKH
MASUK
ENTRY DATE
TARIKH Nil-Al
VALUE DATE
BUTIR URUSNIAGA
TRANSACTION DESCRIPTION
JUMLAH URUSNIAGA
TRANSACTION AMOUNT
BAKI PENYATA
STATEMENT BALANCE
BEGINNING BALANCE 1,000.00
PAYMENT DEBIT - APS /OTHERS 900.00
01/04 MAS PAYMENT 100.00-
Merchant pymt PV-00002
22 Mar to 31 Mar 21
MAS SERVICE CHARGE pymt 895.00
01/04 INTER-BANK PAYMENT INTO A/C 996.00
02/04 SPM CUSTODY FBO STR 5.00-
ABC COMPANY 101.00+
IBG TRANSACTION
INTER-BANK PAYMENT INTO A/C 1,098.00
07/04 SPM CUSTODY FBO STR
ABC COMPANY 102.00+
IBG TRANSACTION
INTER-BANK PAYMENT INTO A/C 1,201.00
07/04 SPM CUSTODY FBO STR
ABC COMPANY 103.00+
IBG TRANSACTION
PAYMENT DEBIT - APS /OTHERS 1,097.00
MAS PAYMENT
08/04 Merchant pymt 104.00-
1 Apr to 4 Apr 21
MAS SERVICE CHARGE pymt 1,094.50
INTER-BANK PAYMENT INTO A/C 1,199.50
08/04 SPM CUSTODY FBO STR 2.50-
08/04 ABC COMPANY 105.00+
IBG TRANSACTION
INTER-BANK PAYMENT INTO A/C 1,305.50
DMS A3 (FOR STRIPE)
08/04 1JJ72OD02JJ0187H394 106.00+
STRIPE UNIFIED PAYOU
INTER-BANK PAYMENT INTO A/C 1,412.50
PETROLIAM NASIONAL
08/04 PPR2931CDGH 107.00+
IF012198HH99O0008
TRANSFER FR A/C
1,304.50
09/04 108.00-
BAKI LEGAR BAKI AKHIR - CEK BELUM JELAS
LEDGER ENDING BALANCE - UNCLEARED CHEQUES Wang yang keluar
berlebihan ditandakan
BALANCE dengan DR
Perhatian / Note
Sempa maklumat dan baki ang dinyatakan di sini akan diangg:p betul
Overdrawn balances are
ketjdaktepatan dalam tempoh 21 hari. denoted by DRAll items and balances shown will be considered correct unless the Bank is
notified in writing of any discrepancies within 21 days.
(2) Sila beritahu kami sebaran pertukaran alamat secara bertulis.
Please notify us of any change of address in writing.
What I can observe is that, after splitting out the description field between “STATEMENT BALANCE” and “LEDGER ENDING BALANCE”, you removed each date and replaced it with a space, and also removed the amount that follows the description. Please correct me if I am wrong.
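To make that reading concrete, here is a rough Python sketch of the same idea, not the actual workflow from the earlier reply. It assumes dates look like `dd/mm`, amounts use comma thousands separators with two decimals and an optional trailing `+` or `-`, and the function names are made up for illustration. The same patterns should carry over to a regex activity in UiPath with little change.

```python
import re

# Markers taken from the extracted text; everything between them is
# treated as the transaction area.
START_MARKER = "STATEMENT BALANCE"
END_MARKER = "LEDGER ENDING BALANCE"

# Assumption: dates are dd/mm and not part of a longer date like 30/04/21.
DATE_RE = re.compile(r"(?<![\d/])\d{2}/\d{2}(?![\d/])")
# Assumption: amounts look like 1,000.00 or 100.00- or 101.00+.
AMOUNT_RE = re.compile(r"(?<![\w.,])\d{1,3}(?:,\d{3})*\.\d{2}[+-]?")


def transaction_block(text):
    """Return the text between the two markers (raises ValueError if missing)."""
    start = text.index(START_MARKER) + len(START_MARKER)
    end = text.index(END_MARKER, start)
    return text[start:end]


def clean_line(line):
    """Replace dates and amounts with a space, then collapse whitespace."""
    line = DATE_RE.sub(" ", line)
    line = AMOUNT_RE.sub(" ", line)
    return " ".join(line.split())


def description_lines(text):
    """Cleaned, non-empty description lines from the transaction area."""
    cleaned = (clean_line(line) for line in transaction_block(text).splitlines())
    return [line for line in cleaned if line]
```

One caveat: in the text posted above, the footer line "BAKI LEGAR BAKI AKHIR - CEK BELUM JELAS" sits just before the end marker, so it would survive this step and would need to be filtered out separately.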
That could work, as each transaction starts with the date… but I also found that my sample PDF bank statement has a formatting issue: the date should align with the first line of the description field.
Your solution could still work, though. I will try it on my case once I have corrected the PDF format, and will post an update soon on whether it works.
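For reference, once the date sits on the first line of each description, the grouping could look something like the Python sketch below. This is only an illustration under that assumption; `group_transactions` is a made-up name, the `dd/mm` and amount patterns are guesses based on the sample text, and lines before the first dated line (such as the beginning balance) are simply skipped.

```python
import re

# Assumption: a transaction's first line starts with a dd/mm date.
DATE_LINE_RE = re.compile(r"^(\d{2}/\d{2})(?:\s+(.*))?$")
# Assumption: amounts look like 1,000.00 or 100.00- or 101.00+.
AMOUNT_RE = re.compile(r"(?<![\w.,])\d{1,3}(?:,\d{3})*\.\d{2}[+-]?")


def group_transactions(lines):
    """Group lines into transactions, starting a new one at each dated line."""
    transactions = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = DATE_LINE_RE.match(line)
        if match:
            # A dated line opens a new transaction.
            transactions.append({"date": match.group(1), "parts": []})
            line = match.group(2) or ""
        elif not transactions:
            # Nothing to attach to yet (e.g. the beginning balance line).
            continue
        # Drop amounts and collapse whitespace, keeping only description text.
        text = " ".join(AMOUNT_RE.sub(" ", line).split())
        if text:
            transactions[-1]["parts"].append(text)
    return [
        {"date": t["date"], "description": " ".join(t["parts"])}
        for t in transactions
    ]
```

Each returned item holds the date and the full description joined into one string, so descriptions of any length end up as one record per transaction.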