Extracting PDF Unstructured Data with irregular format

Hello. I am trying to extract a series of transaction descriptions from the sample bank statement PDF below. The descriptions vary a lot in length, so the format is very irregular. I tried extracting the whole description column as one long text and splitting it on a delimiter, but there doesn’t seem to be any clear delimiter to split on.

As shown in the attachment below, each highlighted box marks one transaction. The description length varies, and I could not use Document Understanding to extract them one by one.

I did see that I could probably split them on the delimiter “NewLine” with no space. But the text I extracted with various OCR engines never included the two extra spaces.

What is a better way to extract irregularly formatted data like this? I would appreciate some help.

Thanks!

Hello @james.lee.33

Welcome to the UiPath Community.

Use the Read PDF Text activity (try it both with and without preserving the format) to convert the PDF data to text, and save it to a text file using the Write Text File activity. Analyse the text and see whether you can use regex to extract the required data. If possible, please share the text file.
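For anyone who wants to do the "analyse the text" step outside UiPath, here is a minimal Python sketch. It only reports which lines of the saved text file contain something that looks like a date or an amount, which helps in judging whether regex is workable. The two patterns (dates as dd/mm, amounts as 1,234.56 with an optional trailing sign) and the function name `profile_lines` are assumptions for illustration and would need adjusting to the real statement.

```python
import re

# Assumed shapes; adjust after looking at the extracted text.
DATE = re.compile(r"\b\d{2}/\d{2}\b")             # e.g. 01/04
AMOUNT = re.compile(r"\b\d[\d,]*\.\d{2}[+-]?")    # e.g. 1,098.00 or 5.00-

def profile_lines(text):
    """Return (line, date or None, amount or None) for each non-empty line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        date = DATE.search(line)
        amount = AMOUNT.search(line)
        rows.append((line,
                     date.group() if date else None,
                     amount.group() if amount else None))
    return rows
```

Printing the rows for a whole statement shows quickly whether dates and amounts sit on predictable lines.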

Please also read this post; it will be very useful for you.

Hi Kumar. Thanks for your help.

I have the text files extracted with and without preserving the format. Unfortunately I can’t attach them here, as UiPath says new users cannot upload attachments. So here is a screenshot of the extracted text without formatting.

Is there a way to extract only the part I highlighted?

Thanks

Save the data extracted without preserving the format into a text file, then copy the text and paste it into a reply.

ABC Bank Berhad
1 st 100 Jalan Tun Perak, 50050 Kuala @ABCBank Lumpur, Malaysia
SECTION 14, RF
MUKN PAGE 1
TARIKH PENYATA
ABCDE SDN. BHD. 30/04/21
LEVEL 12, ABCD CENTRE STATEMENT DATE
46200 PETALING JAYA SELANGOR NOMBOR AKAUN
123456789
ACCOUNT
NUMBER
PROTECTED BY PIDM UP TO RM250,OOO FOR EACH DEPOSITOR CORPORATE CURRENT ACCOUNT

TARIKH
MASUK
ENTRY DATE
TARIKH Nil-Al
VALUE DATE
BUTIR URUSNIAGA
TRANSACTION DESCRIPTION
JUMLAH URUSNIAGA
TRANSACTION AMOUNT
BAKI PENYATA
STATEMENT BALANCE

BEGINNING BALANCE 1,000.00
PAYMENT DEBIT - APS /OTHERS 900.00
01/04 MAS PAYMENT 100.00-
Merchant pymt PV-00002
22 Mar to 31 Mar 21
MAS SERVICE CHARGE pymt 895.00
01/04 INTER-BANK PAYMENT INTO A/C 996.00
02/04 SPM CUSTODY FBO STR 5.00-
ABC COMPANY 101.00+
IBG TRANSACTION
INTER-BANK PAYMENT INTO A/C 1,098.00
07/04 SPM CUSTODY FBO STR
ABC COMPANY 102.00+
IBG TRANSACTION
INTER-BANK PAYMENT INTO A/C 1,201.00
07/04 SPM CUSTODY FBO STR
ABC COMPANY 103.00+
IBG TRANSACTION
PAYMENT DEBIT - APS /OTHERS 1,097.00
MAS PAYMENT
08/04 Merchant pymt 104.00-
1 Apr to 4 Apr 21
MAS SERVICE CHARGE pymt 1,094.50
INTER-BANK PAYMENT INTO A/C 1,199.50
08/04 SPM CUSTODY FBO STR 2.50-
08/04 ABC COMPANY 105.00+
IBG TRANSACTION
INTER-BANK PAYMENT INTO A/C 1,305.50
DMS A3 (FOR STRIPE)
08/04 1JJ72OD02JJ0187H394 106.00+
STRIPE UNIFIED PAYOU
INTER-BANK PAYMENT INTO A/C 1,412.50
PETROLIAM NASIONAL
08/04 PPR2931CDGH 107.00+
IF012198HH99O0008
TRANSFER FR A/C
1,304.50
09/04 108.00-
BAKI LEGAR BAKI AKHIR - CEK BELUM JELAS
LEDGER ENDING BALANCE - UNCLEARED CHEQUES Wang yang keluar
berlebihan ditandakan
BALANCE dengan DR
Perhatian / Note
Sempa maklumat dan baki ang dinyatakan di sini akan diangg:p betul
Overdrawn balances are
ketjdaktepatan dalam tempoh 21 hari. denoted by DRAll items and balances shown will be considered correct unless the Bank is
notified in writing of any discrepancies within 21 days.
(2) Sila beritahu kami sebaran pertukaran alamat secara bertulis.
Please notify us of any change of address in writing.

Try this

Extract PDF Unstructured Data.xaml (5.1 KB)

From what I can see, after isolating the description field between “STATEMENT BALANCE” and “LEDGER ENDING BALANCE”, you removed the date by replacing it with a space, and also removed the amount that follows the description. Please correct me if I am wrong.

That could work, as each transaction starts with the date… but I also found out that my sample PDF bank statement has a formatting issue. The date should align with the first line of the description field.

But your solution could work. I will try it on my case once I have corrected the PDF format, and I will post an update on whether it works.
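To illustrate the "each transaction starts with the date" idea, here is a small Python sketch. It assumes the corrected layout, where the dd/mm date sits on the first line of each description; the function name `group_by_leading_date` is hypothetical, and any lines before the first date (such as the opening balance) end up in a group of their own.

```python
import re

# Assumed: a transaction's first line begins with a dd/mm entry date.
STARTS_WITH_DATE = re.compile(r"^\d{2}/\d{2}\b")

def group_by_leading_date(lines):
    """Group lines into transactions; a line starting with a date opens a new group."""
    groups = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if STARTS_WITH_DATE.match(line) or not groups:
            groups.append([line])       # new transaction (or leading lines)
        else:
            groups[-1].append(line)     # continuation of the current description
    return groups
```

A line such as "22 Mar to 31 Mar 21" does not open a new group, because it has no dd/mm prefix.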

Thanks

It replaces them with an empty string, not a space. Everything else is correct.
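For readers without UiPath, the steps discussed here can be sketched roughly in Python. The attached workflow is not reproduced in this thread, so this is only an approximation: the regex patterns and the function name `description_lines` are assumptions inferred from the sample text above, not the workflow's actual contents.

```python
import re

# Assumed patterns, inferred from the sample text in this thread.
DATE = re.compile(r"\b\d{2}/\d{2}\b")             # e.g. 01/04
AMOUNT = re.compile(r"\b\d[\d,]*\.\d{2}[+-]?")    # e.g. 1,098.00 or 5.00-

def description_lines(text):
    """Return description lines between the table markers, without dates and amounts."""
    # Keep only the table body between the header and the footer markers.
    body = text.split("STATEMENT BALANCE", 1)[1]
    body = body.split("LEDGER ENDING BALANCE", 1)[0]
    cleaned = []
    for line in body.splitlines():
        line = DATE.sub("", line)        # replace with an empty string, not a space
        line = AMOUNT.sub("", line)
        line = " ".join(line.split())    # collapse leftover whitespace
        if line:                         # drop lines that held only a date or amount
            cleaned.append(line)
    return cleaned
```

On the full sample above, the Malay footer line that sits just before "LEDGER ENDING BALANCE" would also survive the split and would need trimming separately.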

Alright, I managed to extract the descriptions one by one. Thanks for your help!

Did you run it for every file without using the for loop?

I ran it for every file using a for loop, given that the files have a fixed format.

Okay
