Reading data from an unstructured format

Hi All,

I want to read the highlighted fields below from a PDF file.

image

I tried the Read PDF activity, and it returned the string below as output.

Output:
Drawee Details
Drawee Name Drawee Country Drawee Bank Drawee Branch
AMCOR FLEXIBLES SELESTAT
SAS
FRANCE
Bill Details
Bank Ref Transaction Id Operation Transaction Date Value Date
1541FIGS181192 S88151100 Realisation 25-OCT-2018 25-OCT-2018
Currency Conversion Details
Type From CCY Amount Rate To CCY Amount
Sale EUR 49997.77 83.6267 INR 4181148.51
TRANSACTION DETAILS Invoice Details
Acc No Details CCY Amount Debit Credit Number
Date
CCY
Amount
035
18-SEP-2018
EUR
49997.77
Office Account 1541FIGS181192 EUR 49997.77
Office Account COMM ON FIGC INR 1250.00
Office Account SGST/UTGST @ 9% INR 112.50
Office Account CGST @ 9% INR 112.50
Office Account SGST/UTGST-currency
conversion @ 9%
INR 781.31
Office Account CGST-currency conversion @
9%
INR 781.31

It reads the file in vertical order. How can I fetch the required highlighted fields? Could you please help me with this?

Thanks & Regards,
Lakshman Ganta.

I am not sure about reading the details from the PDF in that fashion. Not to worry: you can extract the required data from the output using string functions such as Contains, Substring, etc.
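
As a rough illustration of the string-function approach, here is a minimal Python sketch (the workflow itself would use the equivalent .NET string methods). It assumes the header line and its values always sit on consecutive lines, as in the Read PDF output quoted above, and that no value contains a space. The helper name `values_after_header` is made up for this example.

```python
# Sample lines copied from the Read PDF output quoted earlier in the thread.
text = """Bill Details
Bank Ref Transaction Id Operation Transaction Date Value Date
1541FIGS181192 S88151100 Realisation 25-OCT-2018 25-OCT-2018
Currency Conversion Details"""


def values_after_header(text, header_prefix):
    """Return the whitespace-separated values on the line after the header line.

    Hypothetical helper: assumes the values follow the header on the very
    next line and that no single value contains a space.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(header_prefix) and i + 1 < len(lines):
            return lines[i + 1].split()
    return []


# Field names are listed by hand because some of them contain spaces.
fields = ["Bank Ref", "Transaction Id", "Operation", "Transaction Date", "Value Date"]
bill = dict(zip(fields, values_after_header(text, "Bank Ref")))
print(bill)
```

This would not be enough for sections such as Drawee Details, where a value like the drawee name contains spaces and wraps onto more than one line; those need their own handling.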

Hi,

I know we can get the data using string functions, but how do I map the values to fields such as Drawee Name, Bank Ref, Transaction Date, etc.?
Is there any other way to read this kind of file? If so, please let me know.

Regards,
Lakshman Ganta.

I don't think so; you will have to write code to extract each piece of information.

Hi Lakshman,
From the look of it, the PDF does not appear to be a scanned document, so PDF scraping using “Get Text” or Anchor Base might work.
If that’s not the case, the other option is to use “Scrape Relative” under “Desktop Recording”. This should work because it lets you identify reference elements; in your case, the reference element would be “Drawee Name”, for instance.


Hi,
If there is a stable pattern in the text extracted by the Read PDF activity, I usually use regex to parse it.

E.g. to retrieve Bank Ref, Transaction ID, Operation, Transaction Date and Value Date you could use:

"^Bank Ref.*[\r\n](?<bankref>.*?)\s(?<transid>.*?)\s(?<operation>.*?)\s(?<transdate>.*?)\s(?<valuedate>.*?)\s"

I often use https://regex101.com/ to build the regex
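
For anyone who wants to try the idea outside the workflow, here is a small Python sketch of the same approach. It is an adaptation, not the exact pattern above: Python writes named groups as `(?P<name>...)`, and I have assumed `\S+` for each value and `[\r\n]+` for the line break so that Windows line endings are tolerated. The `^` anchor needs multiline mode (in .NET that would be `RegexOptions.Multiline`).

```python
import re

# Sample lines copied from the Read PDF output quoted earlier in the thread.
text = """Bill Details
Bank Ref Transaction Id Operation Transaction Date Value Date
1541FIGS181192 S88151100 Realisation 25-OCT-2018 25-OCT-2018
Currency Conversion Details"""

# Match the header line, then capture the five space-separated values
# on the following line. Assumes no value contains whitespace.
pattern = re.compile(
    r"^Bank Ref.*[\r\n]+"
    r"(?P<bankref>\S+)\s(?P<transid>\S+)\s(?P<operation>\S+)\s"
    r"(?P<transdate>\S+)\s(?P<valuedate>\S+)",
    re.MULTILINE,
)

match = pattern.search(text)
# groupdict() maps each group name to the captured value.
result = match.groupdict() if match else {}
print(result)
```

The group names then give the mapping from field to value directly, which is the part that plain substring calls leave to you.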
Cheers

Hi @J0ska,

I will try it with a regular expression and let you know.