How to extract the table using Regex or String Manipulation

I need to extract the table below from a PDF
test.pdf (68.2 KB)
using Regex or string manipulation (no OCR).
The set of values may differ from one PDF to another, i.e. the number of due-date lines and their sublines is not constant.
There may be 3 sets, 4 sets, or even just 1 set.
The line after the table is also not always the same.

Hi @sunilkanth

Is the data in text file or any application?

Use the Read PDF Text activity and write its output to a text file; then we can extract the data using regex.

Hi @sunilkanth

Can you share the variations of how the tables might look? If you can share the PDF files, that would really help in understanding the actual issue.

cheers

It's a PDF file.

test.pdf (68.2 KB)

I have created a sample file since I cannot share the actual one.

I am already doing that. I am having trouble writing the regex for the table above.

Hi @sunilkanth

If you have all the values, you can split the text on "account", since that word appears in all the tables.

Then take the values between the splits and pass them to a Generate Data Table activity.

cheers

I understand, you want to extract the whole table from the two pages, right?

Yes, but to generate the table I first need to separate the table values from the rest of the PDF; only then can I pass them as input to the ‘Generate Data Table’ activity.

I am not sure how to select the table values, because the number of times the word
Total occurs is not constant and the line below the table is also not constant.

I have given 3 samples on the two pages.

I want to extract everything from Next Bill to the last line of the same table, i.e. Total.

Due Date

Date:(?s)(.*) Amount - RegEx to select everything between the two words (Date: and Amount)

matchesDueDate(0).ToString.Substring(5).Remove(matchesDueDate(0).ToString.Substring(5).Length-8) - Removes the extra matched words (Date: at the start and Amount at the end) from the result
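As a rough illustration of the "everything between two words" idea, here is a minimal Python sketch. The sample text and the helper name `between` are made up for this example, since the real Read PDF Text output is not shown in the thread. It uses a capture group so the markers never need to be trimmed off with Substring; the same pattern should carry over to .NET Regex with `Groups(1).Value`, though that is an assumption worth testing on the real text.

```python
import re

# Hypothetical sample text; the real PDF text will look different.
text = "Due Date: 01/02/2020\n15/02/2020 Amount 100.00"

def between(text, start="Date:", end=" Amount"):
    """Return the text between the first start marker and the next end marker."""
    # (?s) lets "." match newlines; the lazy ".*?" stops at the first end marker.
    pattern = r"(?s)" + re.escape(start) + r"(.*?)" + re.escape(end)
    m = re.search(pattern, text)
    # group(1) holds only the captured middle part, without the markers.
    return m.group(1).strip() if m else None

print(between(text))
```

The lazy quantifier matters when the markers occur more than once: a greedy `.*` would run to the last " Amount" in the text instead of the nearest one.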

This will help you.

Hi @sunilkanth

For extracting each table separately please use this

System.Text.RegularExpressions.Regex.Matches("YourString","Due Date(.|\n)*total\s+\d+.\d+")

This will give you each table's string separately.
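To show what "each table separately" can look like, here is a small Python sketch on invented sample text (the actual PDF layout is not visible in this thread, so `SAMPLE`, `TABLE` and `extract_tables` are hypothetical). It is a variant of the pattern above, not the same expression: it uses a lazy `.*?`, escapes the decimal point, and ignores case. It also assumes each table ends at its first Total line; if a table contains several Total lines, the end of the pattern would need a different anchor.

```python
import re

# Invented sample with two tables separated by unrelated text.
SAMPLE = """Header line
Due Date Amount
01/01/2020 10.00
Total 10.00
Some other text
Due Date Amount
02/02/2020 20.00
03/02/2020 5.50
Total 25.50
Footer
"""

# DOTALL lets "." cross line breaks; ".*?" is lazy, so each match stops
# at the first "Total <number>" after its own "Due Date".
TABLE = re.compile(r"Due Date.*?Total\s+\d+\.\d+", re.DOTALL | re.IGNORECASE)

def extract_tables(text):
    """Return one string per table found in the text."""
    return [m.group(0) for m in TABLE.finditer(text)]

for table in extract_tables(SAMPLE):
    print(table)
    print("---")
```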

cheers

I understand this will give you the value for one set of data. But how will you know how many times to repeat this?

Hi @sunilkanth

It won't give just one set; it will give you all the matches.

Let us say you store the output in a variable var (type: MatchCollection); var.Count will then tell you how many tables were retrieved.
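The counting step can be sketched in Python as follows; this is only an analogue of `MatchCollection.Count`, using made-up sample text and a lazy variant of the pattern. Splitting each match into rows is an extra assumption about what might be fed to a Generate Data Table step, not something taken from the thread.

```python
import re

# Made-up text containing two tables.
text = (
    "Due Date Amount\n01/01/2020 10.00\nTotal 10.00\n"
    "unrelated line\n"
    "Due Date Amount\n02/02/2020 20.00\nTotal 20.00\n"
)

# All matches at once; no need to know in advance how many tables exist.
matches = list(re.finditer(r"(?s)Due Date.*?Total\s+\d+\.\d+", text))
count = len(matches)  # plays the role of var.Count

# One list of rows per table, each row split on whitespace.
rows_per_table = [
    [line.split() for line in m.group(0).splitlines()]
    for m in matches
]

print(count)
for rows in rows_per_table:
    print(rows)
```

Because the regex engine returns every match in one call, the loop runs once per table found, whether the PDF has 1, 3 or 4 sets.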

cheers

Thank You. It worked :grinning:
