Help on parsing PDF text

Hi all,

See the text below. I need to break this string into 3 groups, each starting with CHAIN:, because each group pertains to a different reporting division and set of dates. Thoughts? This is in a datatable in Studio.

Column1
CHAIN: ODS390-001 DUMMY COMPANY WIRE DATE: 04/11/23
“04/11 02:32 04/11 POS 1,858,080.72 VISA & MASTERCARD 1,602,745.01”
CHAIN: ODS390-001 DUMMY COMPANY WIRE DATE: 04/11/23
04/10 06:32 04/10 POS 839.62 VISA & MASTERCARD 829.62
“CARD FEES 28,003.59 DB”
BILLING SYSTEM FEES: .00 DB
NET CHARGEBACKS: 336.76 CR
PROCESSING ADJUSTMENTS: .00 DB
CHAIN: ODS390-002 DUMMYCOMPANY WIRE DATE: 04/11/23
“04/10 22:49 04/10 EMD 840,764.18 VISA & MASTERCARD 817,939.83”
CHAIN: 0DS5390-002 DUMMY COMPANY WIRE DATE: 04/11/23
“CARD FEES 14,118.07 DB”
BILLING SYSTEM FEES: .00 DB
“NET CHARGEBACKS: 3,626.40 CR”
PROCESSING ADJUSTMENTS : .00 DB
CHAIN: ODS390-004 DUMMY COMPANY WIRE DATE: 04/11/23
“04/11 00:34 04/11 POS 3,498.27 VISA & MASTERCARD 1,628.69”
“04/11 02:32 04/11 POS 28,216.17 VISA & MASTERCARD 19,174.75”
CHAIN: ODS390-004 DUMMY COMPANY WIRE DATE: 04/11/23
CARD FEES 342.89 DB
BILLING SYSTEM FEES: .00 DB
NET CHARGEBACKS: .00 DB
PROCESSING ADJUSTMENTS: .00 DB
CHAIN: ODS390 DUMMY COMPANY WIRE DATE: 04/11/23

Have a look here for an initial segmentation:

(CHAIN\:)[\s\S]+?(?=CHAIN\:|\z)
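For readers who want to try the pattern outside Studio, here is a minimal Python sketch of the same idea: a lazy match from each CHAIN: up to the next CHAIN: or the end of the text. The sample text is a shortened copy of the data in the question, and the variable names are only illustrative. One assumption to note: Python's `re` module spells the end-of-string anchor `\Z`, where the .NET pattern above uses `\z`.

```python
import re

# Shortened sample in the shape of the OCR output from the question.
text = (
    "CHAIN: ODS390-001 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "04/10 06:32 04/10 POS 839.62 VISA & MASTERCARD 829.62\n"
    "CHAIN: ODS390-002 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "CARD FEES 14,118.07 DB\n"
    "BILLING SYSTEM FEES: .00 DB\n"
    "CHAIN: ODS390-004 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "CARD FEES 342.89 DB\n"
)

# Lazy match from each CHAIN: up to (not including) the next CHAIN:
# or the end of the text. [\s\S] also matches newlines, so the rows
# below each header stay in the same block.
pattern = re.compile(r"CHAIN:[\s\S]+?(?=CHAIN:|\Z)")

blocks = [m.group(0).strip() for m in pattern.finditer(text)]

for block in blocks:
    print(block)
    print("-" * 20)
```

Each block keeps its header line plus the rows below it. In the sample from the question the same chain ID appears on more than one CHAIN: line, so blocks that share an ID may still need to be merged afterwards.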

Some parts of the description are unclear.

It would be better to provide the input text as a text file, along with a sample of the expected output.

Thanks, my friend. I tried this already, but it only gives me the lines starting with CHAIN:. I am trying to segment the groups that sit between each CHAIN: keyword into separate blocks of text. I appreciate the idea though.

Hi @Chris_Bolin

Did you try splitting on “CHAIN:”?

That will give you an array of strings, split at each CHAIN:,

and then you can apply your other operations.

Thanks,
Aditya

If I split on CHAIN:, it gives me what comes after “CHAIN”, but only on the same line. It does not give me the next row of data. What I am attempting now is to count how many times I find a keyword, put that into an array of Int, and then loop through that array to split the text into group parts. Since I do not know how many times this new keyword will appear in any given document, I created a counter starting at 1. In a Do While, I split the text, look for the data I need, and then increase the counter until it reaches the number of times the word was found. It is just really code-heavy, which I do not like. If anyone has any other ideas, hit me up.
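The counting approach described above can be written without a manual counter. This is a rough Python sketch, not the Studio workflow itself: it collects the start offset of every keyword occurrence and slices the text between consecutive offsets. The sample text and the names `keyword`, `starts`, and `ends` are assumptions made for illustration.

```python
import re

# Shortened sample in the shape of the OCR output from the question.
text = (
    "CHAIN: ODS390-001 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "04/10 06:32 04/10 POS 839.62 VISA & MASTERCARD 829.62\n"
    "CHAIN: ODS390-002 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "CARD FEES 14,118.07 DB\n"
    "CHAIN: ODS390-004 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "CARD FEES 342.89 DB\n"
)

keyword = "CHAIN:"

# Start offset of every keyword occurrence; the count falls out of the list
# length, so it does not need to be known in advance.
starts = [m.start() for m in re.finditer(re.escape(keyword), text)]

# Each block ends where the next one starts; the last block runs to the end.
ends = starts[1:] + [len(text)]

blocks = [text[s:e].strip() for s, e in zip(starts, ends)]

print(len(starts), "occurrences of", keyword)
for block in blocks:
    print(block)
    print("-" * 20)
```

The number of occurrences can vary from document to document without any change to the code.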

@Chris_Bolin
We can help better when you give a clear expected output.

The description so far is too vague.

The text file was requested so that we can reliably refer to the line breaks when prototyping.

So isn’t this whole data kept at DT(0)(0)? Is it spread across several rows instead?

If it is spread across several rows, you can append all the data into a single string and then apply the split function.

Regards,
Aditya
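The append-then-split suggestion can be sketched in Python as follows, assuming the rows arrive as a list of strings (the `rows` list here is a hypothetical stand-in for the datatable column). Joining with a newline keeps the row boundaries, so one join replaces a separate append per line.

```python
# Hypothetical stand-in for the rows of Column1 in the datatable.
rows = [
    "CHAIN: ODS390-001 DUMMY COMPANY WIRE DATE: 04/11/23",
    "04/10 06:32 04/10 POS 839.62 VISA & MASTERCARD 829.62",
    "CHAIN: ODS390-002 DUMMY COMPANY WIRE DATE: 04/11/23",
    "CARD FEES 14,118.07 DB",
    "BILLING SYSTEM FEES: .00 DB",
]

# One join instead of an append per row; "\n" preserves the row boundaries.
text = "\n".join(rows)

# Splitting the joined string keeps every row up to the next CHAIN: in the
# same piece. The first piece is empty because the text starts with CHAIN:,
# so empty pieces are dropped.
groups = [part.strip() for part in text.split("CHAIN:") if part.strip()]

for group in groups:
    print(group)
    print("-" * 20)
```

Because the split runs over the whole joined string rather than one row at a time, each piece contains the rest of its header line and the rows that follow it.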

It is in a string. After grouping, I will generate a DT and filter that DT for the information I need.

The PDF is highly structured, and I had to read it with Tesseract OCR to get it into one string. There are potentially 200+ lines that I ingest into the code, and I do not want to write an append for each line. I am looking to group the data between occurrences of the keyword CHAIN:, which seems to mark a break between the data sets. Hope that helps with understanding.

@Chris_Bolin you can try the regex below. It extracts only the text of each group, without the word CHAIN:

(?<=CHAIN:)[\s\S]+?(?=CHAIN:|\z)
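As a hedged illustration of the lookbehind variant, the Python sketch below drops the CHAIN: keyword from each match and then separates the rest of the header line from the rows under it. The sample text and the `groups` structure are assumptions for the example, and `\Z` is used because that is how Python's `re` module spells the end-of-string anchor.

```python
import re

# Shortened sample in the shape of the OCR output from the question.
text = (
    "CHAIN: ODS390-001 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "04/10 06:32 04/10 POS 839.62 VISA & MASTERCARD 829.62\n"
    "CHAIN: ODS390-002 DUMMY COMPANY WIRE DATE: 04/11/23\n"
    "CARD FEES 14,118.07 DB\n"
    "BILLING SYSTEM FEES: .00 DB\n"
)

# The lookbehind requires CHAIN: immediately before the match without
# including it, so each match starts right after the keyword.
pattern = re.compile(r"(?<=CHAIN:)[\s\S]+?(?=CHAIN:|\Z)")

groups = []
for m in pattern.finditer(text):
    body = m.group(0).strip()
    if not body:
        continue  # skip a CHAIN: followed only by whitespace
    # First line is the rest of the header; the remaining lines are its rows.
    header, *rows = body.splitlines()
    groups.append({"header": header, "rows": rows})

for group in groups:
    print(group["header"], "->", len(group["rows"]), "row(s)")
```

Keeping the header separate from its rows makes it easier to build a table per group later.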

I found a solution myself
