Regex code to pull text from PDFs

Hello community!

Can anyone help me with dynamic code to pull text from PDFs?
Below is the text from a PDF. I need the text starting from "Sub :" up to "Members are advised to take note of the above and ensure compliance."
I used the regex (Sub:)[\w\W\d]+(?<=Members are advised to take note of the above and ensure compliance.) and it pulled the text from the other PDFs, but one PDF failed. What is a dynamic way to pull this specific text from PDFs?
Below is the text extracted from the PDF:

@"National Stock Exchange of India
Circular

Department: Investigation

Download Ref No: NSE/INVG/58097 Date: August 24, 2023

Circular Ref. No: 190/2021

To All NSE Members

Sub : SAT Order in the matter of M/s Global Infratech and Finance Limited

This is with reference to NSE Circular No. NSE/INVG/48969 dated July 18, 2021 in respect of SEBI
Order No. WTM/MB/IVD/ID6/ 12613 /2021-22 dated July 16, 2021, NSE circular no.
NSE/INVG/49108 in respect of SAT order dated July 27, 2021 and NSE circular no.
NSE/INVG/49437 dated August 27, 2021 and SAT order dated August 24, 2021.

SAT vide its order dated August 24, 2023 issued against Appeal no. 513/514/515/556 of 2021
and 636/855 of 2022 made by Anoop Jain, Anoop Jain (HUF), Ritu Jain, Ms. Ammaji Anumolu and
Ms. Anumolu Harshitha has directed that the impugned order passed by the WTM cannot be
sustained and is quashed.

The detailed order is available on SEBI website (https://sat.gov.in/scripts/search.asp).

Further, the consolidated list of such entities is available on the Exchange website
http://www.nseindia.com home page at the below mentioned link:

Members are advised to take note of the above and ensure compliance.

In case of any further queries, members are requested to email us at dl-invsg-all@nse.co.in

For and on behalf of
National Stock Exchange of India Limited
National Stock Exchange of India

Sandesh Sawant
Senior Manager

ANNEXURE : SAT Order in the matter of M/s Global Infratech and Finance Limited
BEFORE THE SECURITIES APPELLATE TRIBUNAL
MUMBAI

Order Reserved on : 21.08.2023

Date of Decision : 24.08.2023

Appeal No. 513 of 2021

Hello

Try the below pattern:
(Sub\s*:\s*)[\w\W]+(?<=Members are advised to take note of the above and ensure compliance.)
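The difference between the pattern in the question and the one above can be checked quickly. The sketch below uses Python rather than the .NET regex engine that UiPath uses, on the assumption that these particular constructs behave the same in both; the `sample` string is a shortened stand-in for the extracted PDF text, not the full circular.

```python
import re

# Shared tail: the match must end right after the closing sentence.
TAIL = r"(?<=Members are advised to take note of the above and ensure compliance.)"

# Pattern from the question: only matches "Sub:" written without a space.
strict = re.compile(r"(Sub:)[\w\W\d]+" + TAIL)
# Pattern from the reply: allows optional whitespace around the colon.
flexible = re.compile(r"(Sub\s*:\s*)[\w\W]+" + TAIL)

# Shortened stand-in for the extracted PDF text (assumption for illustration).
sample = (
    "To All NSE Members\n\n"
    "Sub : SAT Order in the matter of M/s Global Infratech and Finance Limited\n\n"
    "The detailed order is available on SEBI website.\n\n"
    "Members are advised to take note of the above and ensure compliance.\n\n"
    "In case of any further queries, members are requested to email us.\n"
)

print(strict.search(sample))   # None, because the text has "Sub :" with a space
match = flexible.search(sample)
print(match.group(0))          # from "Sub : SAT Order..." to "...ensure compliance."
```

If some PDFs write "Sub:" and others "Sub :", the `\s*` on both sides of the colon is what lets one pattern cover both.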


Cheers

Steve

Hi

You can extract the data using the Split method. Let me know exactly what you need and I will give you a solution.

@Steven_McKeering I want the text up to "Members are advised to take note of the above and ensure compliance."

@Rajnish_Arora I need the PDF text starting from "Sub :" and ending at "Members are advised to take note of the above and ensure compliance."

Note that this text appears in multiple PDFs, so the layout is similar.

Hello

Updated the pattern in my first post. Please check again.



Here is your solution.
Happy automation!

@Rajnish_Arora


It is not working; it is capturing everything.


I used the same code, but it also captures that comma, and there are multiple PDFs with different Sub names.


@Steven_McKeering Your regex did not work. When I entered it in my process, it did not pull any text from the PDFs.

You entered the pattern incorrectly.

It should be lowercase ‘\s’, not ‘\S’.

Please check my pattern below

Also, you may want to make the whole pattern case-insensitive.
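To see why the uppercase and lowercase forms behave so differently, here is a minimal sketch in Python (an assumption for illustration; UiPath itself uses the .NET regex engine, where `\s` and `\S` have the same meaning). The `line` value is copied from the Sub line of the extracted text.

```python
import re

line = "Sub : SAT Order in the matter of M/s Global Infratech and Finance Limited"

# \s* matches optional whitespace, so the space before the colon is allowed.
with_lower_s = re.search(r"Sub\s*:\s*", line)
# \S* matches optional NON-whitespace, so the space before the colon blocks the match.
with_upper_s = re.search(r"Sub\S*:\S*", line)

print(with_lower_s is not None)  # True
print(with_upper_s is not None)  # False
```

With `\S` the pattern cannot get past the space in "Sub :", which would explain a process that pulls no text at all.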

@Steven_McKeering @Rajnish_Arora



There are different subjects like this.
Usually in this process the user wants subject names such as "SEBI order in the matter", "SAT order in the matter", "order in the matter", "Interim order in", etc. These are the various subjects. If the bot finds one of these subject names, it should download that PDF and extract the text from "Sub:" to "Members are advised to take note of the above and ensure compliance."
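One way to express the subject check described above is to read the Sub line first and compare it against the list of wanted phrases before doing the full extraction. This is only a sketch in Python under stated assumptions: the function name `is_wanted_circular` and the tuple `WANTED_SUBJECTS` are hypothetical, and the phrases are the ones listed in the post.

```python
import re

# Subject phrases listed in the post (lowercased); extend the tuple as needed.
WANTED_SUBJECTS = (
    "sebi order in the matter",
    "sat order in the matter",
    "order in the matter",
    "interim order in",
)

# Captures the rest of the line after "Sub :" or "Sub:".
SUBJECT_LINE = re.compile(r"Sub\s*:\s*(.+)", re.IGNORECASE)


def is_wanted_circular(pdf_text: str) -> bool:
    """Return True when the Sub line contains one of the wanted phrases."""
    found = SUBJECT_LINE.search(pdf_text)
    if found is None:
        return False
    subject = found.group(1).lower()
    return any(phrase in subject for phrase in WANTED_SUBJECTS)


print(is_wanted_circular("Sub : SAT Order in the matter of M/s Example Ltd\n"))  # True
print(is_wanted_circular("Sub : Change in trading holiday schedule\n"))          # False
```

Only PDFs that pass this check would then go on to the "Sub" to "ensure compliance." extraction.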

Make your pattern case-insensitive like this:
System.Text.RegularExpressions.Regex.Match(PDF_Readdt, "(Sub\s*:\s*)[\w\W]+(?<=Members are advised to take note of the above and ensure compliance.)", RegexOptions.IgnoreCase)

@Steven_McKeering


It is showing an error in the Assign activity.

You are missing “.Regex” in the syntax.

System.Text.RegularExpressions.Regex.Match

@Steven_McKeering Yes, after adding , RegexOptions.IgnoreCase).Value it is working.

@Steven_McKeering Is this the right method?
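The reason the trailing .Value matters is that Regex.Match returns a Match object rather than a string, so a String variable needs the matched text pulled out of it. The same idea, sketched in Python as an illustration (not the UiPath expression itself): `re.search` returns a match object and `.group(0)` plays the role of .Value. The function name `extract_section` is hypothetical.

```python
import re

PATTERN = (
    r"(Sub\s*:\s*)[\w\W]+"
    r"(?<=Members are advised to take note of the above and ensure compliance.)"
)


def extract_section(pdf_text: str) -> str:
    """Return the text from 'Sub :' to the closing sentence, or '' if absent."""
    # re.IGNORECASE is the Python counterpart of RegexOptions.IgnoreCase.
    found = re.search(PATTERN, pdf_text, re.IGNORECASE)
    # found is a match object, not a string; group(0) is the matched text.
    return found.group(0) if found else ""


text = (
    "SUB: Interim order in the matter of Example Ltd\n"
    "members are advised to take note of the above and ensure compliance.\n"
    "For and on behalf of the Exchange\n"
)
print(extract_section(text))
```

Returning an empty string when nothing matches mirrors what .Value gives in .NET for an unsuccessful match.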

Hi @Priyesh_Shetty

You can try it this way:

reg.Split({"Sub :"},StringSplitOptions.RemoveEmptyEntries)(1).Split({"Members"},StringSplitOptions.RemoveEmptyEntries)(0).Trim

Please find the screenshot for reference.
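For comparison with the regex answers, the split approach above amounts to: cut the text at the start marker, keep what follows, cut again at the end marker, keep what precedes, and trim. A minimal sketch in Python (the function name `extract_by_split` and its default markers are assumptions for illustration):

```python
def extract_by_split(pdf_text: str, start: str = "Sub :", end: str = "Members") -> str:
    """Return the text between the first start marker and the next end marker."""
    parts = pdf_text.split(start, 1)
    if len(parts) < 2:
        # Start marker not found, for example when the PDF has "Sub:" without a space.
        return ""
    # Keep everything before the first end marker that follows the start marker.
    return parts[1].split(end, 1)[0].strip()


sample = (
    "To All NSE Members\n\n"
    "Sub : SAT Order in the matter of M/s Example Ltd\n\n"
    "The detailed order is available on the website.\n\n"
    "Members are advised to take note of the above and ensure compliance.\n"
)
print(extract_by_split(sample))
```

Two differences from the regex version are worth noting: the markers themselves are not part of the result, and the start marker must match exactly, so "Sub:" and "Sub :" need separate handling. A longer end marker such as "Members are advised" is less likely to cut the text short if the word "Members" appears earlier in the body.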
