Hello
I scraped information from a PDF, and this Word doc has the output. I need to use regex to extract certain fields. The words highlighted in blue are the names of the “fields” in the PDF. The words highlighted in yellow are the information I need to extract using regex. Can someone help me build the patterns? Any assistance will be appreciated.
extracted PDF info.docx (15.9 KB)
Hi @gustavo_marrufo
Just to confirm: some fields, such as the Transferee name, occur twice. Do you need the data from those occurrences as well, or only the highlighted data?
Yes, I did notice, but the information highlighted in yellow is what I need to extract using regex.
Hi @gustavo_marrufo
Below are the regex patterns, one per highlighted value or field:
U.S. US Fish & Wildlife Service/Region 7 ------ (?<=DEPARTMENT OR AGENCY, BUREAU OR SERVICE, AND LOCATION SHOWN ON SUBVOUCHERS BUR. VOU. NO.\s+).*
CARRIER'S BILL NUMBER ------------------ (?<=CARRIER'S BILL NUMBER )\w+
Transferee: ------------------------------- (?<=^Transferee: )\w+ (you have to set the multiline option for this one)
GBL Number: ---------------------------- (?<=PAYEE’S CERTIFICATE\s+GBL Number: ).*
TA Number ------------------------------- (?<=TA Number: ).*
Total Claimed --------------------------- (?<=TOTAL CLAIMED . )\$[\d\.]+
Invoice Number ------------------------ (?<=Invoice Number: )\w+
Total Charges ------------------------------ (?<=Total Charges\s+)[\d\.]+
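As a rough illustration of how patterns like these might be applied in a script, here is a minimal Python sketch. The sample text is invented for illustration (the real document is in the attachment and is not reproduced here), and the helper name `extract_fields` is hypothetical. Python's `re` module only accepts fixed-width lookbehind, so a pattern such as `(?<=Total Charges\s+)` is rewritten here with a capture group; engines such as .NET accept the variable-width form as posted.

```python
import re

# Invented sample text, only loosely shaped like the field labels above.
SAMPLE_TEXT = """CARRIER'S BILL NUMBER AB12345
Transferee: Smith
TA Number: TA-0001
Invoice Number: INV789
Total Charges    1234.56
"""

# Capture-group versions of some of the patterns above.
PATTERNS = {
    "carrier_bill_number": r"CARRIER'S BILL NUMBER (\w+)",
    "transferee": r"^Transferee: (\w+)",
    "ta_number": r"TA Number: (.*)",
    "invoice_number": r"Invoice Number: (\w+)",
    "total_charges": r"Total Charges\s+([\d.]+)",
}


def extract_fields(text):
    """Return the first match of each pattern, or None when a field is absent."""
    results = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        results[name] = match.group(1) if match else None
    return results


fields = extract_fields(SAMPLE_TEXT)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Returning `None` for a missing field, rather than raising, makes it easier to spot which labels did not match when the scraped text differs from what was expected.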
These regex patterns are written for the specified document.
Please ensure the multiline option is set for all of them.
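To illustrate why the multiline option matters for the `^Transferee:` pattern, this small Python sketch compares a search with and without it. The text is made up, and `re.MULTILINE` stands in here for the multiline option mentioned above. The option only changes where `^` and `$` can match, so patterns without those anchors should behave the same either way.

```python
import re

# Made-up text in which the label is not on the first line.
text = "Header line\nTransferee: Smith\nFooter line"

pattern = r"(?<=^Transferee: )\w+"

# Without the multiline flag, ^ only matches at the very start of the text.
without_flag = re.search(pattern, text)

# With the flag, ^ also matches right after each newline.
with_flag = re.search(pattern, text, flags=re.MULTILINE)

print(without_flag)
print(with_flag.group(0))
```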
Regards,
Nived N
Hi @gustavo_marrufo
If this resolves your query, kindly mark the appropriate answer as the solution.