Grab specific info in PDF text with Regex

Joanne_Chang_LAX · August 15, 2023, 11:30pm

This is the result after reading the PDF and writing it to a text file.

I had to block out most of the information, or else our customer information would be leaked.
But is there a way to get the text under specific keywords using regex?
I know there are ways to get the words in front of or behind a keyword, but I don’t see a way to grab the words below it.
For example, I would like to get the blue column’s data under B/L #, or the blue column under Container #.

The number of container #s also changes from file to file: sometimes there’s only one, sometimes several, sometimes none.

Please help, I would appreciate it a lot. Thank you.

Yoichi · August 16, 2023, 12:04am

Hi,

It may be better to use the Document Understanding framework.

https://docs.uipath.com/document-understanding/standalone/2022.4/user-guide/introduction

If you need to extract them with regex, can you share the specific input text and the expected output as a file? Dummy data is fine.

Regards,

natanael.mendes · August 16, 2023, 12:58am

Yes, it is possible. Get all the PDF text, and then you can create a match for each value.

Parvathy · August 16, 2023, 1:02am

Hi @Joanne_Chang_LAX

=> Use Read PDF Text or Read PDF with OCR to read the PDF and store the output in a variable, say str_text.
=> Use the Write Text File activity to write the PDF text to a text file.
=> After writing the text file, you can use regex to extract the text.
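
A minimal sketch of that last step, written in Python rather than as a UiPath expression, assuming the extracted text puts each value on the line directly below its keyword. The sample string and the helper name `value_below` are made up for illustration; the real layout of the PDF text is not shown in this thread.

```python
import re

# Made-up stand-in for the text produced by Read PDF Text; not the real file.
str_text = "B/L #\nBL0001\nLAST FREE DAY\n08/20/2023\n"


def value_below(text, keyword):
    """Return the content of the line directly below `keyword`, or None.

    Assumes the keyword ends its line and the value sits on the next line.
    """
    # Keyword, optional trailing spaces, a line break, then capture the
    # next line starting from its first non-space character.
    pattern = re.escape(keyword) + r"[ \t]*\r?\n[ \t]*(\S[^\r\n]*)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


print(value_below(str_text, "B/L #"))
print(value_below(str_text, "PICKUP #"))
```

The function returns `None` when the keyword is absent, which covers files where a field does not appear at all.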

Share the sample text you want to extract from so that we can help you with the regex.

I think you have created a duplicate post with the same question.
https://forum.uipath.com/t/grab-specific-info-in-pdf-text/572712?u=parvathy
Take a look at that.

Hope it helps!!

Usha_Jyothi · August 16, 2023, 5:28am

Take the entire line that contains the data you need, then extract the particular value by splitting the string and picking the element by index.

Hope this helps
Usha
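
A rough sketch of the split-by-index idea, in Python for illustration. The sample line, the marker text, and the helper name `field_from_line` are all hypothetical; which index holds the wanted value depends on the real text, which is not shown here.

```python
# Made-up line standing in for one line of the extracted PDF text.
sample_text = "HEADER LINE\nPICKUP DATE: 08/20/2023 REF 12345\nFOOTER\n"


def field_from_line(text, marker, index):
    """Find the first line containing `marker` and return its field at `index`.

    Fields are separated by whitespace. Returns None if no line contains
    the marker or the line has too few fields.
    """
    for line in text.splitlines():
        if marker in line:
            parts = line.split()
            if 0 <= index < len(parts):
                return parts[index]
            return None
    return None


# "PICKUP", "DATE:", "08/20/2023", "REF", "12345" -> index 2 is the date
print(field_from_line(sample_text, "PICKUP DATE", 2))
```

This only works while the value keeps the same position on its line, so it suits fixed layouts better than columns with a varying number of rows.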

Brian_Mathew_Maben · August 16, 2023, 6:14am

Hey @Joanne_Chang_LAX ,

You can use regex to retrieve the data after the ‘Read PDF Activity’, provided that the data has specific patterns that make it unique.
Or you can use ‘Document Understanding’, which will require you to use certain intelligent packages. Here is a playlist below to help you get started.
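
To illustrate the "specific patterns" point for the regex route, here is a small Python sketch. It assumes the container numbers follow the common layout of four capital letters followed by seven digits (the ISO 6346 style); the thread does not confirm that these documents use it, and the helper name `find_container_numbers` is made up.

```python
import re

# Assumption: four capital letters + seven digits, e.g. ABCU1234567.
CONTAINER_RE = re.compile(r"\b[A-Z]{4}\d{7}\b")


def find_container_numbers(text):
    """Return every substring that looks like a container number."""
    return CONTAINER_RE.findall(text)


# Made-up sample text, not taken from the real PDF.
print(find_container_numbers("CONTAINER #\nABCU1234567\nDEFU7654321\n"))
print(find_container_numbers("no containers in this file"))
```

Because `findall` returns a list, the same call handles files with one, several, or no container numbers.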

Joanne_Chang_LAX · August 16, 2023, 3:43pm

Sorry, I deleted that post because I wanted to be more detailed in my question. I’ve provided a sample text below. I appreciate the help.

TEST.txt (2.5 KB)

Again, I need the info after “LOAD PICKUP POOL ADDRESS” and under “B/L #”, “CONTAINER #”, “LAST FREE DAY”, and “PICKUP #”. The number of items under the last three varies: there might be none, or there might be several.
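
One possible way to read values "under" a keyword when the count varies, sketched in Python. It assumes the extracted text keeps the columns aligned with spaces, so each value starts at the same character position as its header; whether Read PDF Text preserves that alignment for this file is not confirmed. The sample table and the helper name `column_under` are invented and do not reproduce TEST.txt.

```python
import re

# Invented table; ljust keeps every column at a fixed character offset.
rows = [
    ("B/L #", "CONTAINER #", "LAST FREE DAY"),
    ("BL0001", "ABCU1234567", "08/20/2023"),
    ("", "DEFU7654321", "08/21/2023"),
]
sample = "\n".join(a.ljust(14) + b.ljust(16) + c for a, b, c in rows) + "\n\nNOTES\n"


def column_under(text, header):
    """Collect the values below `header` until the first blank line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        start = line.find(header)
        if start == -1:
            continue
        values = []
        for below in lines[i + 1:]:
            if not below.strip():
                break  # a blank line ends the table
            segment = below[start:]
            if not segment or segment[0].isspace():
                continue  # this row has no value in the column
            # The cell ends at the next run of two or more spaces.
            values.append(re.split(r"\s{2,}", segment, maxsplit=1)[0].strip())
        return values
    return []  # header not present in this file


print(column_under(sample, "CONTAINER #"))
```

An empty list comes back both when the header is missing and when nothing is listed under it, so the caller can treat "none" the same way in either case.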

Parvathy · August 16, 2023, 4:22pm

Hi @Joanne_Chang_LAX

Could you please mark the required output in bold so that we can give you the regex?

Regards

Joanne_Chang_LAX · August 16, 2023, 4:39pm

Sorry, I didn’t understand what you meant. Can you state it again?

Parvathy · August 16, 2023, 4:41pm

Mark the output that you need in bold, @Joanne_Chang_LAX.

Regards

Joanne_Chang_LAX · August 16, 2023, 9:28pm

未命名文件 (2).docx (14.0 KB)

Brian_Mathew_Maben · August 17, 2023, 5:19am

Hey @Joanne_Chang_LAX ,

Check this workflow out.

AI_Forum.zip (11.5 KB)
Input file:
未命名文件 (2).pdf (30.9 KB)

In the Form Extractor, make sure you paste the API key that’s available at:

cloud.uipath.com > Admin > License > Robots & Services > Document Understanding > Copy API Key

Paste the API key above

Expected output:

Usha_Jyothi · August 17, 2023, 6:46am

For the address, try this:

Joanne_Chang_LAX · August 17, 2023, 5:29pm

Can this handle a changing number and position of container #s and pickup numbers?

Brian_Mathew_Maben · August 18, 2023, 4:24am

Hey @Joanne_Chang_LAX ,

It depends on the custom area we provide in the ‘Form Extractor’.

The area shaded in grey is the custom area provided, so it will retrieve all the data within that area.

That is why we see ‘Container1 Container2’ in the output below.

I would urge you to take a look at this playlist as well.
