Date extraction from PDF is not working!

Joker · March 31, 2022, 6:03am

I am trying to extract a date from a PDF. The date appears below a question in the document. I applied the anchor base method to extract it, but it is not working!!

Any help would be much appreciated!

Thanks!

supermanPunch · March 31, 2022, 6:05am

Hi @Joker ,

Is there a reason you are using UI Automation for this case?

If the PDF is a digital PDF, we could read the PDF data and perform a Regex match to get the desired result.

If you still want to continue with UI Automation, could you provide screenshots of the selector of the date element in UiExplorer?

Joker · March 31, 2022, 6:14am

Hi,

Regex isn’t one of my strengths, so I thought of taking another approach. Moreover, there are multiple dates in the document, and I am looking for one specific date. Is it possible to extract that using Regex?
Below is the image of the UI Explorer.
This is how it looks when first selected:

[Screenshot of UI Explorer]

It shows the Validate option in red, which means the selector is not working as expected.

So I also tried opening UI Explorer, re-indicating the element, and saving it.
But then I got the error below:
Get Text: The specified combination of selector, filter and scope is not supported.

ushu · March 31, 2022, 6:25am

@Joker UI automation will not be reliable for PDFs. It is possible to apply regex if the PDF can be read as text.

Please check the workflow below on how to convert a PDF to text. Share the output in Notepad, then we will help you apply regex.

PDF.zip (2.0 KB)

Note: For this you need to install the UiPath.PDF.Activities package.


supermanPunch · March 31, 2022, 6:39am

@Joker , it does seem that selectors cannot be guaranteed for PDF data.

However, whenever background automation is possible instead of foreground/UI automation, we should go with background automation.

In this case, that means reading the PDF data as text and performing String/Regex operations to get the desired data.

If you could provide us with the PDF data in text format, we would be able to suggest a Regex pattern to extract the date.

Based on a first look at the input data, we can assume the following Regex patterns:
1. If the date is on the line below the question:

(?<=When did it happen\?)\n.*

Expression to get the match value:

System.Text.RegularExpressions.Regex.Match(InputString,"(?<=When did it happen\?)\n.*",RegexOptions.IgnoreCase).Value.ToString.Trim

2. If the date is on the same line as the question:

(?<=When did it happen\?).*

Expression to get the match value:

System.Text.RegularExpressions.Regex.Match(InputString,"(?<=When did it happen\?).*",RegexOptions.IgnoreCase).Value.ToString.Trim

Let us know if either expression doesn’t work, and provide us with the text data so we can give you an accurate regex pattern.
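To see how the two patterns differ, here is a minimal Python sketch run against made-up text. The sample text, the dates in it, and the helper name match_value are assumptions for illustration only, since the real PDF text was never shared. The lookbehind syntax is the same as in the .NET expressions above, so the behaviour should carry over, but treat this as a sketch rather than a tested UiPath workflow. It also shows how anchoring on the question text picks one date out of several.

```python
import re

# Hypothetical text standing in for the output of Read PDF Text.
# The real document was never shared in the thread.
SAMPLE = (
    "Report created: 01/02/2022\n"
    "When did it happen?\n"
    "15/03/2022\n"
    "Date reviewed: 20/03/2022\n"
)

# Pattern 1: the answer is on the line after the question.
NEXT_LINE = r"(?<=When did it happen\?)\n.*"
# Pattern 2: the answer is on the same line as the question.
SAME_LINE = r"(?<=When did it happen\?).*"


def match_value(pattern: str, text: str) -> str:
    """Return the trimmed first match, or an empty string if nothing matches."""
    m = re.search(pattern, text, flags=re.IGNORECASE)
    return m.group(0).strip() if m else ""


print(repr(match_value(NEXT_LINE, SAMPLE)))  # '15/03/2022'
print(repr(match_value(SAME_LINE, SAMPLE)))  # ''
```

In this sample the date is on the next line, so only pattern 1 returns it. Pattern 2 still matches, but "." does not match a line break, so ".*" stops at the end of the question line and the result is empty. The other two dates are ignored because neither follows the question text.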

Joker · March 31, 2022, 6:55am

Hi, thank you for the suggestion. I have read the PDF into text, but I hit a hard stop when it came to Regex. However, I think I will try it now with Regex.

Joker · March 31, 2022, 6:59am

That is informative and gave me great insight. So basically, for accurate PDF extraction we should go with Regex. Is there any site that can help build the correct regex so that it can simply be applied in UiPath?

Thanks a lot for the Regex expression. I will surely go through a Regex course to get a complete understanding.

I appreciate your help a lot.

Let me try and if there’s any issue with it I will surely ask for your help.

Cheers!

supermanPunch · March 31, 2022, 7:01am

@Joker

Very informative posts have already been provided in the forum, like the one below:

The extraction of data depends on the input data, so the pattern differs from one document to another.

Joker · March 31, 2022, 10:02am

I tried the (?<=When did it happen\?).* expression. It prints an empty value in the console.

supermanPunch · March 31, 2022, 10:04am

@Joker , we would need to inspect your PDF text data. If you could provide us the text data, we can give you the correct regex pattern.
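Without the actual text the cause can only be guessed, but one common reason for an empty result is the line break after the question. If the extracted text uses Windows line endings (\r\n), then "\n.*" cannot match directly after the question mark, and ".*" matches only the "\r", which disappears after trimming. The Python sketch below reproduces that with made-up text and tries a more tolerant pattern. The TOLERANT pattern, the helper first_match, and the sample strings are assumptions and not something posted in this thread.

```python
import re

# Patterns suggested earlier in the thread.
NEXT_LINE = r"(?<=When did it happen\?)\n.*"
SAME_LINE = r"(?<=When did it happen\?).*"

# Assumed alternative, not from the thread: skip any whitespace after the
# question (spaces, \r, \n, blank lines), then capture the rest of that line.
TOLERANT = r"When did it happen\?\s*(\S[^\r\n]*)"


def first_match(pattern: str, text: str) -> str:
    """Return group 1 if the pattern has one, else the whole match, trimmed."""
    m = re.search(pattern, text, flags=re.IGNORECASE)
    if m is None:
        return ""
    return (m.group(1) if m.re.groups else m.group(0)).strip()


# Hypothetical extracted text with Windows line endings (\r\n).
windows_text = "When did it happen?\r\n15/03/2022\r\nWhere did it happen?\r\n"

print(repr(first_match(NEXT_LINE, windows_text)))  # ''
print(repr(first_match(SAME_LINE, windows_text)))  # ''
print(repr(first_match(TOLERANT, windows_text)))   # '15/03/2022'
```

In UiPath the same idea would be to read the capture group instead of the whole match, that is, .Groups(1).Value on the Match object rather than .Value. One caveat of the tolerant pattern: if the answer is left blank in the PDF, it will capture whatever text comes next, so the result may still need a date-format check.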

shreyash_shirbhate · March 31, 2022, 10:09am

@Joker If you are not able to share the PDF text data, you can go through the video below to find a solution.

Joker · March 31, 2022, 10:14am

@supermanPunch @shreyash_shirbhate
I would have provided it without a doubt; however, it contains sensitive data, which prevents me from sharing it. The PDF is behaving weirdly: some text is extracted and some is not. Not sure why?

shreyash_shirbhate · March 31, 2022, 10:15am

It is a digital PDF, right? @Joker

Joker · March 31, 2022, 10:18am

Not sure, how do I check it?

Angel_Llull · March 31, 2022, 10:18am

Hello @Joker,

You should try using regex or Data Scraping

shreyash_shirbhate · March 31, 2022, 10:18am

It is not handwritten, right? @Joker Which means it is digital.

Joker · March 31, 2022, 10:19am

Ohh! No, it is not handwritten. It is digital.

shreyash_shirbhate · March 31, 2022, 10:20am

@Joker

Can you please try the tutorial below?

Joker · March 31, 2022, 10:20am

Already tried regex. Will data scraping resolve that issue?

supermanPunch · March 31, 2022, 10:22am

@Joker If the data was not extracted using Read PDF Text, it may be that the PDF is a scanned PDF, meaning its pages are images.

We need you to confirm which PDF types are present, as we cannot proceed with the normal PDF activities if that is the case.
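A rough way to check this yourself without sharing the file is to look at how much readable text the extraction step actually returned. The Python sketch below is only a heuristic under assumptions: the function name looks_like_scanned_pdf and the threshold of 20 characters are made up for illustration, and the input is whatever string Read PDF Text produced.

```python
def looks_like_scanned_pdf(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: very few letters or digits in the extracted text suggests
    the pages are images (scanned) rather than a digital text layer.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    readable = sum(ch.isalnum() for ch in extracted_text)
    return readable < min_chars


# Hypothetical extraction results.
print(looks_like_scanned_pdf(""))             # True
print(looks_like_scanned_pdf(" \n\x0c\n  "))  # True
print(looks_like_scanned_pdf("When did it happen?\n15/03/2022\nWhere did it happen?"))  # False
```

A PDF that mixes digital text with scanned regions, which could explain why only some of the text shows up, would pass this check and still be missing data. In that situation an OCR-based activity such as Read PDF With OCR is the usual alternative to Read PDF Text.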
