Trying to extract date from a webpage or pdf

trabart · June 22, 2017, 2:33pm

Hello everyone,

I have been using UiPath for a few weeks now, but I am stuck on something I need to do.

I am trying to upload documents in a CMS form and add additional metadata.

Most of the data is in an Excel file, but I need to extract the date the document was published from the
document itself. The document is a PDF, and I also have HTML files which are derived from the PDFs.

I am trying to extract the date from each document, so I can add it as metadata to the CMS form. I need the last date that is in the document. Furthermore, the PDFs all differ in length, varying between 5 and 20 pages, and the date is always somewhere at the end, but not exactly in the same place.

The date (above the blue line) I need always comes after "ter openbare zitting van ", in each document. (I put a red line underneath it)
The text is in Dutch, as is the format of the date. I do not need OCR to get the text; I can copy it.

Does anyone have an idea how I can assign the date to a string variable, so I can eventually paste it into the form I need to upload the document in?

I already tried anchor bases together with “get text”, but so far without success. If you can help me, I will be very happy!

Thank you in advance!

Phiggins · June 22, 2017, 3:17pm

what about - https://www.uipath.com/activities-guide/read-pdf-text
This reads the text from a pdf into a string. You can then strip out the desired string after "ter openbare zitting van ".

pdfText.Substring(pdfText.IndexOf("ter openbare zitting van ")+25, 15) - probably not the most elegant solution, but it should work
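
For anyone who wants to check the index arithmetic outside UiPath, here is a minimal Python sketch of the same idea. It is not the UiPath expression itself: Python's `str.find` stands in for `IndexOf`, the function name `text_after_marker` is hypothetical, and the sample sentence is made up for illustration rather than taken from the actual PDF.

```python
MARKER = "ter openbare zitting van "  # 25 characters, including the trailing space


def text_after_marker(text: str, length: int = 15) -> str:
    """Return `length` characters that follow the first occurrence of MARKER.

    Like the expression in the post, this does not handle a missing marker.
    """
    start = text.find(MARKER)       # position where the marker begins
    begin = start + len(MARKER)     # first character after the marker
    return text[begin:begin + length]


# Made-up sample text (an assumption, not the real document).
sample = "Uitgesproken ter openbare zitting van 15 november 2013 door de voorzitter."
print(text_after_marker(sample))      # 15 characters, as in the post
print(text_after_marker(sample, 16))  # one more character to fit a four-digit year
```

Using `len(MARKER)` instead of a hard-coded 25 keeps the offset correct if the marker text is ever changed.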

ddpadil · June 22, 2017, 3:32pm

To extract the specific value, you need to find its start index and end index, then pass these indexes to Substring to get the value.
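
A rough Python sketch of the start-index/end-index idea, in case it helps. It assumes the date ends with a four-digit year, which is my assumption and not something stated in the thread; the function name `slice_date` and the sample sentences are also made up.

```python
import re

MARKER = "ter openbare zitting van "
YEAR = re.compile(r"\d{4}")  # assumption: the date ends with a four-digit year


def slice_date(text: str) -> str:
    """Cut out the text between the end of MARKER and the end of the next year."""
    marker_pos = text.find(MARKER)
    if marker_pos == -1:
        raise ValueError("marker not found")
    start = marker_pos + len(MARKER)   # start index: first character of the date
    year = YEAR.search(text, start)    # look for the year only after the marker
    if year is None:
        raise ValueError("no four-digit year after marker")
    return text[start:year.end()]      # end index: just past the year


# Made-up sample sentences with dates of different lengths.
print(slice_date("Gedaan ter openbare zitting van 1 mei 2014, in aanwezigheid van de griffier."))
print(slice_date("Uitgesproken ter openbare zitting van 15 november 2013."))
```

Computing the end index this way avoids a fixed length, so short and long dates are both cut cleanly.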

trabart · June 22, 2017, 3:37pm

Thank you! I will try that and I will let you know if it works!

trabart · June 27, 2017, 7:16am

I used your method, and it worked. I used it slightly differently, by referring to another sentence, but the method works great.
Thanks!

trabart · June 28, 2017, 2:16pm

Actually, it doesn’t work: it starts counting from the top of the PDF instead of after "ter openbare zitting van ". There is only one "ter openbare zitting van " in the PDF, so that’s not the problem. What I need is the date: “15 november 2013”, at the end of the PDF.

I only thought it worked because I had one pdf which had the date also right on top, but most do not have that.

I attached the pdf.

Flinterbay.pdf (260.8 KB)

Phiggins · June 28, 2017, 2:33pm

It may be related to the text being returned by the Read PDF activity. I notice that “ter openbare zitting van” is split over two lines. Are there any line breaks in the text that is returned? Can you share your code?

trabart · June 28, 2017, 2:44pm

Yes, I can, although I messed up the code a bit because I am working on it. Thanks for your reply though.

1.4 Date from PDF (Sequence)
Private = False
Variables
Q(IEnumerable<KeyValuePair<Rectangle,String>>)
pdf(String)
PDFtext(String)
Activities
1.27 Read PDF text (ReadPDFText)
FileName = C:\Users\XXX\Documents\SVdownload\Flinterbay.pdf
Range = All
Text = PDFtext
Private = False
1.21 Assign (Assign)
To = Datum
Value = pdfText.Substring(pdfText.IndexOf("ter openbare zitting van ")+25, 15)
Private = False
1.18 Write line (WriteLine)
Text = datum
Private = False
1.13 Assign (Assign)
To = DatumSplit
Value = Datum.Split({" "c})
Private = False
1.8 Assign (Assign)
To = Dag
Value = DatumSplit(0)
Private = False
1.5 Write line (WriteLine)
Text = Dag
Private = False

In fact, this is not the whole code, but only the part in which I try to extract the date. Other parts are filling in forms, which is not so relevant.
The output is
013 VAN HET TUC
013

Phiggins · June 28, 2017, 2:53pm

pdf.Replace(vbCr, "").Replace(vbLf, "").Substring(pdf.Replace(vbCr, "").Replace(vbLf, "").IndexOf("ter openbare zitting van ")+25, 16)

I got the date output from your file using that string
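
To show why stripping the line breaks matters, here is a small Python sketch of the same approach (again not UiPath code). The sample string is invented and assumes, as seems to be the case for this PDF, that a space is left before each line break; without that space the words would be glued together and the marker would still not match. The helper names are hypothetical.

```python
MARKER = "ter openbare zitting van "


def strip_line_breaks(text: str) -> str:
    """Remove carriage returns and line feeds, like Replace(vbCr, "").Replace(vbLf, "")."""
    return text.replace("\r", "").replace("\n", "")


def date_after_marker(text: str, length: int = 16) -> str:
    cleaned = strip_line_breaks(text)   # clean once, then reuse the result
    begin = cleaned.find(MARKER) + len(MARKER)
    return cleaned[begin:begin + length]


# Invented sample: the marker is split over two lines, as in the PDF text.
raw = "Aldus vastgesteld ter openbare \r\nzitting van 15 november \r\n2013."
print(raw.find(MARKER))        # the marker is not found in the raw text
print(date_after_marker(raw))  # found once the line breaks are removed
```

Storing the cleaned text in a variable first (a separate Assign in UiPath) would avoid repeating the two Replace calls inside one expression.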

trabart · June 28, 2017, 2:58pm

I tried it with the pdf you sent and it works! Genius, thanks a lot!
I will try the rest, but I think this is a good solution. What are the replaces doing, if I may ask?

Phiggins · June 28, 2017, 3:01pm

removing carriage return and line feed characters.

Basically the read PDF function reads the PDF with these breaks included rather than one continuous string.

trabart · June 28, 2017, 3:14pm

Oh that makes sense. So if it cannot find the “ter openbare zitting” sentence it will start right from the beginning, am I right? Again, thanks a lot, I would never have come up with that!

To add to this: I tried a large sample and it all works!

Phiggins · June 28, 2017, 3:32pm

Yes, it looks for the index of the string. If it can't find it, IndexOf returns -1, so the substring starts at index 24 (-1 + 25), close to the start of the text.
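
Since a missing marker does not raise an error here and just yields text from near the top of the document, it can help to check the search result before cutting the substring. A minimal Python sketch of that guard (`str.find` returns -1 when the marker is absent, as .NET's `IndexOf` does; the function name and sample strings are made up):

```python
from typing import Optional

MARKER = "ter openbare zitting van "


def find_date_text(text: str, length: int = 16) -> Optional[str]:
    """Return the text after MARKER, or None when the marker is not present."""
    start = text.find(MARKER)
    if start == -1:          # marker absent: do not fall back to the top of the text
        return None
    begin = start + len(MARKER)
    return text[begin:begin + length]


print(find_date_text("Geen datum in deze tekst."))
print(find_date_text("ter openbare zitting van 15 november 2013."))
```

In a workflow, the equivalent would be an If on the IndexOf result before the Assign, so a document without the sentence can be logged instead of producing a wrong date.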

Good luck with the rest of your project!

anfinrozario · August 2, 2018, 4:28pm

I have tried the same:
pdf.Replace(vbCr, "").Replace(vbLf, "").Substring(pdf.Replace(vbCr, "").Replace(vbLf, "").IndexOf("ter openbare zitting van ")+25, 16)
It shows a compile error. Why am I getting this error? Please help.
