Trying to extract a date from a webpage or PDF

Hello everyone,

I have been using UiPath for a few weeks now, but I am stuck on something I need to do.

I am trying to upload documents through a CMS form and add additional metadata to them.

Most of the data is in an Excel file, but I need to extract the date the document was published from the
document itself. The document is a PDF, and I also have HTML files which are derived from the PDFs.

I am trying to extract the date from each document, so I can add it as metadata in the CMS form. I need the last date that appears in the document. Furthermore, the PDFs all differ in length, varying between 5 and 20 pages, and the date is always somewhere near the end, but not in exactly the same place.

The date I need (above the blue line) always comes after "ter openbare zitting van " in each document. (I put a red line underneath it.)
The text is in Dutch, as is the date format. I do not need OCR to get the text; I can copy it.

Does anyone have an idea how I can assign the date to a string variable, so I can eventually paste it into the form I need to upload the document in?

I already tried Anchor Base together with Get Text, but so far without success. If you can help me, I will be very happy!

Thank you in advance!

What about the Read PDF Text activity?
This reads the text from a PDF into a string. You can then pull out the text that follows "ter openbare zitting van ".

pdfText.Substring(pdfText.IndexOf("ter openbare zitting van ")+25, 15) - probably not the most elegant solution, but it should work.

To extract a specific value, find its start index and end index, then pass the start index and the length (end index minus start index) to Substring.
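As a rough VB.NET sketch of that idea: the helper name `TextAfterMarker` and the sample sentence are made up for illustration, and in a workflow the input would be the string returned by Read PDF Text. Using the marker's own length avoids hard-coding the 25.

```vb
Imports System

Module SubstringSketch
    ' Hypothetical helper: returns up to maxLength characters that follow marker.
    ' Assumes the marker is present in text.
    Function TextAfterMarker(text As String, marker As String, maxLength As Integer) As String
        ' Start right after the marker instead of hard-coding its length.
        Dim startIndex As Integer = text.IndexOf(marker) + marker.Length
        ' Never ask for more characters than the string still has.
        Dim length As Integer = Math.Min(maxLength, text.Length - startIndex)
        Return text.Substring(startIndex, length)
    End Function

    Sub Main()
        ' Made-up sample text; a real PDF would be much longer.
        Dim sample As String = "Uitgesproken ter openbare zitting van 15 november 2013."
        Console.WriteLine(TextAfterMarker(sample, "ter openbare zitting van ", 16))
    End Sub
End Module
```

With this sample the 16 characters after the marker are exactly "15 november 2013". A shorter date such as one with a single-digit day would pull in trailing characters, so the fixed length is only a starting point.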

Thank you! I will try that and I will let you know if it works!

I used your method, and it worked. I used it slightly differently, by referring to another sentence, but the method works great.

Actually, it doesn’t work: it starts counting from the top of the PDF instead of after "ter openbare zitting van ". There is only one "ter openbare zitting van " in the PDF, so that’s not the problem. What I need is the date “15 november 2013”, at the end of the PDF.

I only thought it worked because I had one PDF which also had the date right at the top, but most do not have that.

I attached the pdf.

Flinterbay.pdf (260.8 KB)

It may be related to the text being returned by the Read PDF Text activity. I notice that “ter openbare zitting van” is split over two lines. Are there any line breaks in the text that is returned? Can you share your code?

Yes, I can, although I messed up the code a bit because I am working on it. Thanks for your reply though.

1.4 Date from PDF (Sequence)
Private = False
1.27 Read PDF text (ReadPDFText)
FileName = C:\Users\XXX\Documents\SVdownload\Flinterbay.pdf
Range = All
Text = PDFtext
Private = False
1.21 Assign (Assign)
To = Datum
Value = pdfText.Substring(pdfText.IndexOf("ter openbare zitting van ")+25, 15)
Private = False
1.18 Write line (WriteLine)
Text = datum
Private = False
1.13 Assign (Assign)
To = DatumSplit
Value = Datum.Split({" "c})
Private = False
1.8 Assign (Assign)
To = Dag
Value = DatumSplit(0)
Private = False
1.5 Write line (WriteLine)
Text = Dag
Private = False

In fact, this is not the whole code, but only the part in which I try to extract the date. Other parts are filling in forms, which is not so relevant.
The output is

pdf.Replace(vbCr, "").Replace(vbLf, "").Substring(pdf.Replace(vbCr, "").Replace(vbLf, "").IndexOf("ter openbare zitting van ")+25, 16)

I got the date output from your file using that expression.

I tried it on the PDF and it works! Genius, thanks a lot!
I will try the rest, but I think this is a good solution. What are the replaces doing, if I may ask?

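They remove carriage return and line feed characters.

Basically, the Read PDF Text activity returns the text with these breaks included rather than as one continuous string, so a phrase that wraps onto a second line is not matched until the breaks are removed.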
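A small VB.NET sketch of the effect, using made-up text rather than the actual PDF. It assumes a space sits before the line break, so the phrase becomes contiguous once the break is removed, and it uses `StringComparison.Ordinal` (not in the original expression) so the result does not depend on culture settings.

```vb
Imports System
Imports Microsoft.VisualBasic

Module LineBreakSketch
    Sub Main()
        Dim marker As String = "ter openbare zitting van "
        ' Made-up text in which the phrase wraps onto a second line.
        ' Assumption: a space precedes the line break.
        Dim raw As String = "ter openbare " & vbCrLf & "zitting van 15 november 2013"

        ' With the break still in place the marker is not found.
        Console.WriteLine(raw.IndexOf(marker, StringComparison.Ordinal))   ' prints -1

        ' After stripping CR and LF the phrase is contiguous again.
        Dim flat As String = raw.Replace(vbCr, "").Replace(vbLf, "")
        Console.WriteLine(flat.IndexOf(marker, StringComparison.Ordinal))  ' prints 0
    End Sub
End Module
```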

Oh that makes sense. So if it cannot find the “ter openbare zitting” sentence it will start right from the beginning, am I right? Again, thanks a lot, I would never have come up with that!

To add to this: I tried it on a large sample and it all works!

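Nearly. It looks for the index of the string, and if it can't find it, IndexOf returns -1. With the +25 offset the substring then starts at index 24, so you get text from near the beginning.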
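Since `String.IndexOf` returns -1 when the text is not found, it can help to check the index before calling Substring, so a missing phrase does not silently yield text from the top of the document. A minimal VB.NET sketch, where the helper name `DateAfterMarker` and the sample strings are hypothetical:

```vb
Imports System
Imports Microsoft.VisualBasic

Module GuardSketch
    ' Hypothetical helper: returns the text after marker, or Nothing when marker is absent.
    Function DateAfterMarker(pdfText As String, marker As String, maxLength As Integer) As String
        ' Remove line breaks first, as in the expression above.
        Dim flat As String = pdfText.Replace(vbCr, "").Replace(vbLf, "")
        Dim found As Integer = flat.IndexOf(marker, StringComparison.Ordinal)
        If found < 0 Then
            ' Marker missing: report that instead of reading from the wrong place.
            Return Nothing
        End If
        Dim startIndex As Integer = found + marker.Length
        Dim length As Integer = Math.Min(maxLength, flat.Length - startIndex)
        Return flat.Substring(startIndex, length).Trim()
    End Function

    Sub Main()
        Dim marker As String = "ter openbare zitting van "
        ' Made-up inputs: one without the marker, one with it.
        Console.WriteLine(DateAfterMarker("geen datum hier", marker, 16) Is Nothing)
        Console.WriteLine(DateAfterMarker("... ter openbare zitting van 15 november 2013.", marker, 16))
    End Sub
End Module
```

In a workflow, the Nothing result could be tested in an If activity to log or skip documents where the phrase is missing.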

Good luck with the rest of your project!

I have tried the same:
pdf.Replace(vbCr, "").Replace(vbLf, "").Substring(pdf.Replace(vbCr, "").Replace(vbLf, "").IndexOf("ter openbare zitting van ")+25, 16)
It shows a compile error. Why am I getting this error? Please help.