Trying to extract a date from a webpage or PDF


#1

Hello everyone,

I have been using UiPath for a few weeks now, but I am stuck on something I need to do.

I am trying to upload documents through a CMS form and add additional metadata to them.

Most of the data is in an Excel file, but I need to extract the publication date from the document itself. The document is a PDF, and I also have HTML files that are derived from the PDFs.

I am trying to extract the date from each document so I can add it as metadata in the CMS form. I need the last date that appears in the document. The PDFs all differ in size, varying between 5 and 20 pages, and the date is always somewhere near the end, but not in exactly the same place.

The date I need (above the blue line) always comes after "ter openbare zitting van " in each document. (I put a red line underneath it.)
The text is in Dutch, as is the date format. I do not need OCR to get the text; I can copy it.

Does anyone have an idea how I can assign the date to a string variable, so I can eventually paste it into the form I need to upload the document in?

I already tried Anchor Base together with Get Text, but so far without success. If you can help me, I will be very happy!

Thank you in advance!


#2

What about https://www.uipath.com/activities-guide/read-pdf-text?
This reads the text from a PDF into a string. You can then extract the desired string after "ter openbare zitting van ".

pdfText.Substring(pdfText.IndexOf("ter openbare zitting van ")+25, 15) - probably not the most elegant solution, but it should work.
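
For anyone who wants a guarded version of that one-liner, here is a minimal standalone VB.NET sketch. The helper name TextAfter and the sample sentence are made up for illustration; in UiPath the same expressions would sit in Assign activities rather than in a module. It only shows how to avoid reading from the wrong place when the marker is missing or the text is shorter than expected.

```vbnet
Imports System

Module ExtractAfterMarker
    ' Hypothetical helper: returns up to maxLength characters that follow marker,
    ' or an empty string when marker does not occur in source.
    Function TextAfter(source As String, marker As String, maxLength As Integer) As String
        Dim markerIndex As Integer = source.IndexOf(marker)
        If markerIndex < 0 Then
            Return String.Empty ' IndexOf returns -1 when the marker is missing
        End If
        Dim start As Integer = markerIndex + marker.Length
        ' Clamp the length so Substring never reads past the end of the text.
        Dim length As Integer = Math.Min(maxLength, source.Length - start)
        Return source.Substring(start, length)
    End Function

    Sub Main()
        ' Made-up sample text standing in for the output of Read PDF Text.
        Dim pdfText As String = "Aldus gewezen ter openbare zitting van 15 november 2013 door de raad."
        Console.WriteLine(TextAfter(pdfText, "ter openbare zitting van ", 16)) ' prints 15 november 2013
        Console.WriteLine(TextAfter(pdfText, "geen treffer", 16).Length)      ' prints 0
    End Sub
End Module
```

Using marker.Length instead of a hard-coded 25 keeps the offset correct if the marker text ever changes.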


#3

To extract the specific value, find its start index and end index, then pass those indexes to Substring to get the value.
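
A rough standalone VB.NET sketch of the start-index/end-index idea follows. The sample sentence and the closing marker " door " are assumptions for illustration only; the real documents may end the date with something else.

```vbnet
Imports System

Module ExtractBetweenIndexes
    Sub Main()
        ' Made-up sample; the end marker " door " is assumed, not taken from the real PDFs.
        Dim source As String = "ter openbare zitting van 15 november 2013 door de voorzitter"
        Dim startMarker As String = "ter openbare zitting van "
        Dim endMarker As String = " door "

        Dim startIndex As Integer = source.IndexOf(startMarker)
        If startIndex < 0 Then
            Console.WriteLine("Start marker not found")
            Return
        End If
        ' Move past the marker so the value starts right after it.
        startIndex += startMarker.Length

        ' Search for the end marker only from the start position onwards.
        Dim endIndex As Integer = source.IndexOf(endMarker, startIndex)
        If endIndex < 0 Then
            Console.WriteLine("End marker not found")
            Return
        End If

        Console.WriteLine(source.Substring(startIndex, endIndex - startIndex)) ' prints 15 november 2013
    End Sub
End Module
```

With an end marker the length no longer has to be guessed, which helps when day numbers or month names vary in length.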


#4

Thank you! I will try that and I will let you know if it works!


#5

I used your method, and it worked. I used it in a slightly different way, by referring to another sentence, but the method works great.
Thanks!


#6

Actually, it doesn’t work: it starts counting from the top of the PDF instead of after "ter openbare zitting van ". There is only one "ter openbare zitting van " in the PDF, so that’s not the problem. What I need is the date “15 november 2013”, at the end of the PDF.

I only thought it worked because I had one PDF that also had the date right at the top, but most do not.

I attached the pdf.

Flinterbay.pdf (260.8 KB)


#7

It may be related to the text being returned by the Read PDF Text activity. I notice that “ter openbare zitting van” is split over two lines. Are there any line breaks in the text that is returned? Can you share your code?
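
One way to check for hidden line breaks is to make them visible before searching. This is only a standalone VB.NET sketch with a made-up string in place of the real Read PDF Text output:

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module ShowLineBreaks
    Sub Main()
        ' Made-up sample with the phrase split over two lines.
        Dim pdfText As String = "ter openbare" & vbCrLf & "zitting van 15 november 2013"

        ' True when the text contains a carriage return or a line feed.
        Console.WriteLine(pdfText.Contains(vbCr) OrElse pdfText.Contains(vbLf))

        ' Replace the invisible characters with the visible markers \r and \n.
        Console.WriteLine(pdfText.Replace(vbCr, "\r").Replace(vbLf, "\n"))

        ' The one-line phrase is not found, so IndexOf returns -1.
        Console.WriteLine(pdfText.IndexOf("ter openbare zitting van "))
    End Sub
End Module
```

In a workflow, the same Replace expression inside a Write Line activity would show where the breaks fall.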


#8

Yes, I can, although I messed up the code a bit because I am working on it. Thanks for your reply though.

1.4 Date from PDF (Sequence)
Private = False
Variables
Q(IEnumerable<KeyValuePair<Rectangle,String>>)
pdf(String)
PDFtext(String)
Activities
1.27 Read PDF text (ReadPDFText)
FileName = C:\Users\XXX\Documents\SVdownload\Flinterbay.pdf
Range = All
Text = PDFtext
Private = False
1.21 Assign (Assign)
To = Datum
Value = pdfText.Substring(pdfText.IndexOf("ter openbare zitting van ")+25, 15)
Private = False
1.18 Write line (WriteLine)
Text = datum
Private = False
1.13 Assign (Assign)
To = DatumSplit
Value = Datum.Split({" "c})
Private = False
1.8 Assign (Assign)
To = Dag
Value = DatumSplit(0)
Private = False
1.5 Write line (WriteLine)
Text = Dag
Private = False

In fact, this is not the whole code, but only the part in which I try to extract the date. Other parts are filling in forms, which is not so relevant.
The output is
013 VAN HET TUC
013


#9

pdf.Replace(vbCr, "").Replace(vbLf, "").Substring(pdf.Replace(vbCr, "").Replace(vbLf, "").IndexOf("ter openbare zitting van ")+25, 16)

I got the date output from your file using that string
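
The same idea can be written with the cleaned text stored once, which is easier to read and debug. This is a standalone VB.NET sketch; the sample string is made up and assumes, as the real file apparently does, that a space is kept before each line break.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module CleanOnceThenExtract
    Sub Main()
        ' Made-up sample; note the space kept before each line break.
        Dim pdf As String = "ter openbare " & vbCrLf & "zitting van 15 november " & vbCrLf & "2013 door"
        Dim marker As String = "ter openbare zitting van "

        ' Strip the line breaks once and reuse the result.
        Dim cleaned As String = pdf.Replace(vbCr, "").Replace(vbLf, "")

        Dim markerIndex As Integer = cleaned.IndexOf(marker)
        If markerIndex >= 0 Then
            Dim start As Integer = markerIndex + marker.Length
            ' Clamp the length so Substring stays inside the text.
            Dim length As Integer = Math.Min(16, cleaned.Length - start)
            Console.WriteLine(cleaned.Substring(start, length)) ' prints 15 november 2013
        Else
            Console.WriteLine("Marker not found")
        End If
    End Sub
End Module
```

In UiPath this would map to one Assign for the cleaned text and a second Assign for the date.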


#10

I tried it with the PDF you sent and it works! Genius, thanks a lot!
I will try the rest, but I think this is a good solution. What are the replaces doing if I may ask?


#11

They remove the carriage return and line feed characters.

Basically, the Read PDF Text activity returns the text with these breaks included rather than as one continuous string.
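
If a line break replaces the space between two words, removing the breaks glues the words together and the phrase is no longer found. A different technique that tolerates this is a regular expression in which \s+ matches spaces and line breaks alike. This is a standalone VB.NET sketch, not what was used above; the sample string and the date pattern (day, month name, four-digit year) are assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RegexDateAfterMarker
    Sub Main()
        ' Made-up sample: no space before the line break.
        Dim pdfText As String = "ter openbare" & vbCrLf & "zitting van 15 november 2013 door"

        ' \s+ matches any run of whitespace; group 1 captures "day month year".
        Dim pattern As String = "ter\s+openbare\s+zitting\s+van\s+(\d{1,2}\s+\p{L}+\s+\d{4})"
        Dim m As Match = Regex.Match(pdfText, pattern)

        If m.Success Then
            ' Collapse any whitespace inside the captured date to single spaces.
            Console.WriteLine(Regex.Replace(m.Groups(1).Value, "\s+", " ")) ' prints 15 november 2013
        Else
            Console.WriteLine("No date found")
        End If
    End Sub
End Module
```

Because the pattern captures the date itself, no fixed length such as 15 or 16 is needed.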


#12

Oh, that makes sense. So if it cannot find the “ter openbare zitting” sentence, it will start right from the beginning, am I right? Again, thanks a lot, I would never have come up with that!

To add to this: I tried a large sample and it all works!


#13

Yes, it looks for the index of the string. If it can't find it, IndexOf returns -1, so with the +25 the Substring starts at position 24, near the beginning of the text.

Good luck with the rest of your project!


#14

I have tried the same:
pdf.Replace(vbCr, "").Replace(vbLf, "").Substring(pdf.Replace(vbCr, "").Replace(vbLf, "").IndexOf("ter openbare zitting van ")+25, 16)
It shows a compile error. Why am I getting this error? Please help.