I am using UiPath for a few weeks now, but I got stuck at a thing I need to do.
I am trying to upload documents in a CMS form, and add to additional metadata.
Most of the data is in an Excel file, but I need to extract the date the document is published from the
document itself. The document is a pdf, and I also have html files which are derived from the pdf’s.
I am trying to extract the date from the documents, so I can add them as metadata to the CMS form. I need the last date that is in the document. Furthermore the pdf’s are all different in size, varying between 5 and 20 pages, and the date is always somewhere at the end, but not exactly at the same place.
The date (above the blue line) I need always comes after "ter openbare zitting van ", in each document. (I put a red line underneath it)
The text is in Dutch, as well the format for the date. I do not need OCR to get the text, I can copy it.
Is there anyone who has an idea how I can assign the date to a string variable, so I can eventually paste it into the form in need to upload the document in?
I already tried anchor bases together with “get text”, but so far unsuccesfull. If you can help me, I will be very happy!
To extract the specific value you need to find the start index and end index of the value and pass these index and get the specific value by using substring
Actually, it doesn’t work: it starts counting from the top of the pdf instead, instead of after "ter openbare zitting van "). There is only one "ter openbare zitting van " in the pdf, so that’s not the problem. What I need is the date: “15 november 2013”, at the end of the pdf.
I only thought it worked because I had one pdf which had the date also right on top, but most do not have that.
It may be related to the text being returned the read pdf activity. I notice that “ter openbare zitting van” is split over two lines. Are there any line breaks in the text that is returned. Can you share your code?
Yes, I can, although I messed up the code a bit because I am working on it. Thanks for your reply though.
1.4 Date from PDF (Sequence)
Private = False
Variables
Q(IEnumerable<KeyValuePair<Rectangle,String>>)
pdf(String)
PDFtext(String)
Activities
1.27 Read PDF text (ReadPDFText)
FileName = C:\Users\XXX\Documents\SVdownload\Flinterbay.pdf
Range = All
Text = PDFtext
Private = False
1.21 Assign (Assign)
To = Datum
Value = pdfText.Substring(pdfText.IndexOf("ter openbare zitting van “)+25, 15)
Private = False
1.18 Write line (WriteLine)
Text = datum
Private = False
1.13 Assign (Assign)
To = DatumSplit
Value = Datum.Split({” "c})
Private = False
1.8 Assign (Assign)
To = Dag
Value = DatumSplit(0)
Private = False
1.5 Write line (WriteLine)
Text = Dag
Private = False
In fact, this is not the whole code, but only the part in which I try to extract the date. Other parts are filling in forms, which is not so relevant.
The output is
013 VAN HET TUC
013
I tried it with the pdf you send and in works! Genius, thanks a lot!
I will try the rest, but I think this is a good solution. What are the replaces doing if I may ask?
Oh that makes sense. So if it cannot find the “ter openbare zitting” sentence it will start right from the beginning, am I right? Again, thanks a lot, I would never have come up with that!
To add on this: I tried a large sample and it all works!
I have tried the same…
pdf.Replace(vbCr, “”).Replace(vbLf, “”).Substring(pdf.Replace(vbCr, “”).Replace(vbLf, “”).IndexOf("ter openbare zitting van ")+25, 16)
It showing compile error.Why i am getting this error.Please help