13.RPA Challenge - PDF Scrapping

badita · August 25, 2017, 11:10am

Given the attached sample dummy PDF, the output should look like the list below.
RPA dummy sample form .pdf (231.0 KB)

Unit: 423
1: Verify
2: Start
3: Stop
4: n/a
5: Start
…

Basically, I need to extract which squares are checked on each line.

vvaidya · August 25, 2017, 1:51pm

A few things:

I used image automation, since I was unable to do it element-based and I didn’t want to use Get Text.
It is not an efficient solution.
You need to adjust your PDF and the second click offset (resolution-specific) so the robot can see the checkboxes.
I coded it only for the Verify checkbox, for your understanding.
Others might have a better solution, so use this as an alternative.

RPA13.zip (1003.0 KB)

PMCosmin · August 30, 2017, 8:29am

Hi!

I’ve used vvaidya’s idea and generalized it so that we obtain the desired output. In order to output the values in order, I’ve changed the PDF’s content a little. Challenge13_PDFScrapping.zip (198.6 KB)

vvaidya · August 30, 2017, 2:07pm

All this is staged so others can take up the challenge(s) seriously?

vvaidya · August 30, 2017, 2:17pm

Apart from seasoned guys like @badita @andrzej.kniola @richarddenton @aksh1yadav @acaciomelo, who have done and are doing a tremendous job of bringing this forum to the next level, @ddpadil @ClaytonM @Florent_Salendres @sfranzen are doing really great work helping people out. I hope others get inspired as well and contribute their knowledge to the forum.

Just thought of appreciating you guys for your work. Keep up the great work! Keep rocking!

PMCosmin · August 30, 2017, 6:21pm

I realized that ‘my solution’ wasn’t actually a valid one. It was just a poor adaptation of vvaidya’s method. I hope that someone will really solve this.

beesheep · August 30, 2017, 8:57pm

Seems my contribution is not being noticed, haha, just kidding… Thanks to all who contributed at any level.

richarddenton · August 30, 2017, 10:33pm

Who is this @beesheep guy? Anyone heard of him?! haha.

I just ask the annoying questions and line them up for you guys who actually know what you’re talking about

vvaidya · August 31, 2017, 12:37am

It did not go unnoticed. It’s just based on quarterly results (I see the above guys in every other thread).

https://forum.uipath.com/u?period=quarterly

beesheep · August 31, 2017, 1:08am

Hello,

First off, part of the challenge is understanding whether the PDF is standard or not, so here are a few warnings before executing:

1. If you can get the unit number from elsewhere, use that one.
2. I had to re-save the PDF as a PDF (weird, right?) so as not to lose the data that let me determine the checkbox status.
3. It has to be opened in Chrome, because I have no PDF client on my PC… lol
4. You may need to re-scrape the area.

How did I come to this conclusion? Well, this is how.

I screen-scraped the PDF in Chrome, and this is the result:

El START I] STOP VERlFY
START [3 STOP :1 VERlFY
I] START STOP C] VERlFY
[j START [1 STOP VERlFY
START I: STOP [1 VERlFY
[I START [:1 STOP C] VERlFY
Cl START I: STOP I: VERlFY
START III STOP CI VERlFY
El START I] STOP :1 VERlFY
El START [3 STOP El VERlFY

If you take a closer look, a marked checkbox does not produce any weird string. In other words, of the possible options (START, STOP, VERIFY), the one that is not preceded by a weird string is the one that is checked. Please compare the list with the actual dummy PDF @badita uploaded.

From there it was easy: split the entire text into lines, split each line on " ", and check whether the resulting array has 5 or 6 elements. A length of 6 means none of the boxes was selected; otherwise one was selected.

So if START was selected, no string comes before the word, and START sits at index 0 of the split array. If STOP was selected, STOP sits at index 2 (a weird string is at 0, START is at 1 and STOP is at 2), so the checked option has to be STOP. And so on.
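For anyone who wants to try the rule outside of Studio, here is a minimal sketch in Python (not the UiPath workflow from the attachment). It assumes the scraped text looks exactly like the sample in this post: each line holds the three labels, and a ticked box leaves no glyph in front of its label. The names `LABELS`, `SCRAPED`, `checked_option` and `summarize` are made up for illustration. Instead of counting 5 versus 6 elements, it looks for a label that either opens the line or directly follows another label, which amounts to the same thing on this sample.

```python
# Sketch only: names are illustrative, not taken from the attached workflow.

# Option labels exactly as they appear in the scraped text.
# "VERlFY" (with a lowercase L) is how the scrape rendered VERIFY.
LABELS = {"START": "Start", "STOP": "Stop", "VERlFY": "Verify", "VERIFY": "Verify"}

# The screen-scrape result quoted in this post, one form row per line.
SCRAPED = """\
El START I] STOP VERlFY
START [3 STOP :1 VERlFY
I] START STOP C] VERlFY
[j START [1 STOP VERlFY
START I: STOP [1 VERlFY
[I START [:1 STOP C] VERlFY
Cl START I: STOP I: VERlFY
START III STOP CI VERlFY
El START I] STOP :1 VERlFY
El START [3 STOP El VERlFY
"""


def checked_option(line):
    """Return the option whose label has no box glyph in front of it, else 'n/a'."""
    tokens = line.split()
    for i, token in enumerate(tokens):
        # A label counts as checked when it opens the line or directly
        # follows another label, i.e. the scrape emitted nothing for its box.
        if token in LABELS and (i == 0 or tokens[i - 1] in LABELS):
            return LABELS[token]
    # Every label had a glyph in front of it (the 6-element case).
    return "n/a"


def summarize(text):
    """Number the non-empty lines and report the checked option for each."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [f"{n}: {checked_option(ln)}" for n, ln in enumerate(lines, start=1)]


if __name__ == "__main__":
    print("\n".join(summarize(SCRAPED)))
```

Keep in mind that this depends entirely on the scrape emitting nothing for a ticked box and exactly one glyph for an empty one. A row where the scrape drops or splits a glyph would be misread, so it is worth comparing the result with the PDF by eye.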

As for the rest (building a data table, adding rows, the append activity), the reader can find information here in the forum.

this is the result:

hope this helps,
13.RPA Challenge - PDF Scrapping.zip (1.0 MB)

aksh1yadav · September 1, 2017, 1:50pm

@ddpadil - you should also give it a try. I hope you have something?

And @certified - please at least try to attempt things. Don’t worry, just put yourself forward. Maybe your logic will be the best for dealing with the presented scenarios.

I’m hoping for replies from all of you from now on, wherever you think you can help or advise something better.

Regards…!!
Aksh

beesheep · September 1, 2017, 2:15pm

Good Idea @aksh1yadav, thank you for encouraging all of us…

ddpadil · September 1, 2017, 2:19pm

Here we go, @aksh1yadav is back, dragging me in as always.
I scrolled down to the end of the page but couldn’t see your solution file.

aksh1yadav · September 1, 2017, 2:21pm

It is not about dragging, pal. If you think so, I will not do it again… sorry…!!

I just wanted to get the best out of you, but OK, I won’t do it.

Because I am a Rookie

ddpadil · September 1, 2017, 2:46pm

Really…
Come on, man, we’re like best pals in the forum.
Just messing with you.
Tag me in any post, np.
You’re one of the reasons why I’m here.
I learned a lot from you, man.
Thanks for sharing your knowledge.

I said this before and I’ll say it again, @aksh1yadav: you’re a legend.
#FallOfFame.
Maybe we should cut to the chase and let others discuss the post,
or else we’ll end up flagging ourselves.

richarddenton · September 6, 2017, 10:39am

@badita Perhaps the moderators should have a Bromance flag?!

ddpadil · September 6, 2017, 10:43am

LOL @aksh1yadav

aksh1yadav · September 6, 2017, 4:52pm

hahah …

Well, Rich, then I’ll use it for you as well.

jmy · April 25, 2018, 6:45am

my first comment

charliefik · August 2, 2018, 9:07am

I tried scraping the entire page (in Internet Explorer; I know that beesheep did it in Chrome, but I wanted to try to make it a bit more robust for me, as I’ve had so many issues with OCR scraping), and that was just useless: so much data was not scraped, especially the bits you actually need. I thought it would be better to find an image in each row of data and then search for the tick boxes in an area within that row, but I’ve had some issues with that approach.

(I put it in rookies as this topic was last updated ages ago and I think I need some help at this point)
