13.RPA Challenge - PDF Scrapping


#1

Given the attached sample dummy PDF (RPA dummy sample form .pdf, 231.0 KB), the output should be:

Unit: 423
1: Verify
2: Start
3: Stop
4: n/a
5: Start

Basically, I need to extract which squares are checked on each line.


#2

Few things:

  1. I used image automation since I was unable to do it element-based and I didn’t want to use Get Text.
  2. It is not an efficient solution.
  3. You need to adjust your PDF and the 2nd click’s offset (resolution-specific) so the robot can see the checkboxes.
  4. I coded it only for the Verify checkbox, for your understanding.
  5. Others might have a better solution, so use this as an alternative.

RPA13.zip (1003.0 KB)


#3

Hi!

I’ve used vvaidya’s idea and generalized it so that we obtain the desired output. To output the values in order, I’ve changed the PDF’s content a little. Challenge13_PDFScrapping.zip (198.6 KB)


#5

All this is staged so others can take up the challenge(s) seriously? :wink::wink:


#6

Apart from seasoned guys like @badita @andrzej.kniola @richarddenton @aksh1yadav @acaciomelo, who have done and are still doing a tremendous job bringing this forum to the next level, @ddpadil @ClaytonM @Florent_Salendres @sfranzen are doing really great work helping people out. I hope others get inspired as well and contribute their knowledge to the forum.

Just thought of appreciating you guys for your work. Keep up the great work! Keep rocking!


#7

I realized that ‘my solution’ wasn’t actually a valid one. It was just a poor adaptation of vvaidya’s method. I hope that someone will really solve this.


#8

Seems my contribution is not being noticed, haha, just kidding… Thanks to all who contributed at any level.


#9

Who is this @beesheep guy? Anyone heard of him?! haha.

I just ask the annoying questions and line them up for you guys who actually know what you’re talking about :smiley:


#10

It did not go unnoticed. It’s just based on quarterly results (I see the above guys in every other thread).

https://forum.uipath.com/u?period=quarterly


#11

Hello,

First off, part of the challenge is to understand whether the PDF is standard or not, so here are a few warnings before executing:

  1. If you can get the unit number from elsewhere, use that one.
  2. I had to save the PDF as a PDF again (weird, right?) in order not to lose the data that let me determine the checkbox status.
  3. It has to be opened in Chrome, because I have no PDF client on my PC… lol
  4. You may need to re-scrape the area.

How did I come to this conclusion? Here is how.

I screen-scraped the PDF in Chrome, and this is the result:

El START I] STOP VERlFY
START [3 STOP :1 VERlFY
I] START STOP C] VERlFY
[j START [1 STOP VERlFY
START I: STOP [1 VERlFY
[I START [:1 STOP C] VERlFY
Cl START I: STOP I: VERlFY
START III STOP CI VERlFY
El START I] STOP :1 VERlFY
El START [3 STOP El VERlFY

If you take a closer look, every checked checkbox shows no weird string. In other words, of the possible options (start, stop, verify), the one with no weird string preceding it is the one that is checked. Please compare the list with the actual dummy PDF @badita uploaded.

From there it was easy: split the entire text into lines, then split each line on " " and check whether the resulting array has 5 or 6 elements. A length of 6 means none of the boxes was selected; otherwise one was selected.

If start was selected, no string precedes the actual word, so “start” sits at index 0 of the split array. If “stop” was selected, it sits at index 2 (because a weird string is 0, start is 1 and stop is 2). And so on: a selected “verify” sits at index 4.
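The token-position check described above can be sketched in Python. This is only an illustration of the logic, not the UiPath workflow from the attached zip; the function names `normalize`, `checked_option` and `parse_scrape` are made up for this example, and it assumes the scrape always yields either 5 or 6 space-separated tokens per row, as in the sample above.

```python
def normalize(token):
    # The scrape reads "VERIFY" as "VERlFY" (lowercase L), so repair that
    # after upper-casing. Box glyphs such as "I]" or "[3" pass through as-is.
    return token.upper().replace("VERLFY", "VERIFY")


def checked_option(line):
    """Return 'Start', 'Stop', 'Verify' or 'n/a' for one scraped row."""
    tokens = [normalize(t) for t in line.split()]

    # Six tokens: every option word is preceded by a box glyph, so nothing is checked.
    if len(tokens) == 6:
        return "n/a"
    if len(tokens) != 5:
        raise ValueError(f"unexpected token count in row: {line!r}")

    # Five tokens: exactly one glyph is missing. The checked word has moved one
    # position to the left: START to index 0, STOP to index 2, VERIFY to index 4.
    # The order of the checks matters, because unchecked words to the right of
    # the checked one also shift left.
    for name, index in (("START", 0), ("STOP", 2), ("VERIFY", 4)):
        if tokens[index] == name:
            return name.capitalize()
    raise ValueError(f"could not locate the checked option in row: {line!r}")


def parse_scrape(text):
    # One result per non-empty scraped row, in row order.
    return [checked_option(row) for row in text.splitlines() if row.strip()]


# Four rows copied from the screen-scrape result above.
SAMPLE = """El START I] STOP VERlFY
START [3 STOP :1 VERlFY
I] START STOP C] VERlFY
[I START [:1 STOP C] VERlFY"""

print(parse_scrape(SAMPLE))  # ['Verify', 'Start', 'Stop', 'n/a']
```

Raising an error on any other token count is a deliberate choice here: if the OCR splits or merges a glyph differently, a loud failure is probably safer than a silently wrong answer.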

As for the rest (building a data table, adding rows, the append activity), the reader can find information here in the forum.

this is the result:
image

hope this helps,
13.RPA Challenge - PDF Scrapping.zip (1.0 MB)


#12

@ddpadil - you should also give it a try. I hope you have something? :slight_smile:

And @certified - please at least try to attempt things. Don’t worry, just put yourself forward. Maybe your logic will be the best for the scenarios presented :slight_smile: .

Hoping to see replies from all of you from now on, wherever you think you can help or advise something better :slight_smile:

Regards…!!
Aksh


#13

Good Idea @aksh1yadav, thank you for encouraging all of us…


#14

Here we go, @aksh1yadav is back, dragging me in as always. :stuck_out_tongue_winking_eye:
I scrolled down to the end of the page but couldn’t see your solution file. :joy::joy:


#15

It is not about dragging, pal. If you think so, I will not do it again… sorry…!!

I just wanted to get the best out of you, but OK, I won’t :wink:

Because I am a Rookie :slight_smile:


#16

Really… :stuck_out_tongue:
Come on man, we’re like best pals in the forum.
Just messing with you.
Tag me in any post, np.
You’re one of the reasons why I’m here.
I learned a lot from you, man. :slight_smile:
Thanks for sharing knowledge :pray:

I said this before and I’ll say it again: @aksh1yadav, you’re a legend. :+1:
#FallOfFame.
Maybe we should cut to the chase and let others discuss the post,
else we’ll be flagged by ourselves :smile:


#17

@badita Perhaps the moderators should have a Bromance flag?! :smiley: :blush:


#18

LOL :joy::joy:@aksh1yadav


#19

hahah … :thinking:

Well rich then i’ll use it for you as well :stuck_out_tongue: :wink:


#20

my first comment


#21

I tried scraping the entire page in Internet Explorer. (I know that beesheep did it in Chrome, but I wanted to try to make it a bit more robust for me, as I’ve had so many issues with OCR scraping.) That was just useless: so much data was not scraped, especially the bits you actually need. I thought it would be better to find an image in each row of data and then search for the tick boxes in an area of that row, but I’ve had some issues with that approach.

(I put it in rookies as this topic was last updated ages ago and I think I need some help at this point)