Scrape unstructured data from PDF


#1

Hi guys, I am stuck with something. I have a PDF with numerous pages, and the data on each page may vary from page to page. Each piece of data has a heading, and the headings are the only fixed thing in the PDF. Their positions keep varying, so position-based scraping may work for some pages but not for all of them. I am looking for a general solution to scrape data when the position keeps varying.

Let's say I need to scrape the data between two points, A and B. The size of the data may vary; it may be a single line or more.

Methods used: Relative Screen Scraping

With regards
Amith


#2

Have you used “Read PDF Text”? You can filter the resulting string and obtain the required text with the help of regex, I guess. But since the data is unstructured, it might be complex. Give it a try and let me know if you are able to get the result.


#3

Yeah, I tried this, but how can we specify the required text so that it matches the scraped one? The required text is not fixed; it changes from page to page. I tried using regex and the Is Match activity, but no luck. This may work for small scenarios, for example if the scraped data is only numbers or we know the type of the scraped data (string, integer, etc.).


#4

Hi,
You mentioned that the headings are fixed, so you can take the substring between the two headings that surround the text you want to extract. Will that work? Since you will be passing the length dynamically (from the index of the first heading to the index of the second heading), this should work even though the length varies.


#5

So you are suggesting scraping the data together with the top and bottom headings and then doing string manipulation on it, right? But the length still varies from the next page onwards, I guess.

Can you please explain how passing the length dynamically is done? I am attaching the selector below.


#6

You don't need to give the length as a static value; it becomes dynamic if you compute it from the indexes of the headings.
Example: (output.IndexOf("Example 5: Distance and Time")-output.IndexOf("Speed (mph)"))
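A minimal sketch of the same index-based idea, written in Python rather than as a UiPath expression, and assuming the whole page text is already in one string. The function name `text_between`, the headings, and the sample pages are made up for illustration.

```python
def text_between(output: str, start_heading: str, end_heading: str) -> str:
    """Return the text between two fixed headings, whatever its length."""
    start = output.find(start_heading)
    if start == -1:
        return ""  # first heading not present
    start += len(start_heading)  # skip past the heading itself
    end = output.find(end_heading, start)  # search only after the first heading
    if end == -1:
        return ""  # second heading not present
    return output[start:end].strip()


# Hypothetical pages: same headings, different amounts of data in between.
page_one = "SUMMARY\nOne line only.\nEDUCATION\nB.Sc."
page_two = "SUMMARY\nFirst line.\nSecond line.\nThird line.\nEDUCATION\nM.Sc."

print(text_between(page_one, "SUMMARY", "EDUCATION"))
print(text_between(page_two, "SUMMARY", "EDUCATION"))
```

Because both positions are looked up in the text itself, nothing depends on where the headings sit on the page.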


#7

For this, the “output” text must contain the two headings, right? But there is still one thing that remains a mystery to me. Let's say I scraped the data (just a single line) between two headings, did the string manipulation you proposed above, and got the data.

What if, from the second page onwards, the data is larger, which pushes the headings down a bit so that they no longer match the initial coordinates? How can this be done dynamically, so that everything between these two headings is scraped in any situation?


#8

Can you share the sample data? I can tell you how to do that.


#9

Hi, try this out.

  • Read the entire PDF with OCR (use the Read PDF With OCR activity) and store the result in one output variable.

  • Apply regex to the output variable (use the Matches activity) to get the required text.

Attached is a sample for your reference: OCR.xaml (7.3 KB)
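As a rough illustration of the regex step (not the contents of the attached workflow), here is a Python sketch that assumes the OCR result is one long string covering several pages. The helper name `sections_between` and the sample text are hypothetical.

```python
import re


def sections_between(text, start_heading, end_heading):
    """Return every block found between the two headings, one per occurrence."""
    pattern = re.compile(
        # Non-greedy group so each match stops at the nearest end heading;
        # DOTALL lets "." cross line breaks for multi-line data.
        re.escape(start_heading) + r"(.*?)" + re.escape(end_heading),
        re.DOTALL,
    )
    return [block.strip() for block in pattern.findall(text)]


# Made-up OCR output for two pages with different amounts of data.
ocr_output = (
    "NAME\nJane Doe\nEDUCATION\nB.A. History\n"
    "NAME\nJohn Roe\nSenior Analyst\nEDUCATION\nM.Sc. Physics\n"
)

print(sections_between(ocr_output, "NAME", "EDUCATION"))
```

The non-greedy `(.*?)` is what keeps each match from running on to the end heading of a later page.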

Hope this helps!

Thank you,
Nitesh


#10

Hello, for certain reasons I cannot share that, but I can share a dummy PDF that reflects my requirement.

The PDF contains many resumes, and each resume is different. I need to get the NAME, SUMMARY OF QUALIFICATIONS, EDUCATION, PROFESSIONAL ACCOMPLISHMENTS, etc. for each resume.
resume-samples.pdf (294.7 KB)


#11

Hi, I will check this. Meanwhile, could you check this?

resume-samples.pdf (294.7 KB)


#12

Hi @amithvs,

As far as I know, it is difficult to extract data from the sample you provided because each resume is different. It is difficult to split the text on the header text, since the headers change from one resume to the next.

A human reader recognizes headings by the font and the bold property of the text. I don't know exactly how to implement this in your case, but you can try getting the frequency of all bold words together with their font size.

A resume generally has at least 5 headings, so you can find the font size at which the bold word frequency is at least 5 and split your text on those words. You can write code with the help of iTextSharp or any other library and adjust the word frequency threshold to your needs.
Thanks
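A rough Python sketch of that frequency heuristic. It assumes some PDF library has already produced a list of `(text, font_size, is_bold)` tuples; that input format, the function name `guess_headings`, and the sample data are all hypothetical, and no real iTextSharp calls are shown. Picking the largest qualifying font size is an extra assumption of this sketch.

```python
from collections import Counter


def guess_headings(spans, min_count=5):
    """Guess heading texts from (text, font_size, is_bold) tuples."""
    # Count how many bold spans appear at each font size.
    bold_sizes = Counter(size for _, size, bold in spans if bold)
    # Keep sizes that occur often enough to plausibly be section headings.
    candidates = [size for size, count in bold_sizes.items() if count >= min_count]
    if not candidates:
        return []
    heading_size = max(candidates)  # assumption: headings use the largest such size
    return [text for text, size, bold in spans if bold and size == heading_size]


# Made-up spans for one resume.
spans = [
    ("JANE DOE", 16.0, True),
    ("SUMMARY", 12.0, True), ("Experienced analyst.", 10.0, False),
    ("EDUCATION", 12.0, True), ("B.A. History", 10.0, False),
    ("SKILLS", 12.0, True), ("Python", 10.0, True),
    ("EXPERIENCE", 12.0, True), ("Acme Corp", 10.0, True),
    ("REFERENCES", 12.0, True), ("Available on request.", 10.0, False),
]

print(guess_headings(spans))
```

The returned heading texts could then be used as the split points for that particular resume.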


#13

Will try this, bro. Thanks.


#14

If the PDF format doesn't change, then try this idea (splitting the file):

  1. Read the PDF text with OCR and provide page ranges such as “1-17”, “17-34”, “34-50”. You can store the data in different files.

  2. From these files, extract the required information such as NAME, SUMMARY OF QUALIFICATIONS, EDUCATION, PROFESSIONAL ACCOMPLISHMENTS, etc.

Thanks,
Nitesh


#15

@niteshn @Divyashreem @Bharat

I have extracted the whole PDF to text, and I have a demo text file that needs to be processed. I am attaching it below; please have a look, guys.

Requirements: I need to get the data between *parts and *specs, and similarly the data between *specs and *price.
The size of the data in between may vary.

The activities I tried are attached below: demo.zip (2.1 KB)


#16

Hi, please find the solution below. There are two arrays; you can ignore the first element of each array in the final result. This will work for dynamic values in between these texts.
demo.xaml (7.6 KB)
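One reading of the split-based approach described above, sketched in Python rather than taken from the attached workflow. The function name `blocks_between` and the sample text are made up; only the `*parts`, `*specs` and `*price` markers come from the thread.

```python
def blocks_between(text, start_marker, end_marker):
    """Split on the start marker, then cut each piece at the end marker."""
    blocks = []
    # Skip the first piece: it is whatever precedes the first start marker.
    for piece in text.split(start_marker)[1:]:
        if end_marker in piece:
            blocks.append(piece.split(end_marker, 1)[0].strip())
    return blocks


# Made-up demo text with two records of different sizes.
demo = (
    "*parts\nbolt\nnut\n*specs\nM8 steel\n*price\n12.50\n"
    "*parts\nwasher\n*specs\nM8\nzinc plated\n*price\n3.00\n"
)

print(blocks_between(demo, "*parts", "*specs"))
print(blocks_between(demo, "*specs", "*price"))
```

Each call returns one entry per record, so the number of lines between the markers does not matter.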


#17

Thanks, it is working fine with the demo doc. But in the real scenario the output I am getting is a gap (blank space).

Real scenario doc example:

redmi note5 pro

Display: 5.55"
FHD
18:9 ratio
Storage: 4gb ram
64gb rom
external card not supported
Battery: 3300mah
li-ion

Required output:

output1: 5.55"
FHD
18:9 ratio

output2: 4gb ram
64gb rom
external card not supported (data in between Storage and Battery)

demo2.zip (251 Bytes)
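A sketch of one way to handle this layout, where the first value shares a line with its label and further values follow on their own lines. It is written in Python as an illustration only; the function name `parse_labeled_blocks` is hypothetical, and it assumes the set of labels is known in advance.

```python
def parse_labeled_blocks(text, labels):
    """Collect the lines that belong to each 'Label: value' block."""
    result = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        head, sep, rest = stripped.partition(":")
        if sep and head in labels:
            # A known label starts a new block; keep the value on the same line.
            current = head
            result[current] = [rest.strip()] if rest.strip() else []
        elif current is not None:
            # Any other line (including "18:9 ratio") continues the current block.
            result[current].append(stripped)
    return {label: "\n".join(lines) for label, lines in result.items()}


doc = (
    'redmi note5 pro\n\nDisplay: 5.55"\nFHD\n18:9 ratio\n'
    "Storage: 4gb ram\n64gb rom\nexternal card not supported\n"
    "Battery: 3300mah\nli-ion\n"
)

specs = parse_labeled_blocks(doc, {"Display", "Storage", "Battery"})
print(specs["Display"])
print(specs["Storage"])
```

Checking the text before the colon against the known labels is what keeps a value such as "18:9 ratio" from being mistaken for a new heading.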


#18

Try this,

demo.xaml (6.1 KB)

Here is the solution for output1; you can follow the same approach for output2. This will work even if the lines or values between Display and Storage are dynamic.


#19

Thanks, will check this. What is the 8, by the way?


#20

Have you tried regex?