Hi guys, I am stuck with something. I have a PDF with numerous pages, and the data on each page may vary from page to page. Each block of data has a heading, which is the only fixed thing in the PDF. The position keeps varying. Scraping may work for some pages, but not for all of them. So I am looking for a general solution for scraping data when its position keeps varying.
Let's say I need to scrape the data between two points, A and B. The size of the data may vary; it may be a single line or more.
Have you used “Read PDF Text”? You can filter the resulting string and obtain the required text with the help of regex, I guess. But since the data is unstructured, it might be complex. Give it a try and let me know if you are able to get the result.
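To illustrate the regex idea, here is a minimal sketch in Python (the same pattern should carry over to other regex engines). It assumes the whole page has already been read into one string; the function name `between` and the headings "HEADING A" and "HEADING B" are hypothetical placeholders, not names from the actual PDF.

```python
import re

def between(text, start_heading, end_heading):
    """Return the text between two fixed headings, or None if either is missing."""
    # re.escape treats the headings as literal text; (.*?) is non-greedy,
    # so the match stops at the first occurrence of the end heading.
    pattern = re.escape(start_heading) + r"(.*?)" + re.escape(end_heading)
    # re.DOTALL lets "." match newlines, so multi-line data is captured too.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Invented sample pages: one line of data on the first, three on the second.
page_one = "HEADING A\nsingle line\nHEADING B\n"
page_two = "intro\nHEADING A\nline one\nline two\nline three\nHEADING B\nfooter"

print(between(page_one, "HEADING A", "HEADING B"))
print(between(page_two, "HEADING A", "HEADING B"))
```

Because the pattern is anchored on the heading text rather than on a position, it does not matter how many lines sit between the two headings or where they fall on the page.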
Yeah, I tried this, but how can we specify the required text so that it matches the scraped one? The required text is not fixed; it changes from page to page. I tried using regex and Is Match, but no luck. This may work for small scenarios, for example if the scraped data is only numbers or we know the type of the scraped data (string, integer, etc.).
Hi,
You mentioned that the headings are fixed. You can extract the text between the two headings with Substring, right? Would that work? Since you pass the length dynamically (from the index of the first heading to the index of the second heading), this should work even though the length varies.
So you are suggesting scraping the data together with the top and bottom headings and then doing string manipulation on it, right? But then the length also varies from the next page onwards, I guess.
Passing the length dynamically: can you please explain how that is done? I am attaching the selector below.
You don't need to give the length as a static value; it becomes dynamic if you compute it from the indexes.
Example: (output.IndexOf("Example 5: Distance and Time") - output.IndexOf("Speed (mph)"))
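As a rough sketch of the same index arithmetic in Python (`str.find` plays the role of `IndexOf`): the function name `slice_between` and the sample text are made up for illustration, and it assumes "Speed (mph)" appears before "Example 5: Distance and Time" in the output.

```python
def slice_between(output, first_heading, second_heading):
    """Return the text between two headings using index arithmetic."""
    start = output.find(first_heading)
    if start == -1:
        raise ValueError("first heading not found: " + first_heading)
    # Skip past the first heading so it is not part of the result.
    start += len(first_heading)
    # Search for the second heading only after the first one.
    end = output.find(second_heading, start)
    if end == -1:
        raise ValueError("second heading not found: " + second_heading)
    # The length (end - start) is computed per page, so it is never hard-coded.
    return output[start:end].strip()

# Invented sample text; the real "output" would come from the PDF.
sample = "Speed (mph)\n30\n45\n60\nExample 5: Distance and Time\nmore text"
print(slice_between(sample, "Speed (mph)", "Example 5: Distance and Time"))
```

Note that subtracting the two indexes alone gives a length that still includes the first heading, which is why the sketch advances `start` by the heading's length before slicing.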
For this, the “output” text must contain both headings, right? But there is still one thing that remains a mystery to me. Let's say I scraped the data (just a single line) between two headings, did the string manipulation you proposed above, and got the data.
What if, from the second page onwards, the data grows and pushes the headings further down, away from the initial coordinates? How can this be done dynamically, i.e. scrape everything between these two headings in any situation?
Hello, for certain reasons I cannot share that, but I can share a dummy PDF that reflects my requirement.
The PDF contains many resumes, and each resume is different. I need to get the NAME, SUMMARY OF QUALIFICATIONS, EDUCATION, PROFESSIONAL ACCOMPLISHMENTS, etc. for each resume.
resume-samples.pdf (294.7 KB)
As far as I know, it is difficult to extract data from the sample you provided because each resume is different. It is difficult to split the text on header text because the headers change from one resume to the next.
The human brain recognizes headings by the font and the bold property of the text. I don't know how to implement this in your case, but you could try computing the frequency of all bold words along with their font sizes.
A resume generally has at least 5 headings, so you can find the font size at which the word frequency is at least 5 and split your text on those words. You can write code with the help of iTextSharp or any other library and adjust the frequency threshold to your needs.
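A minimal sketch of this heuristic in Python, under some loud assumptions: the input is a list of `(text, font_size, is_bold)` tuples that a PDF library would have to produce (no real library call is shown here), the function names are hypothetical, the sample spans are invented, and picking the largest qualifying font size is my own tie-break rather than part of the suggestion above.

```python
from collections import Counter

def guess_heading_size(spans, min_count=5):
    """Return the largest font size used by at least min_count bold spans."""
    counts = Counter(size for _, size, bold in spans if bold)
    candidates = [size for size, n in counts.items() if n >= min_count]
    return max(candidates) if candidates else None

def split_by_headings(spans, heading_size):
    """Group the text that follows each bold span of the heading size."""
    sections = {}
    current = None
    for text, size, bold in spans:
        if bold and size == heading_size:
            current = text            # a new heading starts a new section
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return {heading: " ".join(body) for heading, body in sections.items()}

# Invented spans for one resume: (text, font size, is bold).
spans = [
    ("JANE DOE", 16.0, True),
    ("SUMMARY OF QUALIFICATIONS", 12.0, True),
    ("Ten years of experience.", 10.0, False),
    ("EDUCATION", 12.0, True),
    ("B.S. in Biology", 10.0, False),
    ("PROFESSIONAL ACCOMPLISHMENTS", 12.0, True),
    ("Led a team of five.", 10.0, False),
    ("SKILLS", 12.0, True),
    ("Python, SQL", 10.0, False),
    ("REFERENCES", 12.0, True),
    ("Available on request.", 10.0, False),
]

heading_size = guess_heading_size(spans)
sections = split_by_headings(spans, heading_size)
print(heading_size, list(sections))
```

The name at the top is bold but appears only once at its size, so the frequency threshold filters it out; it would need separate handling, for example by treating the first bold span as the name.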
Thanks
I have extracted the whole PDF to text, and I have a demo text file that needs to be processed. I am attaching it below; please have a look, guys.
Requirements: I need to get the data between *parts and *specs, and similarly the data between *specs and *price. The size of the data in between may vary.
The activities I tried are attached below. demo.zip (2.1 KB)
Hi, please find the solution below. It produces two arrays; you can ignore the first element of each array for the final result. This works for dynamic values in between these markers. demo.xaml (7.6 KB)
Here is the solution for output1; you can follow the same approach for output2. It works even when the number of lines or the values between Display and Storage vary.
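For readers who cannot open the attachment, here is one way a split-based approach could look in Python. This is only a sketch and not a transcription of demo.xaml: the function name `between_markers` is hypothetical and the sample text is invented, with only the markers *parts, *specs and *price taken from the requirement.

```python
def between_markers(text, start_marker, end_marker):
    """Return the text between two markers by splitting, or None if one is missing."""
    # Split once on the start marker: [text before, text after].
    pieces = text.split(start_marker, 1)
    if len(pieces) < 2:
        return None
    # Ignore the first element (everything before the start marker),
    # then split the remainder once on the end marker.
    tail = pieces[1].split(end_marker, 1)
    if len(tail) < 2:
        return None
    return tail[0].strip()

# Invented stand-in for the demo text file.
demo_text = "*parts\nDisplay\nStorage\nBattery\n*specs\n6.1 inch\n128 GB\n*price\n499\n"

output1 = between_markers(demo_text, "*parts", "*specs")
output2 = between_markers(demo_text, "*specs", "*price")
print(output1)
print(output2)
```

Since each split is keyed on the marker text, adding or removing lines between the markers does not change the logic.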