Hi,
I have multiple PDFs with completely different structures. I want to extract specific data that appears under different keywords and in different positions in each PDF. For example, in the first PDF it is “distance in mile:679”, but in the second it is “miles travelled:360”. There are more than 100 PDFs, each with completely different keywords like these. I need to extract that number whatever the keyword is. We have already tried regex, but it is complicated because of the different keywords.
This is a case where either a complex regex might help, or a custom extractor that you build from the rules you can identify in your documents.
Options I see off the top of my head:
regex: your examples show a number you need to capture, with a variation of “miles” somewhere next to it. Try capturing this with a few regular expressions. (Please note you can use regex extraction multiple times in the same scope, so that the later expressions act as fallbacks!)
build a custom extractor based on rules: find the word “miles”, find close-by numbers (positionally), decide on the best match.
build a machine learning model to learn these positions, if you have the luxury of many annotated sample files. This would need training, and of course an activity that consumes the trained model.
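To make the rule-based option more concrete, here is a minimal Python sketch of "find the keyword, find close-by numbers, decide on the best match". The keyword variants, the 40-character window, and the penalty for numbers that appear before the keyword are all assumptions for illustration, not rules taken from your documents, so tune them against real samples.

```python
import re

# Assumed keyword variants; extend with whatever your PDFs actually use.
KEYWORD = re.compile(r"miles?|distance", re.IGNORECASE)
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def nearest_number(text, max_gap=40, before_penalty=2):
    """Return the number closest to any keyword hit, or None if nothing is close."""
    best = None  # (score, number_text); lower score is better
    numbers = list(NUMBER.finditer(text))
    for kw in KEYWORD.finditer(text):
        for num in numbers:
            if num.start() >= kw.end():
                # Number follows the keyword (same line or a later line).
                gap = num.start() - kw.end()
                score = gap
            elif num.end() <= kw.start():
                # Number precedes the keyword; assumed less likely, so penalised.
                gap = kw.start() - num.end()
                score = gap * before_penalty
            else:
                continue  # overlapping spans, ignore
            if gap <= max_gap and (best is None or score < best[0]):
                best = (score, num.group())
    return best[1] if best else None

print(nearest_number("distance in mile:679"))
print(nearest_number("miles travelled\n680"))
```

Because the gap is counted in characters and a line break is just one character, a number on the line after the keyword is still found. The weak point is a page with several numbers near the keyword, where the scoring rule would need more thought.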
@jeena.reji I am just suggesting a solution. Check if it suits you.
Store all the keywords in an Excel column and then use a For Each loop to go through them one by one.
Example: pdfText is a variable which contains the PDF text.
dt is a DataTable variable which contains all the keywords.
For each row (DataRow) in dt
-- If pdfText.Contains(row(0).ToString)
---- Then part
------ use any string manipulation/regex method to get the required text.
---- Else part
------ do nothing
The above might be useful; just try it once. If a new keyword comes up, you only need to add it to the Excel file. There is no need to change the code.
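For readers who prefer code to workflow steps, the same loop can be sketched in Python. This is only an illustration under assumptions: the hard-coded `KEYWORDS` list stands in for the Excel column, the function name `extract_by_keywords` is made up, and the pattern assumes the number comes after the keyword, separated by at most a colon or equals sign and some whitespace.

```python
import re

# Stand-in for the keyword column read from Excel (assumed sample values).
KEYWORDS = ["distance in mile", "miles travelled", "distance travelled"]

def extract_by_keywords(pdf_text, keywords):
    """Try each keyword in turn; return (keyword, number) for the first hit, else None."""
    for keyword in keywords:
        # Keyword, optional ":" or "=", optional whitespace (including a
        # line break), then an integer or decimal number.
        pattern = re.escape(keyword) + r"\s*[:=]?\s*(\d+(?:\.\d+)?)"
        match = re.search(pattern, pdf_text, re.IGNORECASE)
        if match:
            return keyword, match.group(1)
    return None

print(extract_by_keywords("Report: miles travelled:360", KEYWORDS))
```

Adding a new keyword then means adding one entry to the list, which mirrors adding a row in Excel without touching the workflow.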
@Manjuts90
It seems useful, but the problem is that the position of the number relative to the keyword differs: in one PDF it is “distance travelled:789”, while in another it is “miles travelled”
680
@jeena.reji Extracting data from varying positions like the above will be difficult. I don’t know how to do it. Someone else on the forum may be able to help you.