PDF extraction from different PDFs

Hi,
I have multiple PDFs with completely different structures. I want to extract specific data that appears under different keywords and at different positions in each PDF. For example, in the first PDF it is “distance in mile:679”, but in the second it is “miles travelled:360”. There are more than 100 PDFs, each with completely different keywords like these. I need to extract that number whatever the keyword is. We have already tried regex, but it gets complicated because of the different keywords.

Any idea will be appreciated.
Thanks

Hello @jeena.reji, and welcome!

This is a case in which either a complex regex might help, or a custom extractor that you build from the rules you can identify in your documents.

Options I see off the top of my head:

  1. Regex: your examples show a number you need to capture, with some variation of “miles” next to it. Try capturing this with a few regular expressions. (Please note that you can use regex extraction multiple times in the same scope, so that the later expressions act as fallbacks.)

  2. Build a custom extractor based on rules: find the word “miles”, find nearby numbers (positionally), and decide on the best match.

  3. Build a machine learning model to learn these positions, if you have the luxury of many annotated sample files. This would need training, and an activity that consumes the model, of course.
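To make option 1 a bit more concrete, here is a minimal sketch of the fallback idea, written in Python rather than as UiPath activities. The two patterns and the function name `extract_miles` are only assumptions based on the two examples in the question; real documents will likely need more patterns.

```python
import re

# Patterns are tried in order; the first one that matches wins.
# Both are assumptions derived from the two sample strings in the question.
PATTERNS = [
    # keyword first, then ':' or '=', then the number: "miles travelled:360"
    re.compile(r"miles?\b[^\d\n]*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE),
    # fallback: number first, then the keyword: "360 miles"
    re.compile(r"(\d+(?:\.\d+)?)\s*miles?\b", re.IGNORECASE),
]

def extract_miles(text):
    """Return the first mileage found by any pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return float(match.group(1))
    return None

print(extract_miles("distance in mile:679"))
print(extract_miles("miles travelled:360"))
```

Ordering the patterns from most specific to most general keeps a loose fallback from overriding a precise match.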

Hope this helps,

Ioana

@jeena.reji I am just suggesting a solution. Check whether it suits you.

  1. Store all keywords in an Excel column, then use a For Each loop to go through the keywords one by one.

Example: pdfText is a variable that contains the PDF text.
dt is a DataTable variable that contains all the keywords.

For each DataRow row in dt
-- If pdfText.Contains(row(0).ToString)
---- Then part
use any string manipulation/regex method to get the required text.
---- Else part
do nothing

The above might be useful; just try it once. If a new keyword comes up, you only need to add it to the Excel file. There is no need to change the code.
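The same loop can be sketched outside UiPath as well. The Python below is only an illustration: the keyword list stands in for the Excel column, `find_value` is a made-up name, and the rule "take the first number within 20 non-digit characters after the keyword, on the same line" is an assumption, not something taken from the actual documents.

```python
import re

def find_value(pdf_text, keywords):
    """Try each keyword in turn; return (keyword, number) for the first hit."""
    for keyword in keywords:
        # Keyword, then up to 20 non-digit characters on the same line,
        # then the number we want to capture.
        pattern = re.escape(keyword) + r"[^\d\n]{0,20}(\d+(?:\.\d+)?)"
        match = re.search(pattern, pdf_text, re.IGNORECASE)
        if match:
            return keyword, float(match.group(1))
        # No match for this keyword: do nothing and try the next one.
    return None

# In the workflow these would come from the Excel column.
keywords = ["distance in mile", "miles travelled"]
print(find_value("Trip report\nmiles travelled:360\n", keywords))
```

Adding a new keyword then only means adding a new entry to the list (or a new row in the Excel file).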

@Manjuts90
It seems useful, but the problem is that the position of the number relative to the keyword differs. In one PDF it is on the same line, like “distance travelled:789”, while in another it is on the line below the keyword:
miles travelled
680

@jeena.reji Can you provide a few more examples? I didn’t get which position you are referring to.

@Manjuts90 [images: Capture1, Capture2]

Has anybody used iTextSharp for extracting this kind of data? If yes, please help with that too.

@jeena.reji I am unable to see your images. Can you paste the data in text format?

Distance in miles:180
Miles Travelled
…180

@jeena.reji Extracting data from different positions like the above will be difficult. I don’t know how to do it. Someone else on the forum may be able to help you.

@jeena.reji I have created a sample workflow for the two data samples you posted. It is just an idea of how we can extract the data.

Diff.xaml (12.3 KB)
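For readers who cannot open the attachment, here is one possible way to handle the two layouts from the samples (number on the same line as the keyword, or on a line below it). This is not the content of Diff.xaml, just a rough Python sketch; the function name `value_near_keyword` and the `max_lines_below` limit are assumptions.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def value_near_keyword(text, keyword, max_lines_below=2):
    """Find the keyword, then look for a number on the same line after it,
    or failing that on the next few lines below it."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        hit = re.search(re.escape(keyword), line, re.IGNORECASE)
        if not hit:
            continue
        # Layout 1: "Distance in miles:180" (number follows on the same line)
        same_line = NUMBER.search(line, hit.end())
        if same_line:
            return float(same_line.group())
        # Layout 2: keyword on one line, number on a line below it
        for below in lines[i + 1 : i + 1 + max_lines_below]:
            found = NUMBER.search(below)
            if found:
                return float(found.group())
    return None

print(value_near_keyword("Distance in miles:180", "miles"))
print(value_near_keyword("Miles Travelled\n...180", "miles travelled"))
```

This only works if the PDF text extraction preserves line order. If the number sits in a different column or text block, a positional approach based on coordinates would be needed instead.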