How to extract sentences from a PDF (not lines)

Hi

I need to extract sentences from a PDF. How can I do that? I am not very familiar with regex.

I tried the following expression, but it extracts lines, not sentences.

System.Text.RegularExpressions.Regex.Split(wordOutput, "\r\n|\r|\n").Where(Function(x) Not String.IsNullOrWhiteSpace(x)).ToArray

You’re splitting on carriage return/line feed, which is why you get lines. Since sentences end with a period, you want to split on periods instead.
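As a rough sketch of what splitting on periods could look like, assuming every sentence ends in ".", "!" or "?" followed by whitespace. The names `SentenceSplitDemo` and `SplitSentences` are made up for illustration; only `Regex.Split` and the LINQ calls are real .NET APIs.

```vb
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module SentenceSplitDemo
    ' Hypothetical helper: splits on whitespace that directly follows
    ' ".", "!" or "?". The lookbehind keeps the punctuation attached
    ' to the sentence it ends.
    Function SplitSentences(text As String) As String()
        Return Regex.Split(text, "(?<=[.!?])\s+").
            Where(Function(s) Not String.IsNullOrWhiteSpace(s)).
            Select(Function(s) s.Trim()).
            ToArray()
    End Function

    Sub Main()
        Dim sample As String = "First sentence. Second one! Is this the third?"
        For Each sentence In SplitSentences(sample)
            Console.WriteLine(sentence)
        Next
        ' Prints:
        ' First sentence.
        ' Second one!
        ' Is this the third?
    End Sub
End Module
```

The same pattern could replace "\r\n|\r|\n" in the one-line expression from the question. Note that it will also split after abbreviations such as "Dr.", so it only works if the text has reliable end-of-sentence punctuation.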

@postwick

The sentences don't have periods.

Also, the same regex works for Word but not for the PDF. I am confused about why this happens only with the PDF.

I mean, I have the same document in both Word and PDF. The same regex works for Word (it extracts the sentences) but not for the PDF (it extracts lines).

Then how do you know it’s a sentence? How do you define sentence?

It’s working differently because the position of the linefeeds is different between the two document types.

@postwick

I am not exactly sure. Somebody else helped me with that regex, and it works for breaking Word text into sentences, but not for PDF.

The typical definition of a sentence is that it ends in a period. This is just basic grammar. So you have a different definition of a sentence. It sounds like you need to define your requirements in more detail to determine how to break the text up.

@Krithi1

It worked in Word because text read from Word treats new lines differently: even if a sentence visually continues on the next line, it is still considered a single line as long as no explicit line break is used.

But for a PDF the text is read differently: each line as it appears visually is treated as a separate line.
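If the visual line breaks in the PDF text are the only problem, one option is to join the wrapped lines back together before doing any sentence splitting. A minimal sketch, assuming line breaks in the extracted text carry no meaning of their own (so paragraph breaks are flattened too) and that words are not hyphenated across lines. `UnwrapDemo` and `UnwrapLines` are hypothetical names.

```vb
Imports System
Imports System.Text.RegularExpressions

Module UnwrapDemo
    ' Hypothetical helper: replaces each run of line breaks (plus any
    ' surrounding whitespace) with a single space.
    Function UnwrapLines(text As String) As String
        Return Regex.Replace(text, "\s*[\r\n]+\s*", " ").Trim()
    End Function

    Sub Main()
        Dim pdfText As String = "This sentence wraps" & Environment.NewLine & "across two lines."
        Console.WriteLine(UnwrapLines(pdfText))
        ' Prints: This sentence wraps across two lines.
    End Sub
End Module
```

After this step the text is one continuous string, so where a sentence ends still has to come from something in the text itself, such as punctuation.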

So, coming to the regex: you need a period (.) to know where a sentence ends; otherwise it would be difficult to break the text into sentences.

Alternatively, please attach a sample PDF here; maybe we can check whether there is anything else that can be used.

Cheers