How to extract sentences from PDF (not lines)

Krithi1 · June 17, 2025, 3:52pm

Hi

I need to extract sentences from a PDF. How can I do that? I am not too familiar with Regex.

I tried the following expression, and it extracts lines, not sentences.

System.Text.RegularExpressions.Regex.Split(wordOutput, "\r\n|\r|\n").Where(Function(x) Not String.IsNullOrWhiteSpace(x)).ToArray

postwick · June 17, 2025, 4:47pm

You’re splitting it up by carriage return/line feed. So that’s why you get lines. Since sentences end with a period you want to split it up by periods.
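As a rough illustration of splitting on sentence-ending punctuation instead of line breaks, here is a minimal sketch in Python. The function name split_sentences is made up for this example, and it assumes every sentence ends with ".", "!" or "?" followed by whitespace, so abbreviations such as "e.g." would be split incorrectly. The same lookbehind pattern should also be usable with Regex.Split in .NET.

```python
import re

def split_sentences(text):
    """Split text into sentences at whitespace that follows ., ! or ?."""
    # The lookbehind (?<=[.!?]) keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces (for example when the input is empty).
    return [p for p in parts if p]

if __name__ == "__main__":
    for sentence in split_sentences("First sentence. Second one? Third!"):
        print(sentence)
```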

Krithi1 · June 17, 2025, 4:59pm

@postwick

The sentences don't have periods.

Also, the same regex expression works for Word but not for the PDF. I am confused about why this happens only for the PDF.

I mean, I have the same document in both Word and PDF. The same regex works for Word (it extracts the sentences) but not for PDF (it extracts lines).

postwick · June 17, 2025, 6:00pm

Then how do you know it’s a sentence? How do you define sentence?

It’s working differently because the position of the linefeeds is different between the two document types.
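To illustrate the linefeed difference, the sketch below joins the visual lines of PDF-extracted text back into paragraphs. It is only a sketch under an assumption that may not hold for every PDF: that paragraphs are separated by a blank line in the extracted text. The function name unwrap_lines is hypothetical.

```python
import re

def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs, using blank lines as boundaries."""
    # Normalize Windows and old Mac line endings to "\n".
    text = re.sub(r"\r\n|\r", "\n", text)
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [
        " ".join(line.strip() for line in p.split("\n") if line.strip())
        for p in paragraphs
    ]
    return [p for p in joined if p]

if __name__ == "__main__":
    sample = "This is a long\r\nsentence wrapped\r\nover lines\r\n\r\nNext paragraph\nhere"
    for paragraph in unwrap_lines(sample):
        print(paragraph)
```

If the extracted text has no blank lines between paragraphs, everything ends up in a single paragraph, and some other rule (such as punctuation) is needed to find the boundaries.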

Krithi1 · June 17, 2025, 6:06pm

@postwick

I am not exactly sure. Somebody else helped me with that regex, and it's working for breaking Word text into sentences, but not for PDF.

postwick · June 17, 2025, 6:13pm

The typical definition of sentence is that it ends in a period. This is just basic grammar. So you have a different definition for sentence. It sounds like you need to define your requirements in more detail to determine how to break it up.

Anil_G · June 18, 2025, 3:54am

@Krithi1

It worked in Word because text read from Word treats spaces and new lines differently: even if a sentence visually continues onto the next line, it is still considered a single line as long as no line break is used.

But for PDF, the text is read differently: each line you see visually is treated as a separate line.

So, coming to the regex, you need a period (.) to know where a sentence ends; otherwise it would be difficult to break the text up correctly.

Alternatively, please attach a sample PDF here, and maybe we can check whether there is anything else that can be used.

Cheers
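Putting the replies above together, one possible approach is to remove the line breaks from the PDF text first and then split on sentence-ending punctuation. The Python sketch below assumes the sentences do end with ".", "!" or "?", which the original poster says is not the case for their document, so treat it as an illustration of the idea only. The function name pdf_text_to_sentences is made up, and words hyphenated across lines are not repaired.

```python
import re

def pdf_text_to_sentences(text):
    """Flatten line-wrapped text, then split it at sentence-ending punctuation."""
    # Collapse every run of whitespace, including line breaks, into one space.
    flat = re.sub(r"\s+", " ", text).strip()
    # Split at the space that follows ., ! or ? and keep the punctuation.
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]

if __name__ == "__main__":
    sample = "The robot reads\nthe PDF. It then splits\r\nthe text. Done"
    for sentence in pdf_text_to_sentences(sample):
        print(sentence)
```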
