In my process, I’m searching for a word in a PDF and then want to extract the sentence it appears in. Every sentence is different, though, so I have no idea how to extract it: I can’t, for example, tell it to extract the text between “A” and “B”.
For example, if the word “information” is being searched for, these are the types of sentences it would bring up:
This may include for example information regarding the location of buried utilities such as water or gas lines.
The contractor should use the Client specified list of formats for all information deliverables (unless otherwise stated).
Select data formats to be used for project information delivery should include; Graphical, Non-Graphical and Documentation.
1-First, try to split your text into sentences.
2-Then loop over the array of sentences.
3-Inside the loop, check each sentence for the word (information) using Contains.
4-If a sentence contains the word, you can easily get it by its array index.
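The four steps above can be sketched roughly like this. It is written in Python purely for illustration, and the same logic should port to whatever tool runs your process. The function name `sentences_containing` is made up for this example, and the splitting rule (break after ".", "!" or "?" followed by whitespace) is a naive assumption that will mis-split abbreviations such as "e.g.".

```python
import re

def sentences_containing(text, word):
    """Return every sentence in `text` that contains `word` as a whole word."""
    # PDF extraction usually leaves line breaks mid-sentence, so flatten them first.
    flat = re.sub(r"\s+", " ", text).strip()
    # Step 1: naive sentence split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    # Steps 2-4: loop over the sentences and keep the ones containing the word.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

sample = (
    "This may include for example information regarding the location of\n"
    "buried utilities such as water or gas lines. The site is open on Mondays. "
    "The contractor should use the Client specified list of formats for all "
    "information deliverables (unless otherwise stated)."
)
for sentence in sentences_containing(sample, "information"):
    print(sentence)
```

Because the split only happens when the punctuation is followed by whitespace, a clause number such as 41.2 is not broken in two.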
Have you tried toggling the PreserveFormat property and checking the results?
There are times when the data is in a table format, and we might need to adjust the table into the format we require first and then do the operation. But this would also require us to know whether the data is in a particular format for all the PDF files that you would receive.
Alternatively, we could help with regex expressions if we can see the sample data, or if it is certain that all the data follows a particular pattern.
PreserveFormat doesn’t change much, and the PDFs will be sent from different companies, meaning the format will never be the same, unfortunately. Apparently it’s rare for a PDF to have table formats like the one I tested, so I don’t think it’ll be an issue.
One last question, sorry! At the moment it’s picking up the line that “resource” is on. Is there a way to get the text from number to number, i.e. get all the text from 41.2 to 41.3? The numbers will change each time.
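One possible way to approach the number-to-number case, sketched in Python for illustration only: find every clause number, then take the text from the last number before the keyword up to the next number after it. This assumes clause numbers look like digits, a dot, digits (41.2, 41.3) and sit at the start of a line, which may not hold for every company's PDFs. The names `CLAUSE` and `clause_containing` are hypothetical.

```python
import re

# Assumption: a clause number is "digits.digits" at the start of a line.
CLAUSE = re.compile(r"^[ \t]*\d+\.\d+", re.MULTILINE)

def clause_containing(text, word):
    """Return the numbered clause that contains `word`, or None if not found."""
    hit = re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
    if hit is None:
        return None
    # Positions where each numbered clause begins.
    starts = [m.start() for m in CLAUSE.finditer(text)]
    before = [s for s in starts if s <= hit.start()]
    after = [s for s in starts if s > hit.start()]
    begin = before[-1] if before else 0       # fall back to the start of the text
    end = after[0] if after else len(text)    # fall back to the end of the text
    return text[begin:end].strip()

sample = (
    "41.1 General requirements apply.\n"
    "41.2 The contractor shall provide a named\n"
    "resource for each work package.\n"
    "41.3 Payment terms are set out below.\n"
)
print(clause_containing(sample, "resource"))
```

Since the boundaries are found relative to the keyword, the actual numbers never need to be known in advance. Only the first occurrence of the word is handled here.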