PDF Scraping - String Manipulation / Regex help

Using Enterprise 19.4.4

I’m also not a developer, so I will likely need additional details or a more thorough explanation of any suggestions.

I’m trying to scrape a list of variable length from a PDF. There’s a general pattern to the data that needs to be extracted.

Red boxes - data to extract

I have been unable to figure out a way to repeat the data extraction process for multiple reports through string manipulation. Searching through the forums, I saw someone recommend Regex to another user who was trying to break out structured text.

If anyone has a suggestion about how to best approach capturing this data it would be greatly appreciated.

Extracted text:

[!M N E _S!]_2
Reposition sprinkler heads in film room.
Currently, the sprinkler heads in the basement film room are several feet below the height of the ceiling.
The hazard with having the heads too far from the ceiling is that they will have a delayed activation since
heat rises. A contractor should be contacted to move the heads closer to the ceiling.
Loss Expectancies
Exposure to Loss is approximately: 2,510,000 PD
Minimal BI
Exposure to Loss if Completed is approximately: 1,000,000 PD
Minimal BI
Cost Estimate: 8,000
Text Points Completion of only this recommendation will result in a Text score increase
of 1.71 Points.
Status text
[!MNE_E!]_2[#] 19-05-002
[!M N E _S!]_3
Document and improve the sprinkler control valve supervision program.
The following should be done as part of a regular automatic sprinkler control valve supervision program:
Weekly: Visually inspect all sprinkler control valves to verify that they are locked and in the open
Quarterly: Test all sprinkler system waterflow alarms. Record the time to alarm.
These inspections and tests should be documented and saved for review by the next visiting
Loss Expectancies
Acting on this item would reduce the probability or severity of loss.
Exposure to Loss if Completed is approximately: Minimal PD
Minimal BI
Cost Estimate: 5,000
Text Points Completion of only this recommendation will result in a text score increase
of 3.07 Points.
Index: # / Account: # / Order ID: #
FM Global Risk Report The Hearst Corporation
[ ! H D R ! ]
Status text
[!MNE_E!]_3[#] 19-05-003

Any guidance?

Hi @rridlen,

It’s hard to say what the pattern is, because you provided just one example. I understand that the headers “19-05-002” and “19-05-003”, and the titles that go with them, “Reposition sprinkler heads in film room.” and “Document and improve the sprinkler control valve supervision program.”, will always be in the scope of extraction, but what about numeric values like 1.71? Will such a value always appear in that kind of sentence, and will it appear only once?

In general, when it comes to the headers, rather than using Regex and string manipulation on the scraped PDF text, maybe consider exploring the selector values of the headers in UI Explorer? It’s possible that there is some “header” parameter that will always extract the headers for you with a simple Get Text/Copy Text activity.

1.71 will vary from entry to entry, but a number will always be in the segment between “increase of” and “Points”. Based on the example I was provided it appears to only happen once, but that’s not necessarily a hard rule.

I’ll try UI Explorer and see if I can find a parameter.

Thank you.


No luck with the UI Explorer. There are no unique selectors for any elements in the PDF. It’s just viewed as one big document.


Ok, let’s try to solve the other thing then.
If you scrape the entire content of the PDF document and the numeric value always sits between “increase of” and “Points”, you can use logic like the following:

Let’s say that BigString is a String holding the whole content of the PDF, and smallString will be your 1.71 or whichever other numeric value appears there.
Length, FirstIndex and SecondIndex are helper Int32 variables.

FirstIndex = BigString.IndexOf("increase of") + "increase of".Length
SecondIndex = BigString.IndexOf("Points", FirstIndex)
Length = SecondIndex - FirstIndex

smallString = BigString.Substring(FirstIndex, Length).Trim

Not sure if it will work perfectly, but I used it on some test cases and it did. Give it a try.
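For anyone who wants to try the regex route for the same value, here is a minimal sketch in Python (not a UiPath expression; the helper name extract_points and the pattern are my own assumptions based on the sample text above). One thing to watch in the extracted text: “increase” and “of” sit on different lines, so the pattern allows any whitespace, including a line break, between the words. It also returns every match, since the number may appear more than once per report.

```python
import re

# Assumed pattern: "increase", whitespace (possibly a line break), "of",
# a number such as 1.71, then "Points".
POINTS_PATTERN = re.compile(
    r"increase\s+of\s+(\d+(?:\.\d+)?)\s+Points", re.IGNORECASE
)

def extract_points(big_string):
    """Return every number found between 'increase of' and 'Points'."""
    return POINTS_PATTERN.findall(big_string)

# Sample lines copied from the extracted text in the question.
sample = (
    "Text Points Completion of only this recommendation will result in a Text score increase\n"
    "of 1.71 Points.\n"
    "Status text\n"
    "Text Points Completion of only this recommendation will result in a text score increase\n"
    "of 3.07 Points.\n"
)

print(extract_points(sample))
```

The same pattern string should work with System.Text.RegularExpressions.Regex in a UiPath Assign, since \s, \d and non-capturing groups behave the same way there.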

I suggest you use Computer Vision. I think it’s a good solution for your case.

I managed to get the first few samples working completely with string manipulation. However the newest round of samples introduced a new variable I’m struggling to solve.

RecommendationHeader = Split((Split(Rec1,"[!M N E _S!]_1")(1).ToString),"."c)(0).ToString

[!M N E _S!]_1
Eliminate plastic crate storage along west wall of Building No. 16 or upgrade sprinkler protection.
Eliminate plastic crate storage or upgrade sprinkler protection in accordance with FM Global Property
Loss Prevention Data Sheet 8-9, Storage of Class 1, 2, 3, 4 and Plastic Commodities. A less optimal
alternate option is to store the plastic crates within the footprint of the adjacent baled waste paper storage
area in the northwest corner of Building No. 16. The sprinkler system of this area has been reinforced
and is adequate.

  • RecommendationHeader is returning a value of Eliminate plastic crate storage along west wall of Building No.
  • I’m looking to get Eliminate plastic crate storage along west wall of Building No. 16 or upgrade sprinkler protection.

I checked the other samples, and the header info for every recommendation always comes after an [!M N E _S!]_N marker (where N is a number) and always on the line immediately below it. Is there a good way to use RegEx to grab that line, or does someone have a recommendation to handle the Building No. issue?

[!M N E _S!]_1
Increase the frequency of the sprinkler system inspection and testing program.

[!M N E _S!]_2
Enhance the current fire emergency response plan.

[!M N E _S!]_2
Expand the fire protection control valve inspection program.

[!M N E _S!]_2
Reduce roll paper storage height in the Building No. 1 press room.

[!M N E _S!]_3
Eliminate/minimize storage or improve sprinkler protection in Building No. 12.
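The “Building No.” truncation happens because the header is cut at the first period. Since the header is always the line right below the marker, one option that stays within plain string manipulation is to cut at the line break instead. A rough sketch in Python (the function name header_after_marker is hypothetical, and it assumes the header never wraps onto a second line):

```python
def header_after_marker(text, marker):
    """Return the first non-empty line that follows `marker`, or '' if absent."""
    _, found, rest = text.partition(marker)
    if not found:
        return ""  # marker not present in the text
    for line in rest.splitlines():
        if line.strip():
            return line.strip()
    return ""

# Sample copied from the post above.
rec1 = (
    "[!M N E _S!]_1\n"
    "Eliminate plastic crate storage along west wall of Building No. 16 or upgrade sprinkler protection.\n"
    "Eliminate plastic crate storage or upgrade sprinkler protection in accordance with FM Global Property\n"
)

print(header_after_marker(rec1, "[!M N E _S!]_1"))
```

In a UiPath expression the equivalent idea would be to split on line breaks rather than on "."c, then take the first non-empty element.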

If anyone ever runs into something similar:

System.Text.RegularExpressions.Regex.Match(strText,"(?<=\[!M N E _S!\]_\d)[\r\n]+(.*)").Value.Trim
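To pull every header out of a report in one pass (the list is of variable length), the same idea can be applied with a find-all call. Below is a small Python sketch of that, not the UiPath expression itself; extract_headers is a made-up name, and it uses capture groups instead of a lookbehind so that the marker number comes back alongside the header. Note that the square brackets in the marker have to be escaped, otherwise the regex engine reads them as a character class.

```python
import re

# Marker such as "[!M N E _S!]_2", then a line break, then the header line.
HEADER_PATTERN = re.compile(r"\[!M N E _S!\]_(\d+)[ \t]*[\r\n]+[ \t]*(.+)")

def extract_headers(text):
    """Return (marker number, header line) pairs in document order."""
    return [(int(num), header.strip())
            for num, header in HEADER_PATTERN.findall(text)]

# Headers copied from the samples above; the second uses Windows line endings.
sample = (
    "[!M N E _S!]_2\n"
    "Reduce roll paper storage height in the Building No. 1 press room.\n"
    "Further detail text.\n"
    "[!M N E _S!]_3\r\n"
    "Eliminate/minimize storage or improve sprinkler protection in Building No. 12.\r\n"
)

for number, header in extract_headers(sample):
    print(number, header)
```

In .NET, Regex.Matches with the same pattern string should give the equivalent result, with the header in Groups(2) of each match.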
