PDF Scraping - String Manipulation / Regex help

rridlen · October 25, 2019, 2:17pm

Using Enterprise 19.4.4

I’m also not a developer, so I will likely need additional detail or a more thorough explanation of any suggestions.

I’m trying to scrape a list of variable length from a PDF. There’s a general pattern to the data that needs to be extracted.

Red boxes - data to extract

I have been unable to figure out a way to repeat the data extraction process for multiple reports through string manipulation. Searching through the forums, I saw someone recommend Regex to another user who was trying to break out structured text.

If anyone has a suggestion about how to best approach capturing this data it would be greatly appreciated.

Extracted text:

19-05-002
[!M N E _S!]_2
Reposition sprinkler heads in film room.
Currently, the sprinkler heads in the basement film room are several feet below the height of the ceiling.
The hazard with having the heads too far from the ceiling is that they will have a delayed activation since
heat rises. A contractor should be contacted to move the heads closer to the ceiling.
Loss Expectancies
(USD)
Exposure to Loss is approximately: 2,510,000 PD
Minimal BI
Exposure to Loss if Completed is approximately: 1,000,000 PD
Minimal BI
Cost Estimate: 8,000
Text Points Completion of only this recommendation will result in a Text score increase
of 1.71 Points.
Status text
[!MNE_E!]_2[#] 19-05-002
19-05-003
[!M N E _S!]_3
Document and improve the sprinkler control valve supervision program.
The following should be done as part of a regular automatic sprinkler control valve supervision program:
Weekly: Visually inspect all sprinkler control valves to verify that they are locked and in the open
position.
Quarterly: Test all sprinkler system waterflow alarms. Record the time to alarm.
These inspections and tests should be documented and saved for review by the next visiting
engineer.
Loss Expectancies
(USD)
Acting on this item would reduce the probability or severity of loss.
Exposure to Loss if Completed is approximately: Minimal PD
Minimal BI
Cost Estimate: 5,000
Text Points Completion of only this recommendation will result in a text score increase
of 3.07 Points.
Index: # / Account: # / Order ID: #
19-05-001
continued
FM Global Risk Report The Hearst Corporation
[ ! H D R ! ]
Status text
text
text
[!MNE_E!]_3[#] 19-05-003

rridlen · October 28, 2019, 3:47pm

Any guidance?

salladinne · October 28, 2019, 3:55pm

Hi @rridlen,

It’s hard to say what the pattern is, because you provided just one example. I understand that the headers “19-05-002”, “19-05-003” and the titles next to them, “Reposition sprinkler heads in film room.” and “Document and improve the sprinkler control valve supervision program.”, will always be in the scope of extraction, but what about numeric values like 1.71? Will they always appear in that kind of sentence, and will they appear only once per entry?

As for the headers, rather than using Regex and string manipulation on the scraped PDF text, maybe consider exploring the selector values of the headers in UI Explorer? It’s possible that there is some “header” parameter that will always extract the headers for you with a simple Get Text/Copy Text activity.

rridlen · October 28, 2019, 4:04pm

1.71 will vary from entry to entry, but a number will always be in the segment between “increase of” and “Points”. Based on the example I was provided it appears to only happen once, but that’s not necessarily a hard rule.

I’ll try UI Explorer and see if I can find a parameter.

Thank you.

rridlen · October 28, 2019, 4:23pm

@salladinne

No luck with the UI Explorer. There are no unique selectors for any elements in the PDF. It’s just viewed as one big document.

salladinne · October 29, 2019, 11:54am

@rridlen

OK, let’s try to solve the other part then.
If you scrape the entire content of the PDF document and the numeric value always sits between “increase of” and “Points”, you can use logic like the following:

Let’s say that BigString is the string holding the whole content of the PDF, and smallString will be your 1.71 or whatever other numeric value appears there.
Length, FirstIndex and SecondIndex are helper Int32 variables.

Assign:
FirstIndex = BigString.IndexOf("increase of") + "increase of".Length
SecondIndex = BigString.IndexOf("Points", FirstIndex)
Length = SecondIndex - FirstIndex

smallString = BigString.Substring(FirstIndex, Length).Trim

Not sure if it will work perfectly, but I used it on some test cases and it did. Give it a try.
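
For anyone comparing approaches, here is a minimal sketch of the same extraction written as a regular expression. It is in Python rather than a UiPath/VB.NET expression, and the function name `score_increases` is made up for illustration. One thing it accounts for: in the extracted text posted earlier in this thread, the line break falls between “increase” and “of”, so a literal search for "increase of" with a single space may not find the phrase. Matching any whitespace (`\s+`) between the words sidesteps that, and `findall` returns every occurrence rather than only the first.

```python
import re

# Matches "increase", any whitespace (including a line break), "of",
# a number such as 1.71, then "Points". The number is the captured group.
SCORE_PATTERN = re.compile(r"increase\s+of\s+(\d+(?:\.\d+)?)\s+Points")


def score_increases(big_string: str) -> list[str]:
    """Return every number found between 'increase of' and 'Points'."""
    return SCORE_PATTERN.findall(big_string)


if __name__ == "__main__":
    sample = (
        "Text Points Completion of only this recommendation will result "
        "in a Text score increase\n"
        "of 1.71 Points.\n"
    )
    print(score_increases(sample))
```

The same pattern string should be usable with System.Text.RegularExpressions.Regex.Matches in UiPath, although that has not been tested here.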

mmcruzRPA · October 29, 2019, 12:06pm

I suggest using Computer Vision; I think it’s a good solution for your case.

rridlen · November 26, 2019, 3:57pm

I managed to get the first few samples working completely with string manipulation. However, the newest round of samples introduced a new variation that I’m struggling to handle.

Currently:
RecommendationHeader = Split((Split(Rec1,"[!M N E _S!]_1")(1).ToString),"."c)(0).ToString

Rec1:
[!M N E _S!]_1
Eliminate plastic crate storage along west wall of Building No. 16 or upgrade sprinkler protection.
Eliminate plastic crate storage or upgrade sprinkler protection in accordance with FM Global Property
Loss Prevention Data Sheet 8-9, Storage of Class 1, 2, 3, 4 and Plastic Commodities. A less optimal
alternate option is to store the plastic crates within the footprint of the adjacent baled waste paper storage
area in the northwest corner of Building No. 16. The sprinkler system of this area has been reinforced
and is adequate.

RecommendationHeader is returning a value of Eliminate plastic crate storage along west wall of Building No.
I’m looking to get Eliminate plastic crate storage along west wall of Building No. 16 or upgrade sprinkler protection.

I checked the other samples, and the header for every recommendation always comes after an [!M N E _S!]_N marker (where N is a number), on the line immediately below it. Is there a good way to use RegEx to grab that line, or does someone have a recommendation for handling the “Building No.” issue?

Examples:
[!M N E _S!]_1
Increase the frequency of the sprinkler system inspection and testing program.

[!M N E _S!]_2
Enhance the current fire emergency response plan.

[!M N E _S!]_2
Expand the fire protection control valve inspection program.

[!M N E _S!]_2
Reduce roll paper storage height in the Building No. 1 press room.

[!M N E _S!]_3
Eliminate/minimize storage or improve sprinkler protection in Building No. 12.

rridlen · December 5, 2019, 4:52pm

If anyone ever runs into something similar:

System.Text.RegularExpressions.Regex.Match(strText,"(?<=\[!M N E _S!\]_\d)[\r\n]+(.*)").Value.Trim
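
For readers who want to try the idea outside UiPath, here is a minimal Python sketch of the same "take the line right after the marker" approach. The function name `recommendation_headers` is hypothetical. It uses a capture group instead of a lookbehind and returns the header for every marker in the text, not just the first match. Because it captures the whole line, a period inside the header (as in “Building No. 16”) does not cut it short.

```python
import re

# The start marker "[!M N E _S!]_<digits>", optional trailing spaces,
# one or more line breaks, then the whole next line as the captured group.
MARKER_LINE = re.compile(r"\[!M N E _S!\]_\d+[ \t]*[\r\n]+([^\r\n]+)")


def recommendation_headers(text: str) -> list[str]:
    """Return the line that follows each [!M N E _S!]_N marker."""
    return [m.group(1).strip() for m in MARKER_LINE.finditer(text)]


if __name__ == "__main__":
    rec = (
        "[!M N E _S!]_1\n"
        "Eliminate plastic crate storage along west wall of Building No. 16 "
        "or upgrade sprinkler protection.\n"
        "Eliminate plastic crate storage or upgrade sprinkler protection in "
        "accordance with FM Global Property\n"
    )
    for header in recommendation_headers(rec):
        print(header)
```

The end markers in the extracted text (for example [!MNE_E!]_2[#]) have no spaces between the letters, so this pattern skips them.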
