Find an approximate sub-string inside a large string

Hello everyone, I am trying to figure out how to do the following. Right now I read each PDF page from a file into a string variable and use .contains(“my string”) to make sure the page contains a certain string. But this only works when the PDF is scanned at a very high resolution. I want to be able to look through the page and find an approximation of “my string”: for example, if the OCR reader produced “my.string”, “mystr’ng”, or “/mystring”, I still want to accept it. I looked into the Levenshtein algorithm, but it compares the entire string pulled from the PDF to “mystring” instead of checking inside it. I could also try to write a regex, but that seemed difficult given the number of errors it has to account for. Is there an easier way to search a large string for an approximate substring match?

This sounds complex, so my regex idea might not work, but if your errors are simple, like in your example with only one character wrong, it should.

It could also work if your string is constant. The shorter the better, too :)

Let’s say the string you are looking for is the phrase “my string”. You could replace a single character with the wildcard ‘.’ (in a regex, ‘.’ matches any one character, whereas ‘*’ is a quantifier), and it will capture “my s/ring” or even “my str\ng”.

Try this regex out:
.y string|m. string|my .tring|my s.ring|my st.ing|my str.ng|my stri.g|my strin.

The regex is looking for any of these scenarios.
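Writing those alternatives by hand gets tedious for longer strings, so here is a minimal Python sketch that builds the same kind of pattern from any target string. The function name `one_wildcard_pattern` is made up for illustration, and the thread does not say which language or regex engine is in use, so treat this as an assumption rather than the exact setup.

```python
import re

def one_wildcard_pattern(target: str) -> re.Pattern:
    """Build an alternation in which each variant has one character replaced by '.'."""
    variants = []
    for i in range(len(target)):
        # Escape the literal parts so that characters such as '.' or '+'
        # in the target are matched literally.
        variants.append(re.escape(target[:i]) + "." + re.escape(target[i + 1:]))
    return re.compile("|".join(variants))

pattern = one_wildcard_pattern("my string")
for text in ["my s/ring", "my str\\ng", "my string", "mx strxng"]:
    print(text, "->", bool(pattern.search(text)))
```

Because the wildcard also matches the correct character, the exact string still matches. A string with two wrong characters, like the last one, does not.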

You could expand this further by also allowing some letters to be missing after the wildcard.

Not perfect, but it might be good enough for your scenario :)
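One possible reading of that expansion, sketched in Python: make the wildcard optional with `.?`, so each variant accepts either a wrong character or a dropped character at one position. The helper name `one_error_pattern` is hypothetical, and this still only tolerates a single error.

```python
import re

def one_error_pattern(target: str) -> re.Pattern:
    """Alternation where one position may hold any character, or be missing."""
    variants = []
    for i in range(len(target)):
        # '.?' accepts any single character in this position, or nothing at all.
        variants.append(re.escape(target[:i]) + ".?" + re.escape(target[i + 1:]))
    return re.compile("|".join(variants))

pattern_with_gaps = one_error_pattern("my string")
print(bool(pattern_with_gaps.search("my strng")))    # one letter dropped
print(bool(pattern_with_gaps.search("my str\\ng")))  # one letter replaced
```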


Thank you for this. Unfortunately, for this project I can't know the number of errors or where they occur, due to the nature of OCR readers. I ended up playing around with the Levenshtein algorithm some more and was able to create a program that looks for approximate matches, with a user-defined similarity threshold (a percentage up to 100) that determines how “fuzzy” a match is accepted. Unfortunately, this came at a large performance cost and with a couple of other issues. When possible, the best solution to poor OCR recognition is either better OCR software or a higher-resolution scan of the file being read.
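For reference, one way to check inside the page text without running Levenshtein on every possible window is a variant of the same dynamic program in which a match may start at any position in the text (often attributed to Sellers). The Python below is a minimal sketch of that idea, not the program described above. The names `fuzzy_distance` and `contains_fuzzy` and the default threshold of 0.8 are assumptions for illustration.

```python
def fuzzy_distance(pattern: str, text: str) -> int:
    """Smallest Levenshtein distance between pattern and any substring of text."""
    m = len(pattern)
    # prev[i] = fewest edits to align pattern[:i] with a substring of text
    # ending at the previous text position. Before any text is read, the
    # only option is to drop all i pattern characters.
    prev = list(range(m + 1))
    best = prev[m]
    for ch in text:
        # cur[0] stays 0 because a match may start at any position in text.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # characters match, or one is substituted
                cur[i - 1] + 1,      # pattern character missing from the text
                prev[i] + 1,         # extra character in the text
            )
        best = min(best, cur[m])
        prev = cur
    return best

def contains_fuzzy(pattern: str, text: str, min_similarity: float = 0.8) -> bool:
    """True if some substring of text is at least min_similarity close to pattern."""
    if not pattern:
        return True
    similarity = 1 - fuzzy_distance(pattern, text) / len(pattern)
    return similarity >= min_similarity

print(contains_fuzzy("mystring", "page text mystr'ng more text"))
print(contains_fuzzy("mystring", "completely unrelated"))
```

This makes a single pass over the page, with work proportional to the pattern length times the page length. Similarity here is measured relative to the length of the search string, which is one of several reasonable ways to turn an edit distance into a percentage.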

