Finding text


#1

Hello guys,

Can you please help me with OCR and finding text?

Let's say I scan a PDF and want to find some information such as a date, an ID and so on. What is the best practice?

So far I use Substring. I convert the OCR output to a String, find the IndexOf of a keyword and take a substring from there, for example ID = OCRoutput.Substring(IndexOfID+2, 10), since I know the ID's length is 10. What if we don't know the exact length of the number? Is there a way to use Substring so that, instead of a fixed length, it stops at the first space?

Thanks a lot.


#2

Hi,
If you don’t want to use indexing and Substring, you could use relative scraping for each field (date, ID and so on).


#3

But every PDF is different, so that can be a problem. Is there no way to make the substring end at the first space?


#4

Yep, if it's dynamic then relative scraping won’t work.
You can split the string on spaces instead:
str.Split(' ');
or take everything after the first space:
string newString = myString.Substring(myString.IndexOf(' ') + 1);
For reference.
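
To stop at the first space after a keyword, the IndexOf and Substring calls can be combined. This is only a rough sketch: the class name OcrText, the method ValueAfter, the labels "ID:" and "Date:" and the sample text are all made up for illustration, and real OCR output will likely need extra cleanup.

```csharp
using System;

static class OcrText
{
    // Hypothetical helper: returns the token that follows `label` in `text`,
    // ending at the next whitespace character or at the end of the text.
    // Returns an empty string when the label is missing or nothing follows it.
    public static string ValueAfter(string text, string label)
    {
        int labelPos = text.IndexOf(label, StringComparison.Ordinal);
        if (labelPos < 0) return "";

        // Skip any whitespace between the label and the value.
        int start = labelPos + label.Length;
        while (start < text.Length && char.IsWhiteSpace(text[start]))
            start++;

        // Advance to the first whitespace after the value (space, tab, newline).
        int end = start;
        while (end < text.Length && !char.IsWhiteSpace(text[end]))
            end++;

        return text.Substring(start, end - start);
    }
}
```

With the made-up text "Invoice ID: 12345678 Date: 2020-01-31", OcrText.ValueAfter(text, "ID:") returns "12345678" and OcrText.ValueAfter(text, "Date:") returns "2020-01-31", without knowing either length in advance. Checking for any whitespace rather than only ' ' is a deliberate choice, since OCR text often has line breaks right after a value.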


#5

Thanks a lot. It partially helped.