I have a PDF file that I scrape in its entirety into a text string. I then parse the data by label so that I can capture the field information from the PDF. Examples of captured fields: Company ID, Start Date, End Date, Address, Phone, Email, etc. For the most part, my string manipulation and capture process works flawlessly, as long as each label is unique.
Unfortunately, this PDF has 4 different labels all titled "Address: ". The problem I am running into is that even though I specifically identify where each “Address:” is located within the text stream, the robot stops at the first one and captures the wrong information. I have been using Google OCR to do relative scraping, but that process is unreliable at best.
I have searched through the community help and tried several things I found, but nothing addressed my specific issue. Currently, when I use
System.Text.RegularExpressions.Regex.Replace, it replaces every “Address:” label with “Address_1”.
What I would like to do (if possible) is iterate through the text stream, find the first instance of “Address:” and replace it with “Address_1”, the 2nd instance with “Address_2”, the 3rd instance with “Address_3”, and so on, until all 4 instances of the label have been renamed. That would give me a unique index by which my string capture can find the correct address to pull data from.
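To illustrate the numbering I have in mind, here is a minimal sketch in Python (not the .NET environment I am actually working in). The function name `number_labels`, the sample text, and the choice to keep the trailing colon are all assumptions made for the example. The idea is to pass a callback to the regex replace so that each match gets its own counter value. As far as I can tell, .NET's `Regex.Replace` has an overload that accepts a `MatchEvaluator` delegate, which should allow the same approach there.

```python
import re
from itertools import count


def number_labels(text, label="Address:", prefix="Address_"):
    """Rename each occurrence of `label` to prefix + running index.

    Hypothetical helper: the names and the kept trailing colon are
    assumptions for illustration only.
    """
    counter = count(1)  # yields 1, 2, 3, ... one value per match

    def replace(match):
        # Called once per match, in left-to-right order.
        return f"{prefix}{next(counter)}:"

    # re.escape makes the label match literally, not as a pattern.
    return re.sub(re.escape(label), replace, text)


# Made-up sample text standing in for the scraped PDF string.
sample = (
    "Company ID: 42\n"
    "Address: 1 Main St\n"
    "Phone: 555-0100\n"
    "Address: 2 Oak Ave\n"
    "Address: 3 Elm Rd\n"
    "Address: 4 Pine Ln\n"
)

print(number_labels(sample))
```

With the labels renamed this way, each of “Address_1:” through “Address_4:” is unique, so the existing label-based capture should be able to target them individually.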
Does anyone know if this can be done within a Text File?
Thank you in advance for your help!