I need assistance, please! Currently, I have a 20-page compiled PDF file. The PDF consists of clients’ account termination forms. Each page belongs to a unique client. I need to extract 1) the Account Number (purely 6 or 7 digits) that appears after the text “Account No.” and 2) the Customer Identification Number (text in a standard format, e.g. S12367648D). Please assist me with the step-by-step process.
Read the PDF text, then split it by page using the NewPage marker or the fixed structure of the form. Use Matches to extract \d{6,7} after “Account No.” and S\d{7}[A-Z] for the ID. Loop through the pages, store the data in a DataTable, and write it to Excel.
Hi @prashant1603765. Thanks for the reply! Currently, the PDF is in an unstructured format (a scanned PDF). I am using the “Read PDF with OCR” activity with the “OmniPage OCR” engine. Can you tell me the names of the “Split” activity and the “Matches” activity?
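For extracting data, you can use the “Matches” activity to capture patterns such as \d{6,7} after “Account No.” and S\d{7}[A-Z] for the ID. Note that your example, S12367648D, has eight digits between the letters; if that is the real format, use S\d{8}[A-Z] instead.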
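To check the two patterns before wiring them into the “Matches” activity, here is a minimal sketch in Python. The sample text is made up, and the optional colon and flexible whitespace after “Account No.” are assumptions about how OCR might render the label, so adjust them to your actual output.

```python
import re

# Patterns from this thread. The optional colon and flexible whitespace are
# assumptions about the OCR output; (?!\d) rejects numbers longer than 7 digits.
ACCOUNT_RE = re.compile(r"Account\s+No\.?\s*:?\s*(\d{6,7})(?!\d)")
ID_RE = re.compile(r"\bS\d{7}[A-Z]\b")

def extract_fields(page_text):
    """Return (account_number, customer_id); either is None if not found."""
    account = ACCOUNT_RE.search(page_text)
    customer = ID_RE.search(page_text)
    return (
        account.group(1) if account else None,
        customer.group(0) if customer else None,
    )

# Made-up sample standing in for the OCR text of one page.
sample = "Account Termination Form\nAccount No. 1234567\nCustomer ID: S1234567D\n"
print(extract_fields(sample))
```

The capture group returns only the digits, so the “Account No.” label itself does not end up in the result. The same expressions should work in the “Matches” activity, since both regex flavours support these constructs.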
To summarize:
- Use “Read PDF with OCR” to get the text.
- Split the text into pages, either with an Assign activity (for example, a string split on the page delimiter) or with the “Matches” activity.
- Use “Matches” activity again to extract the required patterns (Account No. and ID).
- Store the results in a DataTable and write to Excel.
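The steps above can be sketched end to end in Python. This is only an illustration under stated assumptions: pages are assumed to be separated by a form-feed character, the sample text is invented, and CSV stands in for the Excel output. The helper names `build_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import re

ACCOUNT_RE = re.compile(r"Account\s+No\.?\s*:?\s*(\d{6,7})(?!\d)")
ID_RE = re.compile(r"\bS\d{7}[A-Z]\b")

def build_rows(full_text, page_delimiter="\f"):
    """Split the OCR text into pages and extract one row per page.

    The form-feed delimiter is an assumption; use whatever marker
    your OCR output actually places between pages.
    """
    rows = []
    for page_number, page in enumerate(full_text.split(page_delimiter), start=1):
        if not page.strip():
            continue  # skip empty pages, e.g. after a trailing delimiter
        account = ACCOUNT_RE.search(page)
        customer = ID_RE.search(page)
        rows.append({
            "Page": page_number,
            "AccountNo": account.group(1) if account else "",
            "CustomerID": customer.group(0) if customer else "",
        })
    return rows

def rows_to_csv(rows):
    """Serialize the rows to CSV text (a stand-in for writing to Excel)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Page", "AccountNo", "CustomerID"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented two-page sample separated by a form feed.
sample = (
    "Account No. 123456\nCustomer ID: S1234567D\n"
    "\f"
    "Account No. 7654321\nCustomer ID: S7654321A\n"
)
rows = build_rows(sample)
print(rows_to_csv(rows))
```

Pages where a pattern does not match keep an empty cell instead of being dropped, so OCR misreads show up in the output and can be checked by hand.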
If you found this helpful, please mark it as the solution.
Thanks