How to Remove Header and Footer Text from OCR-Extracted PDF Using UiPath Document OCR?

Hi,
I’m working on a project where I need to extract text from PDF files using UiPath Document OCR. Each PDF has a varying number of pages, and every page contains a header and footer section with repeated content that I want to exclude from the final extracted text.

Since I’m using OCR, the output is unstructured and doesn’t retain the page layout clearly. What would be the best approach to identify and remove the header and footer content during or after extraction? Any suggestions or best practices would be appreciated.

Thanks in advance!

@jai_kumar2,

If the header/footer content is static, use a RegEx to replace it with an empty string. For dynamic headers/footers, I don't see a clean or reliable approach.

You can try the approach below:
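To illustrate the static case, here is a minimal sketch in Python (chosen only for brevity; it is not UiPath code). The header text and the footer pattern are hypothetical placeholders, so substitute whatever actually repeats in your PDFs. It removes whole lines that consist only of the header or the footer.

```python
import re

# Hypothetical values: replace with the text that repeats in your documents.
HEADER = "ACME Corp Confidential"      # assumed fixed header text
FOOTER_PATTERN = r"Page \d+ of \d+"    # assumed footer shape

# Match a full line that is only the header or the footer, plus its newline.
STATIC_LINE = re.compile(
    rf"^[ \t]*(?:{re.escape(HEADER)}|{FOOTER_PATTERN})[ \t]*$\n?",
    flags=re.MULTILINE,
)

def strip_static(text: str) -> str:
    """Return text with static header/footer lines replaced by nothing."""
    return STATIC_LINE.sub("", text)
```

In a UiPath workflow the same idea can be expressed with the .NET `System.Text.RegularExpressions.Regex.Replace` method and `RegexOptions.Multiline`.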

  • Split the OCR text with Environment.NewLine.
  • Remove the first and last items (i.e., the header and footer).
  • Join the array back together to get the whole OCR content.
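The split/trim/join steps above can be sketched as follows, again in Python as a language-neutral illustration. It assumes the text passed in is a single page and that the header and footer each occupy exactly one line; if either spans several lines, the slice bounds would need adjusting.

```python
def drop_first_and_last_line(page_text: str) -> str:
    """Remove the first and last line of one page of OCR text."""
    # strip() discards blank lines at the page edges so they are not
    # mistaken for the header or footer; splitlines() handles \n and \r\n.
    lines = page_text.strip().splitlines()
    # Drop the first item (header) and the last item (footer), then rejoin.
    return "\n".join(lines[1:-1])
```

Note that a page with fewer than three lines comes back empty under this approach.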

Here’s what you can try:

  1. Read PDF Page by Page
    Use Read PDF with OCR in a loop, one page at a time, to maintain page boundaries.
  2. Store Each Page’s Text
    Save text from each page into a list or array.
  3. Split Text into Lines
    Use Split(Environment.NewLine) to break each page into lines.
  4. Identify Repeating Header/Footer Lines
    • Check the top/bottom 1–3 lines of each page.
    • Track line frequency across all pages.
    • Lines that appear on most pages are likely headers/footers.
  5. Remove Repeated Lines
    Filter out lines that are common across pages.
  6. Recombine Cleaned Text
    Merge the non-header/footer lines from all pages into your final output.
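A rough sketch of the frequency-based steps above, written in Python for illustration only. It assumes you already have one OCR string per page. The function name, the `edge_lines` and `min_share` parameters, and the digit normalization (so that "Page 1 of 3" and "Page 2 of 3" count as the same line) are my own assumptions, not part of any UiPath activity.

```python
import re
from collections import Counter

def _key(line: str) -> str:
    # Assumption: collapse digits so page numbers do not hide a repeated footer.
    return re.sub(r"\d+", "#", line.strip())

def remove_repeated_edges(pages, edge_lines=3, min_share=0.6):
    """Drop top/bottom lines that recur on at least min_share of the pages."""
    # Split each page into non-empty lines.
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count on how many pages each edge line (normalized) appears.
    counts = Counter()
    for lines in split_pages:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({_key(ln) for ln in edges})

    # Require at least two pages so a single page never defines a header.
    threshold = max(2, min_share * len(pages))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        last_start = len(lines) - edge_lines
        kept = [
            ln for i, ln in enumerate(lines)
            # Only edge positions are eligible for removal.
            if not (_key(ln) in repeated and (i < edge_lines or i >= last_start))
        ]
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)
```

Restricting removal to the edge positions keeps a body line from being deleted just because it happens to match a header elsewhere. OCR noise can make the same header read slightly differently from page to page, so exact matching like this may miss some repeats.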