Hi,
I’m working on a project where I need to extract text from PDF files using UiPath Document OCR. Each PDF has a varying number of pages, and every page contains a header and footer section with repeated content that I want to exclude from the final extracted text.
Since I’m using OCR, the output is unstructured and doesn’t retain the page layout clearly. What would be the best approach to identify and remove the header and footer content during or after extraction? Any suggestions or best practices would be appreciated.
If the header/footer content is static, use a RegEx to replace it with an empty string. For dynamic headers/footers, I don't see a cleaner or more reliable approach.