I used UiPath Document OCR to read text from PDF files. The result is pretty accurate, but there are no line breaks in the output: all the text is combined into one big chunk. How can I make sure the OCR also captures line breaks?
I have tried OmniPage OCR, screen OCR, and Tesseract OCR, but either the result is not accurate enough or it doesn’t read line breaks.
I have also tried using “Split Text” afterwards, but the text couldn’t be split using the “new line” separator.
My end goal is to provide a text file of the text extracted from the PDF. I have obtained the text file, but I would like it to follow the line breaks in the PDF.
Someone may know more than me, but from my experience, I haven’t had much luck preserving format with OCR.
If the document you are getting has a standard layout, then you could use either text manipulation (e.g. Left, Right, InStr) or regex to extract the data and then rebuild the file.
If this is an option and you aren’t familiar with these techniques, I would be happy to help.
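To give a rough idea of the regex approach, here is a minimal sketch in Python (the same pattern should carry over to a regex replace step in a workflow). It assumes every line in the PDF starts with a known label. The labels and sample text here ("Invoice No:", "Date:", "Total:") are made up for illustration, so you would swap in whatever your document actually uses. The function name `rebuild_lines` is also just a placeholder.

```python
import re

def rebuild_lines(flat_text, labels):
    """Re-insert line breaks before each known label in flat OCR text.

    `labels` is a hypothetical list of strings that mark the start of a
    line in the original PDF; adjust it to match the real document.
    """
    # Escape each label so characters like ':' or '.' are matched literally.
    alternation = "|".join(re.escape(label) for label in labels)
    # Replace the whitespace that precedes a label with a single newline.
    # The lookahead keeps the label itself in the output.
    return re.sub(r"\s+(?=" + alternation + r")", "\n", flat_text).strip()

# Made-up sample of OCR output with the line breaks lost.
flat = "Invoice No: 123 Date: 2020-01-01 Total: 45.00"
print(rebuild_lines(flat, ["Invoice No:", "Date:", "Total:"]))
```

This only works when the labels are separated from the preceding text by at least one space, and it will not help if the document has no repeating markers to key on.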