How to deal with overlapping text in a document?

Hi all, @Ioana_Gligan @Jeremy_Tederry
I’m facing a problem when extracting text from a document that contains overlapping text, e.g. a stamp near or on top of the text. In this scenario, what practice can be followed to extract the background text?
Also, what technique can be used to extract handwritten text from an input document?

Raheel Ahmed

Hi @raheelferoze !
If the document is not a native document (i.e. it is a scan or an image), then to read everything, including the stamp, you should use OCR.
To do so, you have several options:

  • if you have already bought an OCR engine (like ABBYY), you should plug it into UiPath
  • if you have not bought an OCR engine, you can try the PDF activities from UiPath, which are free, though I don’t know how good the output is.
    To find them, go to Manage Packages, then search for PDF activities under “All Packages”.
  • otherwise, you can use either a free OCR engine (like OmniPage), or the OCR provided by UiPath, which requires an API key (so it has to be linked with Orchestrator, and the number of pages it can read is limited)

→ if it’s a native PDF with a native stamp (like a timestamp), then you can use a simple read PDF activity (available in the UiPath PDF activities package)

Hi @Hiba_B, my scenario is a little different.
I’ve created a workflow following the Document Understanding Framework. In it, I’ve built a custom ML extractor on AI Center (using Data Manager and the out-of-the-box ML packages) that extracts information from different fields of a semi-structured document.

When the workflow starts, it receives input documents that are scanned PDFs. Some of those PDFs contain stamps that overlap the text, and in that case the bot retrieves null or wrong information.

Raheel Ahmed

Oh I see, then it’s specific to ML skills. My bad, I didn’t see the AI Center category.

Hello @raheelferoze ,

Unfortunately there is no clear way of “separating” text that is overlapping…
The only suggestion I can give you is to try a couple of different OCR engines - maybe one of them reports the characters in a better order / shape that can be interpreted closer to reality…
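
One way to organise that comparison, as a rough sketch rather than anything UiPath provides: run the same page through each engine and keep the first result that passes a sanity check. Everything here is hypothetical, including the `pick_best_result` helper, the stand-in engine callables, and the invoice-number pattern; in a real workflow the engines would be whichever OCR activities you have configured.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical validation rule: an invoice number such as "INV-12345".
INVOICE_PATTERN = re.compile(r"INV-\d{5}")


def pick_best_result(
    engines: Dict[str, Callable[[str], str]],
    document_path: str,
    pattern: re.Pattern = INVOICE_PATTERN,
) -> Optional[Tuple[str, str]]:
    """Return (engine_name, matched_value) for the first engine whose
    output contains a value matching the pattern, or None if none does."""
    for name, run_ocr in engines.items():
        text = run_ocr(document_path) or ""  # treat a null result as empty
        match = pattern.search(text)
        if match:
            return name, match.group(0)
    return None


# Stand-in engines for illustration only; replace with real OCR calls.
fake_engines = {
    "engine_a": lambda path: "",                        # returns nothing
    "engine_b": lambda path: "Invoice INV-1Z345 PAID",  # garbled by the stamp
    "engine_c": lambda path: "Invoice INV-12345 PAID",  # readable
}

result = pick_best_result(fake_engines, "scan.pdf")
print(result)  # ('engine_c', 'INV-12345')
```

The validation pattern does the real work here: it only helps for fields whose expected shape you can describe in advance.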

Are the stamps variable, or always the same content? Is there any chance you could write some rules to eliminate the extra characters? In cases where no value is extracted, it would be a bit tricky…
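
If the stamp content is mostly fixed, a rule along these lines might work as a post-processing step on the OCR text before extraction. It is only a sketch: the stamp phrases and the `strip_stamp_text` helper are made up for illustration, and it can only help when the OCR engine still returns some text, not when it returns nothing.

```python
import re
from typing import Iterable

# Hypothetical stamp wording; replace with the text on your actual stamps.
STAMP_PHRASES = ["RECEIVED", "ACCOUNTS DEPT"]


def strip_stamp_text(ocr_text: str, stamp_phrases: Iterable[str] = STAMP_PHRASES) -> str:
    """Remove known stamp phrases from OCR output and tidy the whitespace."""
    cleaned = ocr_text
    for phrase in stamp_phrases:
        # Allow any whitespace run between the words of a phrase,
        # since OCR often splits a stamp across lines.
        words = [re.escape(word) for word in phrase.split()]
        pattern = r"\s+".join(words)
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed phrases.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = "Total amount RECEIVED 1,250.00 Accounts\nDept USD"
print(strip_stamp_text(sample))  # Total amount 1,250.00 USD
```

This removes a phrase wherever it appears, so it is safest with stamp wording that would never occur in the real content of the document.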

Hello @Ioana_Gligan, the stamps aren’t really variable; they’re mostly the same, but they overlap the text in some cases and not in others. I’ve tried different OCR engines, but each one returns null results whenever there is any overlap, even when it is minimal. Just a slight touch of the stamp and the OCR returns a null value.

Raheel Ahmed

@Ioana_Gligan Also, is there any way to read handwritten text? I mean, can Google Vision or Microsoft Azure OCR read and understand handwritten text in input documents?