PDF extraction from unstructured format

I have several PDF files and need to extract the PO number from each one.
The label could be anything: PO number, voucher, number, or purchase order number. The PO number itself could be numeric only, alphanumeric, or include special characters. The PDF format is not fixed: the number of pages and the structure of the document may vary.
What are the different ways I can extract the PO number?

The only reliable solution is ABBYY FlexiCapture.

You could try the ABBYY OCR activity inside UiPath and use regex to catch the PO numbers. If you can’t form a regex, then it’s pretty much impossible. This method is less reliable because the extracted data could be in different formats.

Do the PO numbers always begin with “PO: 1234-ABC”?

Like I said, if they always came in the PO: format, it would be easy for us to extract them with regex.
But they could be in different formats, like:

  1. PO Number theactualnumber
  2. PO Number : theactualnumber
  3. PO number po date
     theactualnumber date
  4. purchase order number

The numbers could be next to the label or below it.
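
For the layouts listed above, here is a rough Python sketch of a label-anchored search over text that has already been extracted from the PDF (by OCR or otherwise). The label variants, the `find_po_number` name, the allowed characters in `TOKEN`, and the rule that a PO number contains at least one digit are all assumptions for illustration, not something confirmed in this thread. It looks for the value to the right of the label first, then on the next non-empty line.

```python
import re

# Assumed label variants; extend this as new document layouts turn up.
LABEL = re.compile(
    r"\b(?:purchase\s+order|po|voucher)\b\s*(?:(?:number|no)\b\.?|#)?\s*:?",
    re.IGNORECASE,
)

# Assumed value shape: letters/digits with a few inner special characters.
TOKEN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9/_.#-]*[A-Za-z0-9])?")


def first_value(text):
    """Return the first token that contains a digit, or None."""
    for tok in TOKEN.findall(text):
        if any(ch.isdigit() for ch in tok):
            return tok
    return None


def find_po_number(page_text):
    """Look right of the label first, then on the next non-empty line."""
    lines = page_text.splitlines()
    for i, line in enumerate(lines):
        match = LABEL.search(line)
        if not match:
            continue
        value = first_value(line[match.end():])
        if value is None:
            # Value may sit below the label (layouts 3 and 4).
            for nxt in lines[i + 1:]:
                if nxt.strip():
                    value = first_value(nxt)
                    break
        if value:
            return value
    return None


if __name__ == "__main__":
    print(find_po_number("PO Number : AB-123/45"))
    print(find_po_number("PO number   po date\nAB123   01/02/2020"))
```

This only works when the extracted text keeps the label and value in reading order. Multi-column pages where OCR interleaves the columns would need position-aware extraction instead.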

Yeah this is a problem that needs some advanced OCR. You can use anchoring and train it on multiple different formats. You cannot do that in UiPath OCR for now.

You can still build the regex method with a human verification at the end to make sure it is caught. I’m not sure if that will be beneficial enough for your problem though.
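
A minimal Python sketch of the regex-plus-human-verification idea, assuming the PDF text is already extracted. The pattern, the `Extraction` type, and the `extract_with_review` name are hypothetical. The rule here is simple: accept the result only when exactly one distinct candidate is found, and send everything else to a person.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: a label, an optional colon, then a value that
# contains at least one digit. \s* also spans a line break, so a value
# directly below the label can match too.
PO_PATTERN = re.compile(
    r"\b(?:purchase\s+order|po)\b\s*(?:number|no\.?|#)?\s*:?\s*"
    r"((?=[A-Za-z0-9/-]*\d)[A-Za-z0-9][A-Za-z0-9/-]*)",
    re.IGNORECASE,
)


@dataclass
class Extraction:
    value: Optional[str]
    needs_review: bool
    reason: str


def extract_with_review(text: str) -> Extraction:
    """Auto-accept a single unambiguous match; flag everything else."""
    candidates = sorted({m.group(1) for m in PO_PATTERN.finditer(text)})
    if len(candidates) == 1:
        return Extraction(candidates[0], False, "single match")
    if not candidates:
        return Extraction(None, True, "no match")
    return Extraction(None, True, "conflicting matches: " + ", ".join(candidates))


if __name__ == "__main__":
    print(extract_with_review("PO: 1234-ABC\nTotal 50"))
    print(extract_with_review("PO Number 111\nPO Number 222"))
```

Whether this pays off depends on how many documents end up flagged. If most of them need review, the automation saves little.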

Yeah, it won’t be beneficial. But is this a drawback only in UiPath, or in other automation tools too?

No RPA tools come with such an advanced OCR capability. I would say UiPath has the best built-in OCR capability for now.

I’ve used FlexiCapture, but there is no community edition and licenses are very expensive. It does integrate with UiPath (or any other tool), and it is a good solution if you can offset the cost with high volumes of documents (think thousands of documents per day).


@jimmy.joseph any idea about UiPath.IntelligentOCR.Activities? I tried using that for PDF extraction, but it didn’t work as expected.


You can refer to this post for the functionality.