Wrong values returned for native PDF files

Hello!

I am following this guide:

Actually, I am even using the attached example to see how well these new features work on some of our own sample invoices. Some of the invoices are scanned, but most are native PDF files, where you can directly copy-paste any text you want. I understand that there may be low confidence and recognition mistakes for scanned copies, but how is it possible to return a wrong value from a native PDF file?

So, in the attached workflow, the Digitize Document activity returns the text into a text variable. That variable holds the correct values, but after the Machine Learning Extractor runs, the Validation Station pops up with a wrong value. Please see the attached screenshots.

Is there a way to fix it?

Thanks!

@cubx, I have the same issue. Did you get any exceptions while validating in the Validation Station, such as a PDF page limit?

Also, how do you handle exceptions from the Validation Station? My bot does not move forward if the Validation Station throws an exception.

Hey @harsha_vardhan!

I believe you can upload a maximum of 2 pages at the moment; it was mentioned somewhere in the description of the service, so I haven’t tried uploading more than that. For the PDF files I uploaded I never got any exceptions, and I am not sure why your bot isn’t moving forward if you are handling the exception correctly. Do you use the Validation Station activity within a Try Catch activity?

The only exception the Validation Station raised for me is when you close the window by clicking the X button, which is easy to handle.
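
Conceptually, the pattern is just a try/catch around the validation step. Here is a rough Python-style sketch of that idea (the function and exception names are hypothetical placeholders, not UiPath APIs; in the real workflow this is simply the Present Validation Station activity placed inside a Try Catch activity):

```python
# Conceptual sketch only: in the real workflow this is a Try Catch activity
# around Present Validation Station. All names below are hypothetical
# placeholders, not UiPath APIs.

class UserCancelledException(Exception):
    """Raised when the user closes the Validation Station via the X button."""

def present_validation_station(extraction_results):
    # Stand-in for the Present Validation Station activity.
    raise UserCancelledException("User closed the window")

def validate_document(extraction_results):
    try:
        return present_validation_station(extraction_results)
    except UserCancelledException:
        # Handle the "closed with X" case so the bot keeps going,
        # e.g. flag the document for later review instead of failing.
        return None

print(validate_document({"invoice-no": "INV-001"}))  # -> None, bot continues
```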

Hello @loginerror! Sorry for tagging you, but I am not really sure how else to get this issue in front of the relevant person. I might be doing something wrong, or I might not be. But if it’s a bug, it’s worth reporting.

Hi @cubx,
This is not a bug, it’s expected. The thing is that even if the result from the digitization is correct, the Machine Learning Extractor does not use the text result from the digitization activity; it has its own text detection algorithm.
For situations like these you can use the Validation Station to correct the errors (I notice that the confidence for that field is 0%, which means that errors should be expected for that one).
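
As an illustration of how that confidence can be used, here is a minimal sketch of routing only low-confidence results to human validation. The data structure and threshold are hypothetical stand-ins, not the actual extraction result objects:

```python
# Hypothetical sketch: send a document to human validation only when some
# extracted field falls below a confidence threshold. The plain dicts stand
# in for the extractor's results; they are not UiPath types.

CONFIDENCE_THRESHOLD = 0.80

def needs_validation(extracted_fields):
    """Return True if any field's confidence is below the threshold."""
    return any(f["confidence"] < CONFIDENCE_THRESHOLD for f in extracted_fields)

fields = [
    {"name": "invoice-no", "value": "INV-001",  "confidence": 0.98},
    {"name": "total",      "value": "1,200.00", "confidence": 0.00},  # the 0% field
]

if needs_validation(fields):
    print("Send to Validation Station for manual correction")
else:
    print("Auto-process without human review")
```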


Hello @AdiPopa,

Thanks for the answer, but… even if this is not a bug, it just doesn’t sound right. If there is a way to get a 100% confident result straight from the document text, we expect to get exactly that.

To add a bit more: I have 10 sample invoices, 8 of which are native PDFs, and 2 of those 8 have this recognition error. :man_shrugging:

Hi @cubx ,
I understand your point of view, and we keep trying to improve the extraction accuracy.
However, this technical decision was made based on the fact that in most situations the ML extractor provides better results than the regular digitization. I see your situation as an exception, especially considering that, of your 10 test documents, only 2 have this issue, and only on a single field. As mentioned above, this is a situation where the Validation Station proves really handy :slight_smile:

Hello @AdiPopa,

2 out of 8 is 25%. That’s not such a small number.

I am not saying that the Validation Station is useless; I am saying that having correct results in the Validation Station makes the process much faster and less dull for the people involved. That’s what RPA is for, right? :slight_smile:

Anyway, I am asking whether it’s possible to combine both approaches: let the ML find the exact positions of the text to be extracted, but take the text itself from the regular digitization (a rough sketch of what I mean is below). This could add a lot of value to the whole process, because most invoices come as native PDF files these days.
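
For illustration, here is a minimal Python sketch of that idea, assuming PyMuPDF for reading the native text layer. The bounding box is a hard-coded placeholder standing in for the region an ML model predicted; it is not a real UiPath output:

```python
# Minimal sketch of the proposed combination, assuming PyMuPDF (pip install pymupdf).
# The bounding box below is a hypothetical placeholder for the region the ML
# model predicted for a field.
import fitz  # PyMuPDF

def text_at_ml_position(pdf_path, page_index, bbox):
    """Read the native PDF text inside an ML-predicted bounding box."""
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        rect = fitz.Rect(*bbox)  # (x0, y0, x1, y1) in PDF points
        # Clip the extraction to the predicted region: keep the ML's
        # localisation, but take the PDF's own (lossless) text.
        return page.get_text("text", clip=rect).strip()

# Hypothetical usage: the box the model predicted for the invoice number.
predicted_box = (400.0, 50.0, 560.0, 80.0)
print(text_at_ml_position("invoice.pdf", 0, predicted_box))
```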

Hello @cubx ,
Indeed, we should go for a combination of the approaches. We have some improvements related to this planned, such as improving the digitization algorithm and then also using its output for the ML extractor, so… keep an eye out for this :wink:


This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.