How to classify different PDF documents

gregoryoffodum · October 9, 2021, 11:16am

Quick question: I have different receipts (all in the same format, native PDF), but the information is placed at different locations on each receipt. E.g., the receipt number is at the top in one and at the bottom in another.

I can use regex to get information from all the types of receipts. But how do I make the bot classify them to know which regex pattern to use for each document?

Would appreciate any help. Thanks

rahulsharma · October 9, 2021, 11:18am

If you have only a few variants of the PDF layout, you can create a rule that checks the type first. Find an anchor (a keyword or any string) that is unique to one type, and then, while processing, check whether that keyword is present or not.

gregoryoffodum · October 9, 2021, 11:25am

okay. use an if statement to differentiate the documents?

rahulsharma · October 9, 2021, 11:28am

I meant you can analyze files with different layouts and see what can be used to identify each particular layout; there will be something unique to that document. You can use that unique keyword in an If statement to identify the layout type, and then have a dedicated extraction flow for that layout.

Hope it helps. Otherwise, if you can share a couple of PDFs with different layouts, that would help us suggest something better.

gregoryoffodum · October 9, 2021, 11:34am

okay, thank you. let’s use this sample for instance:

It's actually a PDF document; I just took a snapshot. The regex pattern I used to extract the receipt number: (?<=Receipt No\s+).*

Say “INVOICE” is “RECEIPT” in another document, or “Receipt No” is “Invoice No”. How do I go about it?

rahulsharma · October 9, 2021, 11:40am

you got it right!

In this case, Invoice and Receipt are in the top right, so when you read the text they will be the first to be extracted. You can see that Invoice and Receipt Number will be on the first two lines, at the start.

What you can do is have an If condition that checks whether these two keywords are at the beginning of the extracted text, in the first two lines.
Just to give you an idea, it should be something like this:

Remove the blanks from the text by using yourExtractedText.Trim
Use the function yourExtractedText.StartsWith("Invoice" + Environment.NewLine + "Receipt")
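
For anyone who wants to try the idea on raw text outside UiPath, here is a rough Python rendering of the same check. It is only a sketch: the function name `starts_with_headers` and the sample string are made up for illustration, and the real keywords depend on what the PDF text extraction actually returns.

```python
def starts_with_headers(extracted_text, first="Invoice", second="Receipt"):
    """Return True when the first two non-blank lines start with the given keywords."""
    # Trim the text, then also skip blank lines (an extra step beyond plain Trim).
    lines = [ln.strip() for ln in extracted_text.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return False
    # Case-insensitive comparison is an assumption, so "INVOICE" and "Invoice" both match.
    return (lines[0].lower().startswith(first.lower())
            and lines[1].lower().startswith(second.lower()))


# Hypothetical extracted text, not taken from the real sample document.
sample = "\n  INVOICE\nReceipt No   00123\nTotal 50.00\n"
print(starts_with_headers(sample))
```

Comparing line by line avoids depending on whether the extractor emits `\r\n` or `\n` between the two header lines.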

gregoryoffodum · October 9, 2021, 11:51am

okay will try this now. Although I used regex to extract the fields

kumar.varun2 · October 9, 2021, 4:00pm

Hi @gregoryoffodum

You can simply use an OR (pipe symbol) condition in the regex pattern.

e.g., for the ‘Invoice No:’ and ‘Receipt No:’ you can use

((?<=Receipt No:\s+).*)|((?<=Invoice No:\s+).*)

So whichever label appears, it will capture the receipt number or invoice number.

If there is any other variation of the invoice number label, you can add it using the pipe symbol.

You won’t need to classify the documents.
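
To experiment with the alternation idea outside UiPath, a minimal Python sketch follows. Note one difference: the .NET pattern above relies on a variable-width lookbehind, which Python's `re` module does not accept, so this version uses a capturing group instead. The name `extract_number` and the sample strings are illustrative assumptions.

```python
import re

# One pattern for both label variants; the value is captured in group 1.
# The optional colon and the space/tab run tolerate "Receipt No: 1" and "Receipt No   1".
NUMBER_PATTERN = re.compile(r"(?:Receipt|Invoice) No:?[ \t]*(.+)")


def extract_number(text):
    """Return the receipt/invoice number, or None when neither label is present."""
    match = NUMBER_PATTERN.search(text)
    return match.group(1).strip() if match else None


# Hypothetical extracted text for the two label variants.
print(extract_number("Receipt No: 00123\nTotal 50.00"))
print(extract_number("INVOICE\nInvoice No   A-77\n"))
```

Further label variants can be added inside the `(?:...)` group, separated by `|`.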

Pratik_Wavhal · October 9, 2021, 7:08pm

Hi @gregoryoffodum

If possible, please share those sample PDF formats so that we can provide an exact solution for all the data patterns.

Happy Automation

Best Regards
Er Pratik Wavhal

geetishree.rao · October 10, 2021, 3:02am

Dear Gregory,

You can use Document Understanding with Form Extractors and the Keyword Based Classifier.
The Taxonomy Manager helps you build different structures based on the various document types.

The classifier helps you classify and detect various documents based on the keywords you specify for each document type, in your case Invoice/Receipt.

You can give this approach a try; it is much better, more reliable, and extensible.

You can find material on Document Understanding from UiPath Academy / Docs, Anders Jensen, Lahiru Fernando, and ExpoHub.

Thanks,
Geetishree Rao

gregoryoffodum · October 10, 2021, 11:20am

Thanks. Yes, I know about document understanding but it’s a bit too complex for this use case.

gregoryoffodum · October 10, 2021, 11:22am

Okay. Here are the 3 types. I'm trying to use regex, but the issue is how to identify which pattern to use for each type. Thanks

invoice2.pdf (85 KB)
invoice3.pdf (85.5 KB)
1.pdf (183.4 KB)

gregoryoffodum · October 10, 2021, 11:23am

Thanks, will try this now

geetishree.rao · October 11, 2021, 4:37am

Dear Gregory,

You can differentiate the documents based on the terms below:

Extract the general text, and then in an If or Switch (Switch is the better option since there are 3 cases),
search for the 3 texts below and apply the matching regex to extract the required field data for each invoice type:

Case 1. “Tax Matters” (only present in invoices of type invoice3.pdf)
Case 2. “Tax Invoice” (only present in invoices of type invoice2.pdf)
Case 3. “it pays to pay your taxes.” (only present in invoices of type 1.pdf)
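
The three cases above can be sketched as a small keyword lookup. This is a Python illustration of the logic rather than a UiPath workflow; the function name `classify_layout`, the "Unknown" fallback, and the case-insensitive matching are assumptions added for the example.

```python
# (keyword, label) pairs from the three cases; the labels are arbitrary names.
LAYOUT_KEYWORDS = [
    ("Tax Matters", "TaxMatters"),               # invoice3.pdf
    ("Tax Invoice", "TaxInvoice"),               # invoice2.pdf
    ("it pays to pay your taxes.", "PayTax"),    # 1.pdf
]


def classify_layout(extracted_text):
    """Return the label of the first keyword found, or "Unknown" if none match."""
    lowered = extracted_text.lower()  # assumption: ignore case when matching
    for keyword, label in LAYOUT_KEYWORDS:
        if keyword.lower() in lowered:
            return label
    return "Unknown"


# Hypothetical snippets standing in for extracted PDF text.
print(classify_layout("TAX INVOICE\nInvoice No: 42"))
print(classify_layout("Remember, it pays to pay your taxes."))
```

Keeping the keywords in one list means a fourth layout only needs one more entry, as long as its keyword does not appear in the other documents.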

Hope this helps.

Thanks,
Geetishree Rao

gregoryoffodum · October 11, 2021, 7:49am

Thanks. This should do it. I know I can use variable.ToString.Contains(“text to be found”).

I can do it with IF statements, pls how do I use it with switch activity?

geetishree.rao · October 11, 2021, 8:17am

Dear Gregory,

Attaching an XAML file depicting Switch usage for your problem.

Please mark this as the solution if it helps you, so as to help others facing similar issues.

I have placed “text to be found” for Case 3; please update that to “it pays to pay your taxes.”
ForumRegexInvoice.xaml (6.7 KB)

Thanks,
Geetishree Rao

gregoryoffodum · October 11, 2021, 9:21am

okay. Trying to understand the last 2 components of this expression if(strVar.Contains(“Tax Matters”),“TaxMatters”,strVar) i.e “TaxMatters” and strVar. I assume I’m creating separate strVar variables for each case too.

geetishree.rao · October 11, 2021, 9:27am

Dear Gregory,

Below is the explanation:
strVar is a string variable that is initialized to blank and then set to a label that classifies the document type.

Just replace the strVar inside the Contains check with your variable that holds the entire extracted PDF text:

Instead of
If(strVar.Contains("Tax Matters"),"TaxMatters",strVar)
use the following:
If(ExtractedTextVariable.Contains("it pays to pay your taxes"),"PayTax",strVar)

If(ExtractedTextVariable.Contains("Tax Invoice"),"TaxInvoice",strVar)

If(ExtractedTextVariable.Contains("Tax Matters"),"TaxMatters",strVar)
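
One way to read these three expressions, assuming each result is assigned back to strVar in turn, is sketched below in Python. The third argument keeps whatever label was set earlier when the keyword is not found, so only a matching keyword overwrites it. The function name `classify_chained` is illustrative.

```python
def classify_chained(extracted_text):
    """Mimic three successive assignments that each write back to the same variable."""
    label = ""  # plays the role of strVar, which starts blank
    # Each line keeps the previous label unless its own keyword is present.
    label = "PayTax" if "it pays to pay your taxes" in extracted_text else label
    label = "TaxInvoice" if "Tax Invoice" in extracted_text else label
    label = "TaxMatters" if "Tax Matters" in extracted_text else label
    return label


# Hypothetical extracted text containing only the second keyword.
print(classify_chained("Tax Invoice\nInvoice No: 42"))
```

If no keyword is present the label stays blank, and if several are present the last matching line wins, which is why each keyword should be unique to one layout.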

Hope this clears up your confusion.

Thanks,
Geetishree Rao

gregoryoffodum · October 11, 2021, 9:38am

Thanks Geetishree. I get the .Contains method; it's just the “PayTax”, “TaxMatters”, and “TaxInvoice” strings I'm unsure about. I haven’t used the Switch activity before, so I’m just trying to understand that part. Thanks again

geetishree.rao · October 11, 2021, 9:51am

Dear Gregory,
These are just user-defined names that I created to classify the 3 file types.
These names are used in the Switch cases.
You can give them any name.
Moreover, you can handle this in whatever way is convenient for you if you are not comfortable with Switch.
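
As a rough picture of what the Switch does with those names, here is a Python sketch in which a dictionary lookup stands in for the Switch cases. The three regex patterns are placeholders invented for the example, not the patterns from the attached XAML, and `extract_for_layout` is a hypothetical name.

```python
import re

# Hypothetical per-layout patterns keyed by the classification label.
# Replace them with the patterns that fit each real document type.
PATTERNS = {
    "TaxMatters": r"Receipt No:?[ \t]*(.+)",
    "TaxInvoice": r"Invoice No:?[ \t]*(.+)",
    "PayTax": r"Receipt Number:?[ \t]*(.+)",
}


def extract_for_layout(label, text):
    """Pick the pattern for the label and return the captured value, or None."""
    pattern = PATTERNS.get(label)  # plays the role of the Switch cases
    if pattern is None:            # plays the role of the Default branch
        return None
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


# Hypothetical extracted text for a document already labeled "TaxInvoice".
print(extract_for_layout("TaxInvoice", "Tax Invoice\nInvoice No: 42"))
```

The label strings only need to agree between the classification step and the lookup; their spelling carries no other meaning.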

If this helped, could you mark this as the solution to help others facing similar issues.

Attaching an XAML file for your reference.
ForumRegexInvoice.xaml (11.1 KB)

Thanks,
Geetishree
