How to write a regex pattern to extract total amounts?

Hello again :slight_smile:

Friends, a quick question, please: I am trying to find a regex pattern that extracts the total amount from a paragraph obtained by OCR.

The problem I have is that I don’t know how to select the full amount.

The thousands separator is “,” and the decimal separator is “.”.

For example, here are three paragraphs (they will not be analyzed together, only one paragraph at a time):

**Paragraph 1**
"ITAU DEPOSITO CUENTA CORRIENTE MNA OF./191085-085B-T08766 OP-0685538 22/01/2022 Hora:09:37:33 FINANCIERA KATTE S.A. CODIGO DE CUENTA: 475-1656565-0-99 CCI 002475001896565 : IMPORTE DEPOSITADO: x
x*1,362.59"

**Paragraph 2**
“AGENTE ITAU EMPRESA MUILTISERVICIOS FECHA: 21/05/22 NAVARRO HORA: NO.OPE: 19:22:25 973386 H985108 -PAGO DE SERVICIOS GIRO/RUBRO: OTRAS EMPRESA: CTA. A KATTE ABONAR: S.A 1932425218548 COD. ID USUARIO: 44023852 NOMBRE: GARCIA CASTILLO TOMASA EN EFECTIVO DESCRIPCION PAGO DE PRESTAMOS FECHA VENCIM: 15/05/2022 IMPORTE CUOTA: 334,70 F/ CARGO FIJO: F/ 0.00 MORA: S/ 0.00 TOTAL CUOTA: F/ 334.70 CONISION: F/ 0.00 TOTAL A PAGAR: F/ 334.70”

**Paragraph 3**
“3:11 ☐ Vo)) O .ll 92% ☐ LTÉ X ☐ Financiera ITAU recibió: F/ 92.00 Martes 16 2022 Mayo 15:11 Destino Financiera KATTE S.a. Banco ITAU - 475 14895665 0 89 Moneda Pesos Origen Ahorro Pesos 193 39659874 0 76 Número de operación 44258549 III ☐ <”

The result should be:

For the first paragraph: 1,362.59
For the second paragraph: 334.70
For the third paragraph: 92.00

Hi @Lynx

Take a look here:
Paragraph 1 sample
Paragraph 2 Sample
Paragraph 3 Sample

One Pattern for all 3 paragraphs

I have used square brackets (a character class) to capture any run of “.”, “,” or digits 0-9.
Please review the parentheses (round brackets) to make sure the words inside them are constant across your vouchers.

Test it on some real samples using the above links and see what happens.
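
For anyone who cannot open the links, here is a minimal sketch of the idea, not the exact pattern behind them. The label words (`IMPORTE DEPOSITADO`, `TOTAL A PAGAR`, `recibió`) are my assumption, picked from the three samples in the question, and the helper name `extract_total` is hypothetical. It is written in Python so it is easy to try out; the constructs used are also available in the .NET regex engine that UiPath uses.

```python
import re

# Parentheses hold the constant label words (assumed from the three samples);
# square brackets capture any run of digits, "." or ",".
TOTAL = re.compile(r"(?:IMPORTE DEPOSITADO|TOTAL A PAGAR|recibió)\D*([.,0-9]+)")


def extract_total(text):
    """Return the amount that follows the first known label, or None."""
    match = TOTAL.search(text)
    return match.group(1) if match else None


# Shortened excerpts of the three OCR paragraphs from the question.
p1 = "CCI 002475001896565 : IMPORTE DEPOSITADO: x\nx*1,362.59"
p2 = "IMPORTE CUOTA: 334,70 F/ CARGO FIJO: F/ 0.00 TOTAL CUOTA: F/ 334.70 TOTAL A PAGAR: F/ 334.70"
p3 = "Financiera ITAU recibió: F/ 92.00 Martes 16 2022 Mayo 15:11"

for paragraph in (p1, p2, p3):
    print(extract_total(paragraph))
# 1,362.59
# 334.70
# 92.00
```

The `\D*` between the label and the amount skips whatever non-digit noise the OCR leaves there (“: x x*”, “: F/ ”). The downside is that it will happily skip a lot of text if no amount follows the label closely, so it is worth checking the captured value before using it.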

Cheers

Steve


Hi,

The following expression will return what you expect, for now. However, the second paragraph contains several amounts (“334.70”, “0.00”, “0.00” and “334.70”), and this returns the first “334.70”. Is that what you intend?

System.Text.RegularExpressions.Regex.Match(yourString,"\d[,\d]+\.\d+").Value
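
To see which amounts this pattern actually picks up, here is a quick check in Python on shortened excerpts of the three paragraphs. The pattern is copied unchanged; `first_amount` is just a helper name made up for the example. One thing worth knowing: `[,\d]+` needs at least one character between the leading digit and the decimal point, so on the text as pasted above “0.00” is skipped, and “334,70” is skipped as well because it has no “.”. By the same logic a single-digit amount such as “5.00” would be missed.

```python
import re

# Same pattern as in the .NET expression above.
AMOUNT = re.compile(r"\d[,\d]+\.\d+")


def first_amount(text):
    """Return the first substring that looks like an amount, or None."""
    match = AMOUNT.search(text)
    return match.group(0) if match else None


# Shortened excerpts of the three OCR paragraphs from the question.
p1 = "IMPORTE DEPOSITADO: x\nx*1,362.59"
p2 = ("IMPORTE CUOTA: 334,70 F/ CARGO FIJO: F/ 0.00 MORA: S/ 0.00 "
      "TOTAL CUOTA: F/ 334.70 CONISION: F/ 0.00 TOTAL A PAGAR: F/ 334.70")
p3 = "Financiera ITAU recibió: F/ 92.00 Martes 16 2022 Mayo 15:11"

print(first_amount(p1))        # 1,362.59
print(AMOUNT.findall(p2))      # ['334.70', '334.70']
print(first_amount(p3))        # 92.00
print(first_amount("F/ 5.00")) # None (only one digit before the point)
```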

Regards,


Yes @Yoichi, I should always keep the first value that the regex extracts. I extract data with OCR from 10 types of vouchers, and the different types are giving me problems.

Hey @Lynx

Running regex on OCR output will never be 100% reliable. You need to cater for this in your solution design.

Take a look at this pattern. I have amended it to make it less fragile (hopefully) around some of the characters that are harder for OCR to pick up. I would also enable the case-insensitive regex setting.
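
Since the amended pattern itself is only behind the link, here is a rough sketch of what “less fragile” could look like. It is an illustration under my own assumptions, not the actual pattern: the confusable characters (`O`/`0`, `I`/`1`/`l`, an accented `ó` read as any character), the label list and the name `extract_total_tolerant` are all hypothetical. Python is used for easy testing; character classes, `\s+` and the ignore-case option work the same way in .NET.

```python
import re

# Label words with character classes where OCR often confuses glyphs,
# and \s+ instead of a literal space to survive extra or odd whitespace.
LABELS = (
    r"[I1l]MP[O0]RTE\s+DEP[O0]S[I1l]TAD[O0]"
    r"|T[O0]TAL\s+A\s+PAGAR"
    r"|REC[I1l]B[I1l]."  # "." stands in for the accented "ó"
)
TOLERANT = re.compile(r"(?:" + LABELS + r")\D*([.,0-9]+)", re.IGNORECASE)


def extract_total_tolerant(text):
    """Return the amount after the first (possibly misread) label, or None."""
    match = TOLERANT.search(text)
    return match.group(1) if match else None


# Made-up OCR misreads of the labels from the question.
print(extract_total_tolerant("lMPORTE DEP0SITADO: x*1,362.59"))    # 1,362.59
print(extract_total_tolerant("T0TAL  A PAGAR: F/ 334.70"))         # 334.70
print(extract_total_tolerant("Financiera ITAU recibio: F/ 92.00")) # 92.00
```

The trade-off is that every loosened character makes a false match a little more likely, so it is probably best to loosen only the characters you have actually seen misread on real vouchers.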

Hopefully this helps.

Cheers

Steve

Thank you very much @Steven_McKeering for the update; you are right about the OCR and regex issue. The OCR I use is “UiPath Document OCR”, but the strange thing is that it swaps the order of words. For example, the following is printed on the voucher:

“Pago De Prestamos”

But the OCR returns:

“De Pago Prestamos”

Still, the regex helps a lot. Thank you all very much! :slight_smile:

