Regex Based Extraction Getting Garbage Data - DU

I have this data

I want to get the address data from it. I was thinking of using regex-based extraction, as it seems the best way to extract the exact data.

There are two scenarios:

First doc (same document, but the address field is at the bottom, with a bare minimum of 5 words to classify the doc)
Second doc (same document, but the address field is on the right side, with the ‘same’ 5 words to classify the doc)

Both docs have only 5 words (the same 5 words). With these, the doc gets identified as an Aadhaar card style document, but the address field does not get identified because its position changes.

The third issue is that the images are not in a fixed position: they may be slightly tilted, or the address may sit on the same axis but be zoomed in or out a little.

For these reasons the Form Extractor is out of the question, and so is the ML Extractor, as it is also not returning any data properly. I tried the IDCards endpoint for it.

So the solution I came up with was to use regex and extract the data after the label ‘Address’.
But the main issue I'm getting is with the OCR itself, for some of these docs.

I'm getting some garbage data in between. I also checked the text-only view, and it shows the same issue: the garbage data is there too.

My guess is that it may be because of the Hindi-language address that is printed next to the English address.

These are some regex expressions I used:

(?<=Address:\s).$
(?<=Address:\s).
(?<=\d{6})

Hi Sagar,

Can you write an example of what exactly you want to extract after “Address”, and what actually got extracted with the expressions you used?

Did you try OCR to check what the document structure looks like after digitization?

Yes, I have checked the data in the text-only section of the Form Extractor (just to understand what data it is getting from the OCR).

Here is some sample data from the OCR:

"4dT. C/O RTR aTTT21 0u PR 1676, ardc F aff Alfe

¿

R 2,& Zhieo8 a 9TR1, 1I Rcate olat Tifeuft raee 27, rlfguft 02t-1, BTR qfw4, fatl - 110085

Address : C/O Ram Asharey, H Number 1676, Ground Floor C Block PKt 2, Near DTU College, Re settlement colony Rohini Sector 27, Rohini Sector-7, North West Delhi, Delhi - 110085

------ Second Data ---- Below ------------

¿t FRT 469 y AADHAAR UNIOUE IDENTIFICATION AUTHORITY OF INDIA

4OT:

Address:

sreifit: AT 3cft, a- WO: Sajid Ali, B-59, room no-1, opp-sampurn store, Saidul

59, 14 T-1, ¿2 a ¿9, ¿ 3, A fmt, fwtt - 110030

Azaib, South Deihi, Delhi - 110030"

There are two data samples here. The first has correct data; the address part is proper:

“Address : C/O Ram Asharey, H Number 1676, Ground Floor C Block PKt 2, Near DTU College, Re settlement colony Rohini Sector 27, Rohini Sector-7, North West Delhi, Delhi - 110085”

And here is the other, which has improper or garbage data:

"Address:

sreifit: AT 3cft, a- WO: Sajid Ali, B-59, room no-1, opp-sampurn store, Saidul

59, 14 T-1, ¿2 a ¿9, ¿ 3, A fmt, fwtt - 110030

Azaib, South Deihi, Delhi - 110030"

I want only the data after the Address label.

For example:

“C/O Ram Asharey, H Number 1676, Ground Floor C Block PKt 2, Near DTU College, Re settlement colony Rohini Sector 27, Rohini Sector-7, North West Delhi, Delhi - 110085”

Hi @indiedev91

Can you try the following expression: (?<=Address\s:\s)(.+)

It's not working with this:

"t FRT 469 y AADHAAR UNIOUE IDENTIFICATION AUTHORITY OF INDIA

4OT:

Address:

sreifit: AT 3cft, a- WO: Sajid Ali, B-59, room no-1, opp-sampurn store, Saidul

59, 14 T-1, ¿2 a ¿9, ¿ 3, A fmt, fwtt - 110030

Azaib, South Deihi, Delhi - 110030"

The garbage data means that the OCR is not extracting correctly, right?
For the correct data, (?<=Address\s:\s)(.+) would work.
When you checked the data after OCR, were you getting it correctly?

Yes, that's what I said previously in the question. But my concern is also with the regex: if the requirement is to get any data available after this specific label, why is it not even getting the garbage data? After all, it is also data, right? It only works with proper data.
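One possible explanation, sketched in Python rather than in a UiPath workflow: in the second sample the label is written "Address:" with no space before the colon and is followed by a line break, so the lookbehind (?<=Address\s:\s) never finds its anchor, and "." does not cross line breaks by default (in .NET as well, unless the Singleline option is set). The sample strings are copied from this thread and assumed to match the real OCR layout; the name address_after_label and the tolerant pattern are illustrative only, not something the Regex Extractor provides.

```python
import re

# Samples copied from this thread; the exact line breaks are an assumption.
first = (
    "Address : C/O Ram Asharey, H Number 1676, Ground Floor C Block PKt 2, "
    "Near DTU College, Re settlement colony Rohini Sector 27, "
    "Rohini Sector-7, North West Delhi, Delhi - 110085"
)
second = (
    "Address:\n\n"
    "sreifit: AT 3cft, a- WO: Sajid Ali, B-59, room no-1, opp-sampurn store, Saidul\n\n"
    "59, 14 T-1, ¿2 a ¿9, ¿ 3, A fmt, fwtt - 110030\n\n"
    "Azaib, South Deihi, Delhi - 110030"
)

# Pattern suggested in the thread: needs exactly "Address", one whitespace,
# ":", one whitespace, then at least one character on the same line.
strict = re.compile(r"(?<=Address\s:\s)(.+)")

# Hypothetical tolerant variant: optional whitespace around the colon,
# and DOTALL so that "." also matches line breaks.
tolerant = re.compile(r"Address\s*:\s*(.+)", re.DOTALL)


def address_after_label(text):
    """Return everything after the Address label, or None if there is no label."""
    match = tolerant.search(text)
    return match.group(1).strip() if match else None


print(strict.search(second))        # None: the label has no space before ":"
print(address_after_label(second))  # garbage included, but it is captured
print(address_after_label(first))
```

Note that the tolerant variant captures everything up to the end of the text, garbage lines included, so the OCR noise still has to be handled separately.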

Try this:
(?<=Address\s:\s)([^\W_]+)
It is supposed to remove all the special characters.
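For reference, a small Python sketch of what that expression does on the clean sample: the class [^\W_] only allows letters and digits, so the match stops at the first other character (the "/" in "C/O") instead of stripping special characters from the whole address. One alternative, shown here as an assumption rather than a recommended fix, is to capture the full line first and remove unwanted characters in a second step; the set of characters kept in the cleanup pattern is an arbitrary choice for illustration.

```python
import re

line = "Address : C/O Ram Asharey, H Number 1676, Ground Floor C Block PKt 2"

# Suggested expression: matches a single run of letters and digits only.
first_token = re.search(r"(?<=Address\s:\s)([^\W_]+)", line).group(1)

# Capturing the whole line keeps the complete address.
full = re.search(r"(?<=Address\s:\s)(.+)", line).group(1)

# Second step (illustrative): drop characters outside an allowed set.
# Here letters, digits, underscore, whitespace, comma, slash and hyphen are kept.
raw = "59, 14 T-1, ¿2 a ¿9, ¿ 3, A fmt, fwtt - 110030"  # garbage line from the thread
cleaned = re.sub(r"[^\w\s,/-]", "", raw)

print(first_token)
print(full)
print(cleaned)
```

Removing stray symbols this way does not turn a misread line into a correct address; it only removes characters such as "¿".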

No, it's not working.

It seems that the format you posted is not what the text looks like after OCR/digitization. Did you take the output and copy it here in the same format?

How could I use AI Center to make a custom ML model that extracts the address from my data, no matter where its position is?

How could I do that? And does the labeling in Data Labeling only cover that specific area, the way the Form Extractor does in UiPath Studio, or does the labeling in AI Center just take those coordinates as a reference, not as a static or fixed position, and search for the data around and inside the selected area?

Is there any type of training where the AI model itself tries to identify data points and categorize the data on its own, without us giving it a schema of data points or coordinates?

If the second one is possible it would be really helpful, as I can provide plenty of the data that I have. I don't have any issue with correcting the data if the AI model gets it wrong.

Hi @indiedev91 ,

Looking at the type of documents shared, we might not be able to do a good extraction just using the Regex Extractor.

We would need to perform a custom data extraction by using the Document Understanding model, which is available out of the box.

There are many tutorial videos already available on it. You could refer to them and let us know if you are able to understand and perform it.