How to remove Unicode characters while extracting data from a PDF

Jai_Pande · November 16, 2023, 8:23pm

Hi All,

I am extracting data from a PDF using Document Understanding. When I use the Form Extractor, it also extracts Unicode box characters (the same ones shown when you create a template in the Form Extractor, just below the signature fields dropdowns). Kindly help, as these blocks are also extracted from the PDF and written to Excel.

@ppr @supermanPunch @Palaniyappan @Gokul001 @postwick

Regards
Jai

supermanPunch · November 16, 2023, 8:33pm

Hi @Jai_Pande ,

Is it possible for you to provide an example string from the extracted data so that we can validate it on our end as well?

Also, could you check the regex below:

[^\u0000-\u007F]+

We would need to perform a Regex Replace operation and replace the matched Unicode (non-ASCII) characters with an empty string.
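As a rough illustration of that Regex Replace step, here is a minimal Python sketch. The function name `strip_non_ascii` and the sample string are assumptions made for the example, not taken from the actual extraction output. The same pattern should also be usable with .NET's `Regex.Replace` inside a workflow.

```python
import re

# Pattern from the post above: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\u0000-\u007F]+")


def strip_non_ascii(text: str) -> str:
    """Replace every run of non-ASCII characters with an empty string."""
    return NON_ASCII.sub("", text)


# Hypothetical extracted value with box characters (U+25A1) between letters.
sample = "A\u25a1m\u25a1i\u25a1l\u25a1y"
print(strip_non_ascii(sample))
```

Note that this pattern removes every non-ASCII character, so legitimate accented letters in names would be dropped as well.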

Jai_Pande · November 16, 2023, 8:38pm

Hi @supermanPunch ,

Credit Card Application - Amily.pdf (990.0 KB)

Jai_Pande · November 16, 2023, 8:39pm

It involves the box characters in the name fields.

Jai_Pande · November 16, 2023, 8:43pm

It's coming out like this:

supermanPunch · November 16, 2023, 8:51pm

@Jai_Pande ,

As suggested, you could try the Regex Replace approach mentioned above and let us know whether it works on the extracted data in Excel.

For the extraction part, I do not think more control can be provided to remove or exclude the square boxes, as they also appear in between the strings. So post-processing may be the better choice, where you perform String/Regex manipulations to remove the unwanted characters.
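A narrower post-processing step could look like the Python sketch below. It assumes the unwanted characters fall in the Unicode Box Drawing (U+2500 to U+257F), Geometric Shapes (U+25A0 to U+25FF), or ballot box (U+2610 to U+2612) ranges. That is a guess until the real extracted string is inspected, and `clean_field` is a hypothetical helper name.

```python
import re

# Assumed ranges for the unwanted boxes: Box Drawing, Geometric Shapes, ballot boxes.
BOX_CHARS = re.compile(r"[\u2500-\u257F\u25A0-\u25FF\u2610-\u2612]")


def clean_field(value: str) -> str:
    """Drop box-like characters, then collapse leftover whitespace."""
    without_boxes = BOX_CHARS.sub("", value)
    return re.sub(r"\s+", " ", without_boxes).strip()


# Hypothetical name field: boxes between letters, accented letter preserved.
print(clean_field("J\u25a1o\u25a1s\u25a1\u00e9 "))
```

Unlike stripping all non-ASCII characters, this keeps accented letters in names and removes only the box glyphs.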

Jai_Pande · November 16, 2023, 8:55pm

@supermanPunch
Or can I also use the ML Extractor endpoint to extract just the names?

supermanPunch · November 16, 2023, 9:02pm

@Jai_Pande ,

It depends on the Document type being used and the nature/template of the document.

If an ML Extractor model for this form type is already available, then you should be able to use it. Otherwise, a custom DU model would need to be built (I would not go there unless a careful analysis of the document types/formats/templates has been done and all the constraints and fields are understood).

Jai_Pande · November 16, 2023, 9:06pm

@supermanPunch
Also, when I extract the boolean value from the form, it shows up on different lines. I want a single field that chooses between the two options, instead of having to define separate fields like this.

I want to define only a Card Type field, and the Form Extractor should show me which option was chosen.

It's showing like this:

supermanPunch · November 17, 2023, 5:11am

@Jai_Pande ,

When using the Form Extractor, I believe we need separate fields for each checkbox value and then mark/label the checkbox for the option that is selected.
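If separate checkbox fields are kept, the single Card Type value could still be derived afterwards in post-processing. The Python sketch below shows one way to do that. The option names `Gold` and `Platinum` and the helper `selected_option` are hypothetical and not taken from the attached form.

```python
from typing import Mapping, Optional


def selected_option(checkboxes: Mapping[str, bool]) -> Optional[str]:
    """Return the name of the single ticked checkbox, or None if ambiguous."""
    chosen = [name for name, checked in checkboxes.items() if checked]
    if len(chosen) == 1:
        return chosen[0]
    # Zero or several ticks: leave the decision to manual validation.
    return None


# Hypothetical extraction result: one boolean per checkbox field.
card_type = selected_option({"Gold": False, "Platinum": True})
print(card_type)
```

Returning `None` when zero or several boxes are ticked keeps ambiguous documents visible for review instead of silently picking one value.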

If you are labelling in Document Manager, where you keep the training data ready for the DU ML model, then you could define a single field for that checkbox and select the options instead of the checkbox mark. But we should also make sure to train it with enough samples so that the data is balanced across all sets of values.

Check the below docs :

Jai_Pande · November 17, 2023, 5:30am

@supermanPunch
Okay, so if I am using labelling, do I have to label only the checkbox fields or all the fields? I am able to extract all the data except for the checkboxes.

Jai_Pande · November 17, 2023, 5:39am

@supermanPunch

And how do I select labelling? I am getting a bit confused.

Jai_Pande · November 17, 2023, 5:43am

Request you to kindly arrange a short Teams session, @supermanPunch.

Thanks

Jai_Pande · November 17, 2023, 9:22am

@supermanPunch
