How to remove Unicode characters while extracting data from a PDF

Hi All,

I am extracting data from a PDF using Document Understanding. When I use the Form Extractor, it also extracts Unicode characters (the same ones that are shown when you create a template in the Form Extractor, just below the signature fields dropdowns). Kindly help, as these blocks are also being extracted from the PDF and written to Excel.

@ppr @supermanPunch @Palaniyappan @Gokul001 @postwick

Regards
Jai

Hi @Jai_Pande ,

Could you provide an example string from the extracted data so that we can validate it on our end as well?

Also, could you check the pattern below:

[^\u0000-\u007F]+

We would need to perform a Regex Replace operation that replaces the matched Unicode characters with an empty string.
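To illustrate what that replace does, here is a minimal sketch in Python rather than a UiPath activity. The function name and the sample value are assumptions made up for this example; only the pattern comes from the reply above.

```python
import re

# Pattern from the reply above: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\u0000-\u007F]+")

def strip_non_ascii(text: str) -> str:
    """Remove every run of non-ASCII characters, then collapse extra spaces."""
    cleaned = NON_ASCII.sub("", text)
    return " ".join(cleaned.split())

# Hypothetical extracted value: box characters (U+25A1) mixed into a name.
sample = "\u25a1A\u25a1m\u25a1i\u25a1l\u25a1y"
print(strip_non_ascii(sample))  # prints "Amily"
```

Note that this pattern removes every non-ASCII character, so accented letters in names would be dropped as well (for example, "José" becomes "Jos").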

Hi @supermanPunch ,

Credit Card Application - Amily.pdf (990.0 KB)

It involves the box characters in the name fields.

It's coming out like this:

@Jai_Pande ,

As suggested, you could try the Regex Replace approach mentioned above and let us know whether it works with the extracted data in Excel.

For the extraction part, I do not think more control can be provided to remove or exclude the square brackets, as they are also included in between the strings. So post-processing may be a better choice, where you could perform String/Regex manipulations to remove the unwanted characters.
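A post-processing step of this kind could look roughly like the sketch below, written in Python for illustration. It assumes the unwanted marks are either literal square brackets or characters from the Unicode "Geometric Shapes" block (U+25A0 to U+25FF); which characters the extractor actually emits is not confirmed in this thread, and the field names are hypothetical.

```python
import re

# Assumption: the unwanted marks are square brackets or characters from the
# Unicode "Geometric Shapes" block (U+25A0-U+25FF), which contains box glyphs.
BOX_MARKS = re.compile(r"[\[\]\u25A0-\u25FF]")

def clean_fields(fields: dict) -> dict:
    """Return a copy of the extracted fields with the box marks removed."""
    return {
        name: " ".join(BOX_MARKS.sub("", value).split())
        for name, value in fields.items()
    }

# Hypothetical extraction result, cleaned before it is written to Excel.
extracted = {"First Name": "[A][m][i][l][y]", "City": "\u25a1Z\u25a1\u00fc\u25a1rich"}
print(clean_fields(extracted))
```

Unlike a blanket non-ASCII filter, this narrower character class leaves accented letters such as "ü" untouched.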

@supermanPunch
Or can I also use the ML Extractor endpoint to extract just the names?

@Jai_Pande ,

It depends on the Document type being used and the nature/template of the document.

If an ML Extractor model for this form type is already available, then you should be able to use it. Otherwise, a custom DU model would need to be built (I would not go there unless a careful analysis of the document types/formats/templates has been done and all the constraints and fields are understood).

@supermanPunch
Also, when I extract the Boolean value from the form, it shows on separate lines. I want a single field that chooses between the two options, instead of having to define separate fields like this:

image

I want to define only the Card Type field, and the Form Extractor should show me which option was chosen.

image

It's showing like this:

@Jai_Pande ,

When using the Form Extractor, I believe we need to use a separate field for each checkbox value and then mark/label the checkbox for the option that is selected.
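If a single Card Type value is still wanted downstream, the separate checkbox fields could be merged after extraction. The sketch below is in Python for illustration only; the option names and the function are hypothetical, and it assumes each checkbox field comes back as a Boolean.

```python
from typing import Optional

def pick_checked_option(checkboxes: dict) -> Optional[str]:
    """Return the single checked option, or None if zero or several are checked."""
    checked = [option for option, is_checked in checkboxes.items() if is_checked]
    return checked[0] if len(checked) == 1 else None

# Hypothetical checkbox fields extracted as separate Boolean values.
card_type = pick_checked_option({"Visa": False, "MasterCard": True})
print(card_type)  # prints "MasterCard"
```

Returning None for ambiguous cases leaves room to route those documents to manual validation instead of guessing.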

If you are labelling in Document Manager, where you prepare the training data for the DU ML model, then you could define a single field for that checkbox and select the option instead of the checkbox mark. But we should also make sure that we train it with enough samples so that the data is balanced across all sets of values.

Check the docs below:

@supermanPunch
Okay, so if I am using labelling, do I have to label only the checkbox fields or all the fields? I am able to extract all the data except the checkboxes.

@supermanPunch

Also, how do I select labelling? I am getting a bit confused.

Could you kindly arrange a short Teams session, @supermanPunch?

Thanks

@supermanPunch