Parsing an unknown file type

Hi,

Is there a better way to parse this sample string to get display_name and content? To be honest, I’m not sure what text file format this is. Thank you.

payload {
  annotation_spec_id: "4174790675084083200"
  display_name: "VIN"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 94
      end_offset: 118
      content: "2402-0161C-E2777JLP10000"
    }
  }
}
payload {
  annotation_spec_id: "8786476693511471104"
  display_name: "Last_Name"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 138
      end_offset: 143
      content: "DELA CRUZ"
    }
  }
}
payload {
  annotation_spec_id: "5622803508399964160"
  display_name: "First_Name"
  text_extraction {
    score: 0.9790445566177368
    text_segment {
      start_offset: 144
      end_offset: 152
      content: "JUAN"
    }
  }
}
payload {
  annotation_spec_id: "310104060474687488"
  display_name: "Middle_Name"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 153
      end_offset: 159
      content: "CRUZ"
    }
  }
}
payload {
  annotation_spec_id: "3015078586664091648"
  display_name: "DOB"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 174
      end_offset: 186
      content: "January 01, 1981"
    }
  }
}
payload {
  annotation_spec_id: "6338594374175162368"
  display_name: "Status"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 200
      end_offset: 206
      content: "Single"
    }
  }
}
payload {
  annotation_spec_id: "427795785111830528"
  display_name: "Citizenship"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 219
      end_offset: 227
      content: "Spanish"
    }
  }
}
payload {
  annotation_spec_id: "153076207842230272"
  display_name: "Address"
  text_extraction {
    score: 0.9999884963035583
    text_segment {
      start_offset: 236
      end_offset: 276
      content: "INIGO BLUE EXT, BO, OBRRO ST,"
    }
  }
}
payload {
  annotation_spec_id: "9144618417003692032"
  display_name: "Precinct_Code"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 297
      end_offset: 302
      content: "0123C"
    }
  }
}

@iamthejuan It looks like a JSON string, but it is not valid JSON. However, you can treat it as plain text and use Regex methods to extract the required data, as in the link below:


@iamthejuan
A Regex anchored on the property name and the “:” may be sufficient.
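
A minimal Python sketch of that anchoring idea, using a shortened copy of the sample. It assumes the values contain no escaped quotes and that every payload has exactly one display_name and one content; the variable names are only illustrative.

```python
import re

# Shortened sample in the same shape as the text in the question.
sample = '''payload {
  annotation_spec_id: "4174790675084083200"
  display_name: "VIN"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 94
      end_offset: 118
      content: "2402-0161C-E2777JLP10000"
    }
  }
}
payload {
  annotation_spec_id: "8786476693511471104"
  display_name: "Last_Name"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 138
      end_offset: 143
      content: "DELA CRUZ"
    }
  }
}
'''

# Anchor on the property name and the colon, then capture the quoted value.
# Assumption: values never contain escaped double quotes.
names = re.findall(r'^\s*display_name:\s*"([^"]*)"', sample, flags=re.MULTILINE)
contents = re.findall(r'^\s*content:\s*"([^"]*)"', sample, flags=re.MULTILINE)

# Assumption: one display_name and one content per payload, in the same order.
fields = dict(zip(names, contents))
print(fields)
```

Pairing the two lists by position is the fragile part: if a payload ever lacks a content line, the pairs shift silently, so a per-payload match is safer for real data.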

Another approach is to fix the text up with Regex replaces:

  • split on payload
  • quote the property names and add the missing colons and commas, so that each block becomes valid JSON

Then you can process it as standard JSON.
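
The replace-based approach could look roughly like the Python sketch below. The helper name `block_to_json` and the three patterns are my own assumptions, written against the shape of the sample only; they expect one field per line and no blank lines inside a block.

```python
import json
import re

# Shortened sample in the same shape as the text in the question.
sample = '''payload {
  annotation_spec_id: "4174790675084083200"
  display_name: "VIN"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 94
      end_offset: 118
      content: "2402-0161C-E2777JLP10000"
    }
  }
}
payload {
  annotation_spec_id: "8786476693511471104"
  display_name: "Last_Name"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 138
      end_offset: 143
      content: "DELA CRUZ"
    }
  }
}
'''

def block_to_json(block):
    """Turn one `{ ... }` block of the sample into a dict (hypothetical helper)."""
    # Quote scalar keys: display_name: "VIN" -> "display_name": "VIN"
    block = re.sub(r'^(\s*)(\w+):', r'\1"\2":', block, flags=re.MULTILINE)
    # Quote nested block names and add the colon: text_segment { -> "text_segment": {
    block = re.sub(r'^(\s*)(\w+)\s*\{', r'\1"\2": {', block, flags=re.MULTILINE)
    # Add a comma after a line unless it ends with "{" or the next line starts with "}".
    block = re.sub(r'(?<=[^{\s])[ \t]*\n(?=\s*[^}\s])', ',\n', block)
    return json.loads(block)

# Split on the payload keyword at the start of a line; drop the empty first piece.
blocks = [b for b in re.split(r'^payload\s*', sample, flags=re.MULTILINE) if b.strip()]
records = [block_to_json(b) for b in blocks]

fields = {r["display_name"]: r["text_extraction"]["text_segment"]["content"]
          for r in records}
print(fields)
```

Once each block is a dict, the score and offsets are available as numbers too, which the plain capture-group approach does not give you.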


Thank you, I will try it.


It looks like protobuf text format.
If you don’t have a schema definition, you can create your own based on the output and use the TextFormat parser from the protobuf library (or convert it to JSON, etc.).
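
If writing a schema is more than you need, the subset of text format used in the sample is small enough to read by hand. The Python sketch below is not the protobuf library's TextFormat parser; it is a hand-written stand-in that only understands `name { ... }` blocks and `name: value` scalars, and it ignores comments and string escapes. The function names are hypothetical.

```python
import re

# Shortened sample in the same shape as the text in the question.
sample = '''payload {
  annotation_spec_id: "4174790675084083200"
  display_name: "VIN"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 94
      end_offset: 118
      content: "2402-0161C-E2777JLP10000"
    }
  }
}
payload {
  annotation_spec_id: "8786476693511471104"
  display_name: "Last_Name"
  text_extraction {
    score: 1.0
    text_segment {
      start_offset: 138
      end_offset: 143
      content: "DELA CRUZ"
    }
  }
}
'''

def tokenize(text):
    # Quoted strings first, then single-character punctuation, then bare words/numbers.
    return re.findall(r'"(?:[^"\\]|\\.)*"|[{}:]|[^\s{}:"]+', text)

def parse_scalar(token):
    if token.startswith('"'):
        return token[1:-1]  # escape sequences are NOT decoded in this sketch
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token  # enum names, true/false, etc. stay as text

def parse_fields(tokens, pos=0):
    """Parse fields until a closing brace or end of input; return (dict, next_pos)."""
    message = {}
    while pos < len(tokens) and tokens[pos] != '}':
        name = tokens[pos]
        pos += 1
        if tokens[pos] == ':':
            pos += 1
        if tokens[pos] == '{':
            value, pos = parse_fields(tokens, pos + 1)
            pos += 1  # step over the closing brace
        else:
            value = parse_scalar(tokens[pos])
            pos += 1
        # Without a schema we cannot tell which fields repeat, so every field
        # is stored as a list (payload clearly repeats in the sample).
        message.setdefault(name, []).append(value)
    return message, pos

root, _ = parse_fields(tokenize(sample))
fields = {p["display_name"][0]:
          p["text_extraction"][0]["text_segment"][0]["content"][0]
          for p in root["payload"]}
print(fields)
```

Storing every field as a list makes access clumsy, which is exactly the information a real schema would supply: it tells the parser which fields are repeated and which are single values.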

