Help! How to extract information from unstructured data read from a PDF

Hi All

I need help extracting information from unstructured data read from a PDF.
I have converted the PDF to .txt format and the result is this:

University of Strathclyde, UK Master of Business Administration with Highest Distinction
National University of Singapore Bachelor of Accountancy - 2nd Lower Honours
Pioneer Junior College GCE A Level
Beatty Secondary School GCE O Level
University of London Bachelor of Science(Artifical Intelligent) - 2nd Lower Honours
Jurong Junior College GCE A Level
Nanhua High School GCE O Level
Nanyang Technological University Bachelor of Information (Computer Science)
NUS High School NUS High School Diploma
National University of Singapore PhD in Electrical and Computer Science Engineering
Nanyang Technological University Bachelor of Engineering (Electrical and Electronic Engineering) - 2nd Upper Honours
Ngee Ann Polytechnic Diploma in Electronics, Computer and Communication Engineering
Queensway Secondary School GCE ‘O’ Level

I need the bot to split each line of the result and output the following:
University of Strathclyde, UK
Master of Business Administration with Highest Distinction
National University of Singapore
Bachelor of Accountancy
Pioneer Junior College
GCE A Level
Beatty Secondary School
GCE O Level
University of London
Bachelor of Science(Artifical Intelligent)
Jurong Junior College
GCE A Level
Nanhua High School
GCE O Level
Nanyang Technological University
Bachelor of Information (Computer Science)
NUS High School
NUS High School Diploma
National University of Singapore
PhD in Electrical and Computer Science Engineering
Nanyang Technological University
Bachelor of Engineering (Electrical and Electronic Engineering)
Ngee Ann Polytechnic
Diploma in Electronics, Computer and Communication Engineering
Queensway Secondary School
GCE ‘O’ Level

Can you help with the workflow for this? The split will be based on institution and course of study.
Institution names generally contain keywords such as "Secondary School", "Polytechnic", "University", "High School", or "Junior College".

Please help :frowning:

@f5f191b0815b26e83996fd67f,

If there is a fixed pattern that identifies each field value, use regex. You can ask any LLM to help draft the regex for each field.
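As a rough illustration of the regex idea, here is a minimal sketch in Python (the same pattern should carry over to .NET regex in UiPath, where named groups are written `(?<name>...)` instead of `(?P<name>...)`). It relies on the institution keywords from the question plus one assumption of mine: that a course name starts with one of "Master", "Bachelor", "PhD", "Diploma" or "GCE", which holds for the sample lines but may not hold for the whole PDF. The names `LINE_PATTERN` and `split_record` are hypothetical. It also drops a trailing " - ..." grade, as in the expected output.

```python
import re

# Assumed words that begin a course name (taken from the sample lines only).
COURSE_START = r"Master|Bachelor|PhD|Diploma|GCE"

# Institution: shortest prefix ending in one of the listed keywords.
# "University" may be followed by "of <place>", which runs up to the
# first course-start word. A trailing " - <grade>" is left out of the course.
LINE_PATTERN = re.compile(
    r"^(?P<institution>.*?(?:Secondary School|High School|Junior College"
    r"|Polytechnic"
    r"|University(?:\s+of\s+.+?(?=\s+(?:" + COURSE_START + r")\b))?))"
    r"\s+(?P<course>.+?)"
    r"(?:\s+-\s+.*)?$"
)


def split_record(line):
    """Return (institution, course), or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("institution"), match.group("course")


if __name__ == "__main__":
    samples = [
        "University of Strathclyde, UK Master of Business Administration with Highest Distinction",
        "National University of Singapore Bachelor of Accountancy - 2nd Lower Honours",
        "NUS High School NUS High School Diploma",
    ]
    for sample in samples:
        result = split_record(sample)
        if result is not None:
            print(result[0])
            print(result[1])
```

Lines that return `None` are the ones to review by hand or to cover with an extra keyword.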

If there is no fixed pattern, you will have to use Document Understanding for this.

@f5f191b0815b26e83996fd67f

Given the structure, it's better to use a generative extractor, so that you can ask a question for each field and it can get the data dynamically.

Cheers

Hey @f5f191b0815b26e83996fd67f, if you have a large number of documents, say hundreds of PDFs in the same format, then I would suggest Document Understanding; it works faster for larger sets of PDF data. Otherwise, you can use string manipulation or regex.
cheers

Use Document Understanding to extract the data from documents that share a similar format. If you also have to extract a field that appears in only a few documents, you can use the Generative Extractor together with the ML Extractor.

Can you guide me on how to do it using string manipulation or regex?
The UiPath version I have is 2023.4.5, so I don't think it has the Document Understanding function…

Hey @f5f191b0815b26e83996fd67f, I can give you reference videos, but it's hard to give you the full logic. You have to practice it on your own and then extract the data.
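As a starting point for that practice, here is a minimal sketch of the string manipulation route, written in Python only to show the logic; in UiPath the same steps would map to .NET string methods such as `IndexOf`, `Substring` and `Split`. It assumes the institution keywords listed in the question and, as my own assumption, that a course begins with "Master", "Bachelor", "PhD", "Diploma" or "GCE". The function name `split_line` is hypothetical.

```python
INSTITUTION_KEYWORDS = (
    "Secondary School", "High School", "Junior College",
    "Polytechnic", "University",
)
# Assumption: words that begin a course name in the sample data.
COURSE_STARTS = ("Master", "Bachelor", "PhD", "Diploma", "GCE")


def split_line(line):
    """Return (institution, course), or None if no keyword is found."""
    line = line.strip()
    # End position of the institution keyword that finishes earliest.
    ends = [line.find(k) + len(k) for k in INSTITUTION_KEYWORDS if k in line]
    if not ends:
        return None
    cut = min(ends)
    # "University of X": the name continues until a course-start word.
    if line[cut:].startswith(" of "):
        starts = [line.find(" " + word, cut) for word in COURSE_STARTS]
        starts = [pos for pos in starts if pos != -1]
        if not starts:
            return None
        cut = min(starts)
    institution = line[:cut].strip()
    # Drop a trailing grade such as " - 2nd Lower Honours".
    course = line[cut:].split(" - ")[0].strip()
    if not course:
        return None
    return institution, course


if __name__ == "__main__":
    with_grade = "University of London Bachelor of Science(Artifical Intelligent) - 2nd Lower Honours"
    print(split_line(with_grade))
```

Cutting at the keyword that ends earliest is what keeps "NUS High School NUS High School Diploma" from being split at the second "High School".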

cheers