Trying to extract specific information from a text file

I have converted a set of company PDF files into text files. The only issue now is that the information is shown as:

Name NRIC Citizenship Position Date
James XX XXXX XXXXXXXX XXX XX/XX/XX
Jack XX XX (same as above)
July xx xx xx (same as above)

The names of the staff can vary from 2 to 5 words, and the citizenship can vary too. Hence, I have no idea how to locate specific fields such as name, NRIC, position and date regardless of the number of words per line. Hope someone is able to help!

Hi @Fedriq_N,

I have done something similar to what you are describing.

If there is a newline after each item as you have shown here, you could split the text up by newline to make the data easier to work with.

StringArrayVar = TextfileVar.Split({System.Environment.NewLine},StringSplitOptions.None)

On top of this, if the data in the PDF file was in a table before you read it into a text file, I have noticed that there will sometimes be double spaces ("  ") or similar between the values, which can be used to split each line into “columns”. This only works if the data you want contains no double spaces itself.
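A minimal sketch of the double-space idea, written in Python purely for illustration (the same pattern should work with .NET's `Regex.Split`). The sample line is hypothetical and assumes the conversion kept two spaces between columns, which may not be true for every PDF:

```python
import re

def split_columns(line):
    """Split one line into columns wherever two or more spaces occur."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]

# Hypothetical line: assumes double spaces survived between the columns.
sample = "James M Para  S1234567G  SINGAPORE CITIZEN  ACRA  16/04/1980"
print(split_columns(sample))
```

Single spaces inside a value, such as the ones in the name, are left alone, so multi-word names and citizenships stay in one column.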

I hope this helps. Unfortunately, it’s difficult to work out a perfect solution without the file itself, and I understand you can’t share it because it contains sensitive info.

Hi @Fedriq_N !

We need more details to help: would you mind replacing the xx with realistic dummy data, please? :grin:

  • For the dates, is it dd/mm/yyyy, dd/mm/yy or something else?
  • For the NRIC, is it a fixed number of digits (or characters)?
  • For the position, what type of values are we expecting? Numeric, letters, common words, a mix?
  • For the citizenship, do we have something to rely on (like an Excel file that already contains all the nationalities), or are the nationalities perhaps in capital letters so that a regex pattern can’t get it wrong?

Let us know !

Name ID Nationality/Citizenship Source of Date of Appointment
Address
Address Position Held

James M Para S1234567G SINGAPORE CITIZEN ACRA 16/04/1980

11 LUCKY CRESCENT Director
SINGAPORE (123456)

Lily W/O Mars Dune S2234567B SINGAPORE CITIZEN ACRA 16/04/1980

11 LUCKY CRESCENT Director
SINGAPORE (123456)

SOHANKUMAR JAGDISHCHANDRA PAREKH S3345678B SINGAPORE CITIZEN ACRA 02/01/2001

11 LUCKY CRESCENT Director
SINGAPORE (123456)

KIRANKUMAR JAGDISHCHANDRA PAREKH S4456789J SINGAPORE CITIZEN ACRA 30/12/1997

11 LUCKY CRESCENT Director
SINGAPORE (123456)

^^ This is the format of the information that is generated after I converted the PDF to a text file. I changed the sensitive information for security purposes. Hope you are able to help :slight_smile:

The NRIC can be either 8 or 9 characters long depending on the country, but it is mostly 9 characters. I tried using “ACRA” as a marker for the string manipulation in order to obtain the name and NRIC. I also used an If condition on the number of array elements: if the citizenship contains the word “Citizen”, the robot treats the last 2 elements as the citizenship (e.g. “Singapore Citizen”); if it doesn’t detect the word, the citizenship should be 1 element (e.g. “Australian”, “Indian”, “Japanese”).
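The “ACRA” marker approach can also be expressed as a single regular expression, which avoids counting words. The sketch below is in Python for illustration, and the pattern should carry over to .NET's `Regex`. It rests on assumptions the thread has not confirmed: the ID is one token of 8 or 9 uppercase letters/digits containing at least one digit, “ACRA” always sits right before the date, and the date is dd/mm/yyyy. The group names and `parse_officer_line` are made up for this example.

```python
import re

# Assumptions (not confirmed in the thread): the ID is a single token of 8 or 9
# uppercase letters/digits with at least one digit, "ACRA" directly precedes
# the date, and the date is written dd/mm/yyyy.
OFFICER_LINE = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<nric>(?=[A-Z0-9]*\d)[A-Z0-9]{8,9})\s+"
    r"(?P<citizenship>.+?)\s+ACRA\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})$"
)

def parse_officer_line(line):
    """Return a dict of fields, or None if the line is not an officer line."""
    match = OFFICER_LINE.match(line.strip())
    return match.groupdict() if match else None

sample = "Lily W/O Mars Dune S2234567B SINGAPORE CITIZEN ACRA 16/04/1980"
print(parse_officer_line(sample))
```

Because the ID token is the only word that must contain a digit, everything before it is taken as the name and everything between it and “ACRA” as the citizenship, however many words each has. Address lines such as `11 LUCKY CRESCENT Director` do not match and return `None`, so they can be handled separately.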

Hi @Fedriq_N !

Still working on it, but I need more information:

  • Let’s say the person has only French nationality: will it appear as FRENCH CITIZEN ACRA?
  • Let’s say the person has 2 nationalities (French + Singaporean): will it appear as FRENCH SINGAPORE CITIZEN ACRA?
  • For the address and position held, I am having some trouble: the address you want is 11 LUCKY CRESCENT SINGAPORE (123456), right (so spread over 2 lines)? Or is 11 LUCKY CRESCENT enough?
  • Is the position always a single word at the end of the line (Director, etc.)?

Thanks :smiley:

Can I ask you if it’s possible to search for a keyword in a textfile and delete EVERYTHING below that keyword? Either save the entire chunk of text below the keyword as a variable and string.Remove or scan the entire chunk of text and replace with " "?

Yes we can try :grinning_face_with_smiling_eyes:
What would you be searching for as a keyword?

Abbreviation

UL - Local Entity not registered with ACRA

UF - Foreign Entity not registered with ACRA

AR - Annual Return

AGM - Annual General Meeting

FS - Financial Statements

FYE - Financial Year End

OSCARS - One Stop Change of Address Reporting Service by Immigration & Checkpoint Authority.

Authentication No. : L12345678O

Page 3 of 4ACCOUNTING AND CORPORATE REGULATORY AUTHORITY
(ACRA)

WHILST EVERY ENDEAVOR IS MADE TO ENSURE THAT INFORMATION PROVIDED IS UPDATED AND CORRECT. THE AUTHORITY
DISCLAIMS ANY LIABILITY FOR ANY DAMAGE OR LOSS THAT MAY BE CAUSED AS A RESULT OF ANY ERROR OR OMISSION.

Business Profile (Company) of SMT INTERNATIONAL PTE LTD (123456789Z) Date: 29/06/2020

Note :

  • The information contained in this product is collated from lodgements filed with ACRA, and/or information collected by other government sources.

  • The list of officers for this entity is available for online authentication within 30 days from the date of purchase of this Business Profile. Please scan
    the QR code available on the last page of this profile to access the authentication page. For more information, please visit www.acra.gov.sg.

FOR REGISTRAR OF COMPANIES AND BUSINESS NAMES
SINGAPORE

RECEIPT NO. : ACRA123456789101 (Free Business Profile by ACRA)

DATE : 29/06/2020

This is computer generated. Hence no signature required.

Authentication No. : L12345678O

Page 4 of 4

^^
I would like to delete everything below the word “Abbreviation”. There is a lot of text above the word as well, but I only copied and pasted this portion.
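One way to do this is to cut the text at the first line that consists only of the keyword. This is a minimal sketch in Python for illustration: `cut_from_keyword` and its `keep_keyword` flag are hypothetical names, and it assumes “Abbreviation” appears on a line of its own, as in the paste above.

```python
def cut_from_keyword(text, keyword="Abbreviation", keep_keyword=False):
    """Drop everything from the first line that consists solely of `keyword`.

    If no such line exists, the text is returned unchanged.
    """
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.strip() == keyword:
            # Keep the keyword line itself only when asked to.
            end = index + 1 if keep_keyword else index
            return "".join(lines[:end])
    return text

sample = "Officer details\nAbbreviation\nUL - Local Entity not registered with ACRA\n"
print(repr(cut_from_keyword(sample)))
```

Matching a whole line rather than searching for the word anywhere avoids cutting too early if “Abbreviation” also appears inside a sentence further up. In a .NET expression the simpler variant would be along the lines of `text.Substring(0, text.IndexOf("Abbreviation"))`, with a check first that `IndexOf` did not return -1.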