I have converted a list of company PDF files into text files. The only issue now is that the information is shown as:
Name NRIC Citizenship Position Date
James XX XXXX XXXXXXXX XXX XX/XX/XX
Jack XX XX (same as above)
July xx xx xx (same as above)
The staff names can vary from 2 to 5 words, and the citizenship can vary in length too. Hence, I have no idea how to locate specific fields like name, NRIC, position and date regardless of the number of words per line. Hope someone is able to help!
On top of this, if the data in the PDF file was in a table before you read it into a text file, I noticed that there are sometimes double spaces or similar between the values, which can be used to split each line into “columns”. This works as long as the data you want doesn’t itself contain double spaces.
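As a minimal sketch of the double-space idea (in Python rather than the tool used in this thread, and with a made-up sample line, since the real file can't be shared), each line can be split wherever two or more whitespace characters appear in a row. The function name `split_columns` is hypothetical.

```python
import re


def split_columns(line):
    """Split one extracted line into columns on runs of 2+ whitespace characters."""
    parts = re.split(r"\s{2,}", line.strip())
    return [part for part in parts if part]


# Made-up sample line: columns separated by double spaces,
# single spaces only inside a value.
line = "TAN AH KOW JAMES  S1234567A  SINGAPORE CITIZEN  Director  01/01/2020"
columns = split_columns(line)
print(columns)
```

This only works if the extraction really leaves at least two spaces between columns and never inside a value, so it is worth checking on a few real lines first.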
I hope this helps. Unfortunately it’s difficult to work out a perfect solution without the file, which you can’t share because it contains sensitive info.
We need more details to help: would you mind replacing the xx with realistic dummy data, please?
For instance, for the dates, is it dd/mm/yyyy, dd/mm/yy or something else?
For the NRIC, is it a fixed number of digits (or characters)?
For the position, what type of values are we expecting? Numeric, letters, common words, a mix?
For the citizenship, do we have something to go on (like an Excel file that already contains all the nationalities), or are the nationalities perhaps in capital letters so that a regex pattern can’t get it wrong?
^^ This is the format of the information that’s generated after I converted to a text file. I changed the sensitive information for security purposes. Hope you are able to help.
The NRIC can be either 8 or 9 digits/characters depending on the country, but it should mostly be 9 characters. I tried using “ACRA” as a benchmark for string manipulation in order to obtain the name and NRIC. I also used an If condition to determine the number of array elements. If the citizenship contains the word “Citizen”, the last 2 array elements are treated as the citizenship, e.g. “Singapore Citizen”; if the robot doesn’t detect the word, it should be 1 element, e.g. “Australian”, “Indian” or “Japanese”.
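The logic described above could be sketched roughly as follows. This is written in Python for illustration only, and it assumes a layout that has not been confirmed in this thread: name, then NRIC, then citizenship, then the word “ACRA”, all on one line and separated by single spaces. The function name `parse_before_acra` and the sample lines are made up.

```python
def parse_before_acra(line, marker="ACRA"):
    """Split the text before the marker into name, NRIC and citizenship.

    Assumed layout (hypothetical): NAME... NRIC CITIZENSHIP... ACRA ...
    """
    tokens = line.split()
    if marker not in tokens:
        return None
    tokens = tokens[:tokens.index(marker)]

    # "X CITIZEN" takes the last two tokens, otherwise only the last one.
    n_cit = 2 if tokens and tokens[-1].upper() == "CITIZEN" else 1
    if len(tokens) < n_cit + 2:  # need at least one name word and an NRIC
        return None

    return {
        "name": " ".join(tokens[:-n_cit - 1]),
        "nric": tokens[-n_cit - 1],
        "citizenship": " ".join(tokens[-n_cit:]),
    }


print(parse_before_acra("TAN AH KOW JAMES S1234567A SINGAPORE CITIZEN ACRA 11 LUCKY CRESCENT"))
print(parse_before_acra("JOHN SMITH E1234567 AUSTRALIAN ACRA 11 LUCKY CRESCENT"))
```

Because the fields are taken from the right-hand side first, the name can be any number of words. A person with two nationalities, or a citizenship of more than one word without “Citizen”, is not handled by this sketch.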
Let’s say that the person has only the French nationality: will it appear as FRENCH CITIZEN ACRA?
Let’s say that the person has 2 nationalities (French + Singapore): will it appear as FRENCH SINGAPORE CITIZEN ACRA?
For the address and position held, I am having some trouble: the address that you want is 11 LUCKY CRESCENT SINGAPORE (123456), right (so on 2 lines)? Or is 11 LUCKY CRESCENT enough?
Is the position only one word at the end of the line (Director, etc.)?
Can I ask if it’s possible to search for a keyword in a text file and delete EVERYTHING below that keyword? Either save the entire chunk of text below the keyword as a variable and use string.Remove, or scan the entire chunk of text and replace it with " "?
OSCARS - One Stop Change of Address Reporting Service by Immigration & Checkpoint Authority.
Authentication No. : L12345678O
Page 3 of 4ACCOUNTING AND CORPORATE REGULATORY AUTHORITY
(ACRA)
WHILST EVERY ENDEAVOR IS MADE TO ENSURE THAT INFORMATION PROVIDED IS UPDATED AND CORRECT. THE AUTHORITY
DISCLAIMS ANY LIABILITY FOR ANY DAMAGE OR LOSS THAT MAY BE CAUSED AS A RESULT OF ANY ERROR OR OMISSION.
Business Profile (Company) of SMT INTERNATIONAL PTE LTD (123456789Z) Date: 29/06/2020
Note :
The information contained in this product is collated from lodgements filed with ACRA, and/or information collected by other government sources.
The list of officers for this entity is available for online authentication within 30 days from the date of purchase of this Business Profile. Please scan
the QR code available on the last page of this profile to access the authentication page. For more information, please visit www.acra.gov.sg.
FOR REGISTRAR OF COMPANIES AND BUSINESS NAMES
SINGAPORE
RECEIPT NO. : ACRA123456789101 (Free Business Profile by ACRA)
DATE : 29/06/2020
This is computer generated. Hence no signature required.
Authentication No. : L12345678O
Page 4 of 4
^^
I would like to delete everything below the word “Abbreviation”. There is a bunch of text above the word, but I just copied and pasted this portion.
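One way to do this, sketched here in Python as an illustration, is to find the position of the keyword and keep only the text up to it. The function name `truncate_at` and the sample text are made up; in .NET the same idea can be expressed with `IndexOf` and `Substring`.

```python
def truncate_at(text, keyword, keep_keyword=True):
    """Return text with everything after the first occurrence of keyword removed."""
    idx = text.find(keyword)
    if idx == -1:
        # Keyword not found: leave the text unchanged.
        return text
    end = idx + len(keyword) if keep_keyword else idx
    return text[:end]


# Made-up stand-in for the extracted file contents.
sample = (
    "Name of officers ...\n"
    "Abbreviation\n"
    "OSCARS - One Stop Change of Address Reporting Service\n"
    "Page 4 of 4\n"
)
cleaned = truncate_at(sample, "Abbreviation")
print(cleaned)
```

Note that this cuts at the first occurrence of the keyword, so it assumes “Abbreviation” does not appear earlier in the part of the text you want to keep.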