What text structure would you want to get? I understand that you are just extracting data from this text, so do you need these paragraphs? What structure were you getting previously? What was the issue with it?
To be honest, the structure probably won’t affect the end result; it is just easier to read while I am diagnosing each issue. I may also try to use the special characters to help me get some of the data I need, as they produce extra breaks. I have run into a problem with one piece of data: the data I need is always between two words, but there is other data there too that could come before or after the text I want, with no fixed string length or string count to use to pinpoint my data.
What “other data” are you looking for? Is there any way you can distinguish it from the text it comes before or after? If we can somehow pinpoint the “core” that your data precedes or follows… Or do you just know it from reading the text, in a way not available to a robot? Can you give an example of such text (it can certainly be something made up, just to present the pattern)?
“Title deed number 355566 “text I want” 5557 Area”
The text I want is always between “title deed number” and “area”.
Now there may or may not be numbers (or text) on either side of the text I want, and a random number of spaces between data on different PDF extracts, etc. My data may also contain numbers.
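As a rough illustration of the pattern described above, here is a minimal Python sketch that captures everything between the two marker phrases, whatever the spacing. The helper name `between_markers` and the sample string are made up for the example. It only isolates the span; it cannot tell the wanted text apart from stray numbers on either side without some further delimiter, and it would stop early if the wanted text itself contained the word “area”.

```python
import re

# Lazy capture between the two marker phrases, ignoring case and
# tolerating any amount of whitespace (including line breaks).
MARKER_PATTERN = re.compile(
    r"title\s+deed\s+number\s*(.*?)\s*\barea\b",
    re.IGNORECASE | re.DOTALL,
)

def between_markers(text):
    """Hypothetical helper: return the text between the markers, or None."""
    match = MARKER_PATTERN.search(text)
    if match is None:
        return None
    # Collapse runs of whitespace so different PDF extracts compare equal.
    return " ".join(match.group(1).split())

sample = "Title deed number 355566   text I want  5557   Area"
print(between_markers(sample))
```

The captured span still includes the surrounding values (here `355566` and `5557`), which is why an extra delimiter in the extracted text is needed to trim it further.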
I’m looking at the data with the special characters (Arabic) and it will give me many more options to split the data where I want.
I copied the special characters to use in the Matches process, but it doesn’t like them.
Ok, I have added : into my allowed characters. This may help.
OK, as for Arabic, here is what might help - in case you still haven’t installed it:
As for the regex, give me a moment… but just to confirm - in the given example, would you want to extract “5557” together with the text you want or not (as you have stated: “The text I want is always between “title deed number” and “area””)? If not, is the unwanted part that precedes “Area” always going to be one string not separated by spaces (whether it contains digits, letters, or both)? Is the unwanted part right after the expression “Title deed number” always going to be such a string with no spaces too? Your string example:
“Title deed number 355566 “text I want” 5557 Area”
I would just want “the text I want”, and unfortunately what follows could be numbers or text and could be more than one string!
But I think the : will help me. Working on it now
Not sure why this has started happening,
This part of the sequence checks whether there is data in the txt (PDF) or not. If not, it moves the file. But for some reason it now continues to extract further data from that file in later sequences?!
What further data does it extract? Do you mean that “Name Arabic:” is not working as a delimiter any more?
After name I’m then extracting email address, phone number, project name, property number, and some more. Each with its own difficulties.
But the name extraction is fine. It is just in the sequence above, which is the first sequence: if there is no name, then I want to move the file and move on to the next one. But it seems to be moving the file and still taking the data and putting it into Excel.
And what is your condition in the IF? Is it that GoodPDForNot? Also, perhaps it is starting to make sense for you either to begin posting each issue separately, to let others keep track and help you, or to send them to me via messages.
If that text doesn’t exist in the txt file then there is no data in it and I don’t want to process it any further and move it to another folder
From the most recent changes, you have mentioned adding a colon to your allowed characters - perhaps something about it is affecting your condition… Would it work for you to rewrite the condition as: PDFText.Contains("SECOND PARTY") and PDFText.Contains("BUYER") and PDFText.Contains("NameEnglish")? This assumes “NameEnglish” is really the text you are looking for and not the name of one of your string variables; in that case it would be written without the quotation marks. Please see the syntax advised below:
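Purely for illustration, the same three-part check can be written as a small Python sketch (the workflow itself would use its own expression syntax). The function name `has_buyer_data` is hypothetical, and the three marker strings are taken from the condition suggested above. The check is case-sensitive, so the markers must match the extracted text exactly.

```python
# Marker strings that must all be present for the file to count as "good".
REQUIRED_MARKERS = ("SECOND PARTY", "BUYER", "NameEnglish")

def has_buyer_data(pdf_text):
    """Hypothetical helper: True only if every marker appears in the text."""
    return all(marker in pdf_text for marker in REQUIRED_MARKERS)

print(has_buyer_data("SECOND PARTY (BUYER) NameEnglish: John Smith"))
print(has_buyer_data("SECOND PARTY (BUYER)"))
```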
Annoyingly this became very easy once I had put the colons in. All done now apart from the move file problem
The sequence does detect whether there is data or not and moves the file accordingly, but then it continues the workflow with that file, extracting data and putting it into Excel. I want to move the file and then stop processing it.
I don’t know your whole workflow, but I am guessing that this unwanted file is still attached to some variable which you are processing further (hence it is actioned anyway). It is always worth checking how your For Each is built and what it uses, as well as what your If condition contains… I am just curious - have you applied my “PDFText.Contains” version? Does this problem with Move File still persist?
Let me check this. I might be talking nonsense. Let me run the files again
So, it moves the files OK once it detects no data. But it still adds random stuff to the Excel sheet.
Is there no way of stopping processing of the file once it is moved?
Every extraction process is inside one For Each file in Directory.GetFiles etc.
I have done it like this so that it writes the data row after all data has been extracted per file.
So is absolutely every part where your automation pulls any data out of the files and enters it into Excel included in the “Then” of your IF condition PDFText.Contains? Or is any such part left outside your condition?
Only the buyer’s name extraction is in “Then”. The rest comes after.
I guess that’s the answer
I need to put everything else into “then”?
Put it in a separate sequence and invoke it in the “Then”?
I would start from here. You may test it first, though, by moving just a part and seeing if this “affects the effect”.
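The control flow being discussed can be sketched in Python as follows. This is not the workflow itself, only an illustration under assumptions: `process_folder`, `is_good` and `extract_row` are hypothetical names, and the files are assumed to be the `.txt` extracts of the PDFs. The point is that every extraction step sits on the “good file” path, so a file that has been moved contributes nothing to the output rows.

```python
import shutil
from pathlib import Path

def process_folder(input_dir, rejected_dir, is_good, extract_row):
    """Hypothetical sketch: build one output row per good file.

    is_good(text) decides whether the file contains data;
    extract_row(text) stands in for all of the extraction steps.
    """
    rows = []
    rejected = Path(rejected_dir)
    rejected.mkdir(parents=True, exist_ok=True)

    # sorted() materialises the file list before any file is moved.
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if not is_good(text):
            # No data: move the file and skip every later step for it.
            shutil.move(str(path), str(rejected / path.name))
            continue
        # Only reached for good files, so moved files never add a row.
        rows.append(extract_row(text))
    return rows
```

Placing all the extraction inside the “good” branch, or skipping to the next file straight after the move as shown here, gives the same result: nothing from a rejected file reaches the spreadsheet.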