Hi,
We have a scenario in which we want to extract standard information from a PDF. We are able to get the information into a text file, and the format is preserved as well.
The problem we are facing is with the scenario below:
aws gds ppt tyt iop eqw
0 1 0 8 9
How should we extract the above information?
We have tried regex and splitting by space, but neither helps. As you can see, gds above has no value, so splitting by space shifts the values to the left: the 1 that belongs to ppt ends up under gds, and so on.
Can anybody advise on a suitable approach?
(aws has 0
gds has nothing below it
ppt has 1
and so on)
The suggestion below assumes that you had the PreserveFormat option enabled when using the Read PDF Text activity.
With PreserveFormat enabled, there are a couple of options to check. One of them is the following:
Use the Generate Data Table From Text activity with predefined column sizes.
If you know for sure that the size of the table and of its columns will remain the same (this can also be checked by testing with PDFs containing different data), then we could go with this approach.
Also let us know whether the data to be extracted is in a table format, meaning all the borders of the table are visible.
It would also help if you could provide the sample text file (with PreserveFormat enabled), so that we can check the extraction as well.
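To illustrate why predefined column sizes avoid the left shift, here is a minimal Python sketch (not a UiPath workflow). The sample lines, the width of 5 characters per column, and the helper name split_fixed_width are all assumptions made for the example; the real widths would have to be measured from the PreserveFormat output.

```python
def split_fixed_width(line, widths):
    """Cut a line into cells of the given widths; an empty cell becomes ''."""
    cells = []
    start = 0
    for width in widths:
        cells.append(line[start:start + width].strip())
        start += width
    return cells


# Hypothetical sample, assuming the spacing survived the text extraction.
header = "aws  gds  ppt  tyt  iop  eqw"
values = "0         1    0    8    9"
widths = [5, 5, 5, 5, 5, 5]  # assumed column sizes, measured from the header

# Splitting on whitespace loses the empty gds cell, so only 5 values remain.
shifted = values.split()

# Slicing by position keeps the empty cell in place.
names = split_fixed_width(header, widths)
cells = split_fixed_width(values, widths)
row = dict(zip(names, cells))
print(shifted)
print(row)
```

This only works while every value stays inside its column's width, which is why testing with several different PDFs matters.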
Hi,
After trying this option, we still see the same issue: everything ends up in one column, which is not the required result.
The extracted data is in a tabular format that is visible in the PDF but not in the text file.
Please find the text file attached. Let me know if each value can be output in a separate cell. experiment.txt (3.5 KB)
I have detected the length/size of each column and tried to use those sizes in the Generate Data Table From Text activity.
However, some filtering/cleaning has to be done first, so that only the portion of interest is taken and the table extraction is performed on it.
Check whether it extracts the table data properly from different data files whose values vary in length.
Let us know if it doesn’t work for some of them, and if possible provide the data that fails, so we can conclude whether column sizes are a suitable extraction method for your case.
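As a rough sketch of how column sizes can be detected rather than typed in by hand, the Python below takes the column offsets from where each header word starts and then slices every data line at those offsets. It assumes each value begins at or after its header's start position and before the next header's start position; the sample lines and the function names column_spans, parse_row and to_csv are hypothetical. The filtering step that isolates the table from the rest of the text is not covered.

```python
import csv
import io
import re


def column_spans(header_line):
    """Return (start, end) offsets per column, taken from where each header word begins."""
    starts = [m.start() for m in re.finditer(r"\S+", header_line)]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    return list(zip(starts, ends))


def parse_row(line, spans):
    """Slice one text line at the given offsets; missing values become ''."""
    return [line[start:end].strip() for start, end in spans]


def to_csv(header_line, data_lines):
    """Build CSV text from a header line and its data lines."""
    spans = column_spans(header_line)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(parse_row(header_line, spans))
    for line in data_lines:
        writer.writerow(parse_row(line, spans))
    return buffer.getvalue()


# Hypothetical sample lines, assuming the spacing survived the text extraction.
header = "aws  gds  ppt  tyt  iop  eqw"
rows = ["0         1    0    8    9"]
csv_text = to_csv(header, rows)
print(csv_text)
```

If a value is longer than its header and spills into the next column's range, this approach breaks, which is the case worth testing with varying data.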
Hi,
I am facing an issue opening the project: some of the activities are shown as missing, but I am still able to see the structure of the flow.
The major issue we will face is this:
we have PDF files, each with more than 8 pages and each containing multiple tables (the format is fixed and varies only in the places where an address appears).
We would have to use multiple regular expressions for extraction and pre-processing, so we are looking for an option that converts a PDF directly into a CSV.
Any suggestions? Also, it will be difficult to share the data, as masking it takes a lot of manual effort on my side.
FYI:
I have already implemented the solution in Python using the tabula wrapper, but I am looking for something similar in UiPath.
Let me know which version of Studio you are using. Also, have you tried opening it as a separate project, or are you opening it within another project?