Hi, I’m having trouble extracting data from a PDF table. I tried the suggestions from here but was not able to solve the issue. Need help!
My data is confidential, so I will use the sample file from another thread to explain: https://global.discourse-cdn.com/uipath/original/3X/0/8/08d920acd8924b1c5153f06859df13f22f60cb3b.pdf (see the table on page 2). My PDF table is similar, with a few more columns. Using “Read PDF Text” I get a string in which rows are separated by newlines and columns by spaces. I can split the string on newlines to separate the rows, but the problem comes when separating the columns, because there is no pattern I can use to split them. In the table above, imagine that some driver names have 2 words while others have 3-4, and the car name has no fixed pattern. Imagine another column called “Team” whose entries also have 2-4 words. So I think that, to get the data into a DataTable, I need to split the columns some other way.
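To make the problem concrete, here is a minimal Python sketch (not a UiPath expression). The two rows are made-up stand-ins for the extracted text, not data from the linked PDF. Splitting rows on newlines works, but splitting each row on spaces yields a different number of tokens per row, so the tokens cannot be mapped to columns by position.

```python
# Made-up rows in the shape "position, driver name, car name";
# the names are invented purely for illustration.
text = "1 Ann Lee Falcon\n2 Juan Carlos Ruiz Falcon GT Sport\n"

# Rows split cleanly on newlines.
rows = text.splitlines()

# Columns do not: a space separates columns AND the words inside a column.
tokens_per_row = [row.split(" ") for row in rows]
token_counts = [len(tokens) for tokens in tokens_per_row]

print(token_counts)
```

The two rows have the same three logical columns but produce 4 and 7 tokens, which is exactly the ambiguity described above.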
I also tried having the robot open the PDF and scrape the data, but data scraping gives me weird output. I am thinking of using a loop with Get Text, changing the selector to point to different cells of the PDF. But in UI Explorer I cannot find any attributes that point to different cells of the PDF table, and I don’t know how to explore the PDF structure in more detail.
Hi @sushildarveshi … thanks for the suggestions, but unfortunately it doesn’t work.
I’ll explain what happens; maybe then you or someone else can help me more easily.
I followed your sequence…
1. Get PDF text - OK
2. Write text to a txt file - OK, but the column structure is already lost (screenshot below)
3. Open the txt file - did this manually
4. Copy the table - done manually to test
5. Paste into Excel - done, everything is pasted into the first column
6. Fixed width - unable to split by column (screenshot below)
Screenshot after writing to the text file (some information censored for confidentiality):
I was facing the same problem but got it working, though not the way I described in my earlier post.
Here is what I am doing.
Read the PDF text and store it in a string variable, say output.
To extract, say, the PO number from the text:
(output.Split({"P/O Number :"}, StringSplitOptions.None)(1).Trim).Split({Environment.NewLine}, StringSplitOptions.None)(0).Trim
“P/O Number :” appears in your PDF and serves as the identifier.
The final (0) in the expression above ensures the text is pulled from the same line as P/O Number. If you replace this 0 with 1, it extracts the next line instead.
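For readers who want to try the same idea outside UiPath, here is a rough Python equivalent of that expression. The function name `value_after_label` and the sample text are hypothetical, made up for illustration. It takes everything after the first occurrence of the label, splits it into lines, and returns the line at the given offset.

```python
def value_after_label(text: str, label: str, line_offset: int = 0) -> str:
    """Return the text found `line_offset` lines after `label`.

    line_offset=0 gives the rest of the label's own line,
    line_offset=1 gives the following line, and so on.
    """
    # Everything after the first occurrence of the label, trimmed.
    after = text.split(label, 1)[1].strip()
    # splitlines() handles both "\n" and "\r\n" line endings.
    return after.splitlines()[line_offset].strip()


# Made-up sample text, not taken from any real PDF.
sample = "Vendor : ACME\nP/O Number : 4500123\nDate : 01.02.2019\n"

same_line = value_after_label(sample, "P/O Number :")      # "4500123"
next_line = value_after_label(sample, "P/O Number :", 1)   # "Date : 01.02.2019"
print(same_line)
print(next_line)
```

Like the original expression, this raises an error if the label is missing or the offset runs past the last line, so real use would need a check for both cases.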
If you want to read multiple lines one by one, replace the 0 with a variable v and use a Do While loop that increases v by 1 on each pass.
To end the loop, check for some identifier that marks the end of the table. Mine had “****” at the end of the table, so I used that to check whether the table had ended.
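The loop described above could look roughly like this in Python. This is a sketch under assumptions: the header text `"Item Qty"`, the helper name `lines_until_marker`, and the sample text are all invented, and only the `"****"` end marker comes from the post. Iterating over the lines plays the role of increasing v by 1.

```python
from typing import List


def lines_until_marker(text: str, start_label: str, end_marker: str = "****") -> List[str]:
    """Collect the lines below `start_label` until a line contains `end_marker`."""
    # Text after the first occurrence of the label. It is deliberately not trimmed,
    # so index 0 is always the remainder of the label's own line.
    lines = text.split(start_label, 1)[1].splitlines()

    rows = []
    for line in lines[1:]:          # start at the first line below the label
        if end_marker in line:      # end-of-table identifier reached
            break
        if line.strip():            # ignore blank lines
            rows.append(line.strip())
    return rows


# Made-up sample text with a table header, two rows, and the "****" terminator.
sample = "P/O Number : 4500123\nItem Qty\nBolt M8 10\nHex nut M8 25\n****\nTotal 35\n"

rows = lines_until_marker(sample, "Item Qty")
print(rows)
```

If the end marker never appears, this version simply stops at the end of the text, which avoids an endless loop.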
Once each line is extracted, you can take part of the string with Substring, for example output.Substring(5, 3),
where 5 is the zero-based index of the first character to take and 3 is the number of characters.
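As an illustration of the Substring step, here is a small Python sketch. The sample line and the `COLUMNS` layout are assumptions made up for the example. It only works when every column starts at a fixed character position, which is not the case for multi-word driver or car names.

```python
def substring(s: str, start: int, length: int) -> str:
    """Mimic Substring(start, length): `length` characters from zero-based `start`."""
    return s[start:start + length]


# Made-up fixed-width line: characters 0-3 item number, 5-7 code, 9-13 price.
line = "0001 ABC 12.50"

code = substring(line, 5, 3)

# Hypothetical column layout as (start, length) pairs.
COLUMNS = {"item": (0, 4), "code": (5, 3), "price": (9, 5)}
row = {name: substring(line, start, length).strip()
       for name, (start, length) in COLUMNS.items()}

print(code)
print(row)
```

One difference to keep in mind: .NET's Substring throws an exception when the range runs past the end of the string, while Python slicing silently returns a shorter result.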
But since the car and driver names can be multiple words, I don’t know how to identify which text belongs to which column. Please post the solution if you solve it, as I need it too. If I figure it out, I will post the solution as well.