I would like to shred each line in my PDF to successive rows in an excel file. Here is what I’m looking for:
Each line should be written into new rows in excel.
A column against each row which identifies the paragraph number, line number and page number of line that was shredded.
I have a dictionary of keywords on excel, I want to check if each shredded line contains ANY of my keywords. If the keyword is detected, the corresponding cell in excel needs to be highlighted.
I’m a beginner and so far I have been able to read the PDF file, write data into a text file and then to the excel. However, the entire data from the text file is being written into the first row. I want each line to be written into new rows. I tried using the split function, but am not able to achieve this.
Could one you please help me out?
I’m actually working on an enterprise project, if anyone of you is interested to collaborate please do let me know.
No need to write it to a text file, you can do it all within memory.
Read pdf and save to a string variable i’ll call PDFtext
Build datatable activity. Have it be completely blank, but create one string column named PDFtext. Save to a datatable variable I’ll call dt1.
In a for each activity (be sure to change the TypeArgument property to string): For each line in Strings.Split(PDFtext,Environment.Newline)
3a. Add datatable activity. Datatable to add: dt1 ArrayRow: {line}
Write range activity: choose your excel file path and sheet name. Datatable = dt1
With the above 4 steps you should have an excel file where each row in the PDF is now in an excel row. Now that you’ve done that you can tackle the next portion separately. I don’t want to make the entire workflow for you as that would inhibit your learning
I want to compare each line in the excel file with my keywords dictionary which is also in excel. If the keyword is present in the particular row line, the same should be highlighted.
Is there a way to append in a separate column the line number, paragraph number and page number from the source PDF file against each of the rows in excel?
I dont think it’s possible to do part 2. At least not in any way that I can think of that would be feasible for a reliable business solution.
For part 1, you can do this easily enough. Are you actually using a dictionary for the keywords? I would instead recommend saving the keywords to an array of strings. I’ll call this arrKeywords I will also assume you still have the datatable variable called dt1 from before. If you don’t, then use the read range activity from excel and save as dt1 (or whatever you want to name the variable)
Use a ‘for each row’ activity for dt1. The only activity in this for each row will be a single if activity.
In an if activity, put the following as the condition: arrKeywords.Contains(row.item(“PDFtext”).ToString, StringComparer.CurrentCultureIgnoreCase)
On the true side, use the ‘Set Range Color’ activity. The color can be anything you want, check out the options by typing color. and seeing what pops-up. The range should be: “A” + (dt1.indexof(row) + 1).ToString - thiis makes it so it colors the current cell in the “A” range. The +1 is because the datatable starts at index = 0, but excel starts at index = 1. You can provide another column if you want it to extend beyond column A. Just append + “Z” + (dt1.indexof(row) + 1).ToString (assuming you want columns A:Z highlighted).
Leave the false side blank
@Dave Thanks for your guidance. I have followed your steps, however the items are not being highlighted in the excel. I’m attaching my workflow here, could you please take a look?
I also read in one of the forum topics that the string array can hold only 3 items - is it true? Do I need to use a list for this? I’m also attaching the dictionary file here.