Shred each line on PDF to new rows on excel

Hello,

I would like to shred each line in my PDF to successive rows in an excel file. Here is what I’m looking for:

  1. Each line should be written into new rows in excel.
  2. A column against each row which identifies the paragraph number, line number and page number of line that was shredded.
  3. I have a dictionary of keywords on excel, I want to check if each shredded line contains ANY of my keywords. If the keyword is detected, the corresponding cell in excel needs to be highlighted.

I’m a beginner and so far I have been able to read the PDF file, write data into a text file and then to the excel. However, the entire data from the text file is being written into the first row. I want each line to be written into new rows. I tried using the split function, but am not able to achieve this.

Could one you please help me out?

I’m actually working on an enterprise project, if anyone of you is interested to collaborate please do let me know.

Thanks!

No need to write it to a text file, you can do it all within memory.

  1. Read pdf and save to a string variable i’ll call PDFtext
  2. Build datatable activity. Have it be completely blank, but create one string column named PDFtext. Save to a datatable variable I’ll call dt1.
  3. In a for each activity (be sure to change the TypeArgument property to string): For each line in Strings.Split(PDFtext,Environment.Newline)
    3a. Add datatable activity. Datatable to add: dt1 ArrayRow: {line}
  4. Write range activity: choose your excel file path and sheet name. Datatable = dt1

With the above 4 steps you should have an excel file where each row in the PDF is now in an excel row. Now that you’ve done that you can tackle the next portion separately. I don’t want to make the entire workflow for you as that would inhibit your learning :slight_smile:

Thanks @Dave. I was able to follow most of your instructions.

I’m getting the following error:

19.7.0+Branch.master.Sha.8c253d13718eed5c7db27daef6facd1fe1b0d067

Source: Add Data Row

Message: Cannot set column ‘PDFtext’. The value violates the MaxLength limit of this column.

Exception Type: System.ArgumentException

RemoteException wrapping System.ArgumentException: Cannot set column ‘PDFtext’. The value violates the MaxLength limit of this column.
at System.Data.DataColumn.CheckMaxLength(DataRow dr)
at System.Data.DataTable.RaiseRowChanging(DataRowChangeEventArgs args, DataRow eRow, DataRowAction eAction, Boolean fireEvent)
at System.Data.DataTable.SetNewRecordWorker(DataRow row, Int32 proposedRecord, DataRowAction action, Boolean isInMerge, Boolean suppressEnsurePropertyChanged, Int32 position, Boolean fireEvent, Exception& deferredException)
at System.Data.DataTable.InsertRow(DataRow row, Int64 proposedID, Int32 pos, Boolean fireEvent)
at System.Data.DataRowCollection.Add(Object values)
at UiPath.Core.Activities.AddDataRow.Execute(CodeActivityContext context)
at System.Activities.CodeActivity.InternalExecute(ActivityInstance instance, ActivityExecutor executor, BookmarkManager bookmarkManager)
at System.Activities.ActivityInstance.Execute(ActivityExecutor executor, BookmarkManager bookmarkManager)
at System.Activities.Runtime.ActivityExecutor.ExecuteActivityWorkItem.ExecuteBody(ActivityExecutor executor, BookmarkManager bookmarkManager, Location resultLocation)

When building the datatable, be sure the MaxLength for the field is set to -1 (this is the default, and means there is no max length)

@Dave, this worked! thanks.

Could you please guide me on the next steps?

  1. I want to compare each line in the excel file with my keywords dictionary which is also in excel. If the keyword is present in the particular row line, the same should be highlighted.
  2. Is there a way to append in a separate column the line number, paragraph number and page number from the source PDF file against each of the rows in excel?

Thanks!

I dont think it’s possible to do part 2. At least not in any way that I can think of that would be feasible for a reliable business solution.

For part 1, you can do this easily enough. Are you actually using a dictionary for the keywords? I would instead recommend saving the keywords to an array of strings. I’ll call this arrKeywords I will also assume you still have the datatable variable called dt1 from before. If you don’t, then use the read range activity from excel and save as dt1 (or whatever you want to name the variable)

Use a ‘for each row’ activity for dt1. The only activity in this for each row will be a single if activity.
In an if activity, put the following as the condition: arrKeywords.Contains(row.item(“PDFtext”).ToString, StringComparer.CurrentCultureIgnoreCase)
On the true side, use the ‘Set Range Color’ activity. The color can be anything you want, check out the options by typing color. and seeing what pops-up. The range should be: “A” + (dt1.indexof(row) + 1).ToString - thiis makes it so it colors the current cell in the “A” range. The +1 is because the datatable starts at index = 0, but excel starts at index = 1. You can provide another column if you want it to extend beyond column A. Just append + “Z” + (dt1.indexof(row) + 1).ToString (assuming you want columns A:Z highlighted).
Leave the false side blank

pdf to excel.xaml (17.8 KB)
dictionary.xlsx (15.2 KB)

@Dave Thanks for your guidance. I have followed your steps, however the items are not being highlighted in the excel. I’m attaching my workflow here, could you please take a look?
I also read in one of the forum topics that the string array can hold only 3 items - is it true? Do I need to use a list for this? I’m also attaching the dictionary file here.