Urgent: scraping PDFs

Hello everyone. I have a data table, and I want to pass variables into one of its columns so that I can build a changeable regex and extract values.
Note: does anyone have a solution for unstructured PDFs? The data inside the table is not stable, so I had to resort to regex, but I find it hard to scrape because the data is dynamic rather than static: the number of items is not constant.
I want to scrape the table into a CSV file. Attachments: JoeyTribbiani_01102020_281092.pdf (25.7 KB), MonicaGeller_09052020_87654.pdf (15.9 KB)


This can be done using regex. Do as follows:

  1. Read the PDF using the Read PDF Text activity (let's say the output is mypdf).
  2. Remove the header details, since we are only focusing on the table. To remove them, use this pattern: “(.INVOICE.\sInvoice.\sInvoice.\sDue.*)”
  3. To extract the table, use a simple method: build a CSV string.
  4. Identify the table headers. Use an Assign activity with mypdf on the left side: mypdf = Regex.Replace(mypdf,"(?<=ID)(\s{1})|(?<=DESCRIPTION)(\s{1})|(?<=QTY)(\s{1})|(?<=PRICE)(\s{1})","!")
  5. We will use “!” as our delimiter.
  6. Identify the ID column. Use an Assign activity with mypdf on the left side: mypdf = Regex.Replace(mypdf,"(?<=^\d{2})(\s{1})","!",RegexOptions.Multiline) (Multiline is needed so that ^ matches at the start of every line, not only at the start of the string.)
  7. mypdf = Regex.Replace(mypdf,"(?<=\d{1,}.\d{2})(\b \b)|(\b \b)(?=\d{1,}.\d{2})","!") This identifies the Total and Price columns.
  8. mypdf = Regex.Replace(mypdf,"((\s{1})(?=\d{1,}!.\d{1,}..\d{2}))","!") This identifies the Item column.
  9. mypdf = Regex.Replace(mypdf,"(\s)(?=^[A-Z]{3})","",RegexOptions.Multiline) This finds the line breaks inside a wrapped row and removes them, so the row is combined into a single line (again, ^ needs Multiline here).
  10. Now use the Generate Data Table activity: pass in the CSV string you created, use the first record as the header, and set the delimiter chosen in step 5.
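
If it helps to prototype the idea outside UiPath first, the steps above can be sketched roughly like this in Python. This is not the workflow itself, only an illustration of "insert a delimiter with regex, then parse the result as delimited text". The sample text, the five-column layout and the name table_text_to_rows are assumptions for illustration, not taken from the attached PDFs. Python's re module does not accept the variable-length lookbehinds used in step 7, so the sketch uses capture groups instead.

```python
import csv
import io
import re

DELIM = "!"  # same delimiter as step 5

# Assumed row layout (hypothetical): 2-digit ID, free-text description,
# integer quantity, then price and total with two decimals.
ROW = re.compile(r"^(\d{2})\s+(.+?)\s+(\d+)\s+(\d+\.\d{2})\s+(\d+\.\d{2})$")
REPL = r"\1!\2!\3!\4!\5"  # rebuild the row with "!" between the fields

# Made-up table text, standing in for the PDF text after the header is removed.
SAMPLE = """ID DESCRIPTION QTY PRICE TOTAL
01 Blue widget 2 10.00 20.00
02 Large red
gadget 1 5.50 5.50
"""


def table_text_to_rows(table_text):
    """Turn plain table text into a list of rows (lists of strings)."""
    lines = [ln.strip() for ln in table_text.splitlines() if ln.strip()]
    header, body = lines[0], lines[1:]

    # Merge wrapped rows: a line that does not start with an ID is treated
    # as the continuation of the previous row (an assumption about layout).
    merged = []
    for ln in body:
        if re.match(r"\d{2}\s", ln) or not merged:
            merged.append(ln)
        else:
            merged[-1] += " " + ln

    # Header names are single words, so whitespace can become the delimiter.
    delimited = [DELIM.join(header.split())]
    # Rows that do not match the pattern are left unchanged (one field).
    delimited.extend(ROW.sub(REPL, ln) for ln in merged)

    reader = csv.reader(io.StringIO("\n".join(delimited)), delimiter=DELIM)
    return list(reader)


rows = table_text_to_rows(SAMPLE)
```

Rows that do not fit the assumed layout come back as a single undelimited field, which makes them easy to spot before the patterns are carried over into the Assign activities.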

For this entire flow you need String Split and Regex Replace. To build it, use the Assign and Generate Data Table activities.

Please test it on at least 30 invoices and harden the patterns. The patterns above are not well tested, but it can be done using regex.
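
One possible way to do that testing is to run a few sanity checks over every parsed table and list the rows that look wrong. The sketch below is in Python and assumes each table is a list of rows with a header row containing QTY, PRICE and TOTAL columns. It also assumes that TOTAL equals QTY times PRICE, which may not hold for your invoices (discounts, taxes). The name find_bad_rows is hypothetical.

```python
from decimal import Decimal, InvalidOperation


def find_bad_rows(rows):
    """Return (row_index, reason) pairs for rows that look mis-parsed."""
    header = rows[0]
    problems = []
    for i, row in enumerate(rows[1:], start=1):
        # A row with the wrong number of fields means a delimiter was
        # missed or inserted in the wrong place.
        if len(row) != len(header):
            problems.append((i, "wrong field count"))
            continue
        record = dict(zip(header, row))
        try:
            qty = Decimal(record["QTY"])
            price = Decimal(record["PRICE"])
            total = Decimal(record["TOTAL"])
        except (KeyError, InvalidOperation):
            problems.append((i, "non-numeric value"))
            continue
        # Assumed business rule: total = quantity * price.
        if qty * price != total:
            problems.append((i, "QTY * PRICE != TOTAL"))
    return problems
```

Running this over all the test invoices and reviewing the reported rows shows quickly which of the patterns still needs hardening.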

Finally, I highly recommend using the UiPath Document Understanding framework. It's a lot easier.

Thanks for your support. Could you please share the workflow with me? Thank you.