I need to read data from a PDF and extract all the start dates and end dates from the rows under a particular title, “Home Health Episodes”.
If the title is present, extract the date details.
There may be multiple start and end dates. After getting each date individually, I have to compare each start date with another date, and likewise each end date with another date.
Use the “Read PDF Text” activity to read the text from the PDF file.
Use the “Matches” activity to extract the row that contains the title “Home Health Episodes”. You can use a regular expression to match the title.
Once you have the row, use the “Matches” activity again to extract the start and end dates. You can use regular expressions to match the date format in the row.
You can then parse the start and end dates with the “DateTime.ParseExact” method and compare them with another date using the usual comparison operators or other date/time functions in UiPath.
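The parse-and-compare step can be sketched in VB.NET roughly as follows. This is only an illustration: the sample values and the "MM/dd/yyyy" format are assumptions (the PDF might use dd/MM/yyyy instead), and in a workflow the strings would come from the regex groups rather than being hard-coded.

```vb
Imports System
Imports System.Globalization

Module CompareDates
    Sub Main()
        ' Hypothetical values; in the workflow these would come from the Matches output.
        Dim startText As String = "01/15/2023"
        Dim referenceText As String = "02/01/2023"

        ' Assumption: the dates are written as MM/dd/yyyy. Adjust the format if the PDF differs.
        Dim startDate As DateTime = DateTime.ParseExact(startText, "MM/dd/yyyy", CultureInfo.InvariantCulture)
        Dim referenceDate As DateTime = DateTime.ParseExact(referenceText, "MM/dd/yyyy", CultureInfo.InvariantCulture)

        ' DateTime values can be compared directly with <, >, and =.
        If startDate < referenceDate Then
            Console.WriteLine("Start date is earlier than the reference date")
        ElseIf startDate > referenceDate Then
            Console.WriteLine("Start date is later than the reference date")
        Else
            Console.WriteLine("Both dates are the same")
        End If
    End Sub
End Module
```

In UiPath the same expressions can be placed in Assign and If activities; the end date can be handled the same way.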
m = System.Text.RegularExpressions.Regex.Match(strPdf,"Home\s+Health\s+Episodes\s*\nStart\s+Date\s+End\s+Date.*\n(?<STARTDATE>\d{2}/\d{2}/\d{4})\s+(?<ENDDATE>\d{2}/\d{2}/\d{4})")
mc = System.Text.RegularExpressions.Regex.Matches(strPdf,"(?<=Home\s+Health\s+Episodes\s*\nStart\s+Date\s+End\s+Date.*\n|\G)(?<STARTDATE>\d{2}/\d{2}/\d{4})\s+(?<ENDDATE>\d{2}/\d{2}/\d{4}).*\n")
Hi Yoichi,
Thank you for your response
The code you provided works fine; however, it only handles one, two, or three rows.
If the number of rows is dynamic, for example more than 5 rows, it does not work, because it considers only the first two dates.
Sometimes there is one row and sometimes multiple rows (possibly more than 5), so the bot should consider all the values rather than being restricted to one or two. pdftext.txt (619 Bytes)
pdftextdata.txt (400 Bytes)
In this file, the output contains an extra random number, and because of it the bot does not consider the dates below it.
As a result, it takes only three dates rather than 4. Please refer to the latest pdftextdata file provided.
In such scenarios, is there any way to overcome that and continue to consider the remaining dates?
How about the following? This extracts all the dates from the beginning of each line. (This ignores headers. If you need to take the headers into account, it’s necessary to clarify the rule for the table range.)
mc = System.Text.RegularExpressions.Regex.Matches(strPdf,"(?m)(?<STARTDATE>^\d{2}/\d{2}/\d{4}) +(?<ENDDATE>\d{2}/\d{2}/\d{4}) *(?<EARLIEST>\d{2}/\d{2}/\d{4})? *(?<LATEST>\d{2}/\d{2}/\d{4})?")
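For reference, here is a minimal sketch of how every match in mc could be read and compared, one row at a time. The sample text, the reference date, and the "MM/dd/yyyy" format are assumptions for illustration only; the pattern is the one shown above.

```vb
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ExtractAllDates
    Sub Main()
        ' Hypothetical sample text standing in for the Read PDF Text output.
        Dim strPdf As String = "01/01/2023 02/28/2023" & vbLf &
                               "03/01/2023 04/29/2023 03/05/2023 04/20/2023" & vbLf

        Dim pattern As String = "(?m)(?<STARTDATE>^\d{2}/\d{2}/\d{4}) +(?<ENDDATE>\d{2}/\d{2}/\d{4}) *(?<EARLIEST>\d{2}/\d{2}/\d{4})? *(?<LATEST>\d{2}/\d{2}/\d{4})?"
        Dim mc As MatchCollection = Regex.Matches(strPdf, pattern)

        ' Hypothetical date to compare against.
        Dim referenceDate As New DateTime(2023, 3, 1)

        ' One Match per row, however many rows there are.
        For Each m As Match In mc
            ' Assumption: dates are MM/dd/yyyy.
            Dim startDate As DateTime = DateTime.ParseExact(m.Groups("STARTDATE").Value, "MM/dd/yyyy", CultureInfo.InvariantCulture)
            Dim endDate As DateTime = DateTime.ParseExact(m.Groups("ENDDATE").Value, "MM/dd/yyyy", CultureInfo.InvariantCulture)

            Console.WriteLine("Start " & startDate.ToString("yyyy-MM-dd") & ", end " & endDate.ToString("yyyy-MM-dd"))
            Console.WriteLine("  starts before reference date: " & (startDate < referenceDate).ToString())

            ' EARLIEST and LATEST are optional groups, so check Success before using them.
            If m.Groups("EARLIEST").Success Then
                Console.WriteLine("  earliest: " & m.Groups("EARLIEST").Value)
            End If
        Next
    End Sub
End Module
```

In a UiPath workflow, the loop would typically be a For Each activity over mc with the item type set to System.Text.RegularExpressions.Match, and the comparisons would go in If activities inside it.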
Hi Yoichi,
Sorry for the late reply.
The code that was shared works well. However, it does not work if the input is structured as in the shared input file pdftext_2.txt (2.1 KB).
In that input file, we need to extract only the start date and end date under Home Health Episodes, and leave out all the other details.
Thanks in advance