Need to extract start date and end date details using regex

I am reading data from a PDF and need to extract all the start dates and end dates from a particular section titled "Home Health Episodes".
If the title is present, the date details below it should be extracted.

There may be multiple start and end dates. After getting each date individually, I have to compare each start date with another date, and similarly each end date with another date.

Hi @deepaksvg99

You can try the following approach to resolve this:

Assign activity -
titleLine = Regex.Match(pdfText, "Home Health Episodes.*").Value

Assign activity -
dateMatches = Regex.Matches(titleLine, "Start Date\s+(\d{2}/\d{2}/\d{4})\s+End Date\s+(\d{2}/\d{2}/\d{4})")

If activity -
dateMatchFound = dateMatches.Count > 0

startDate = dateMatches(0).Groups(1).Value
endDate = dateMatches(0).Groups(2).Value

Hope this helps,
Best Regards.

Hi @deepaksvg99 - Can you please share the format of the text that you get after reading the PDF? This would help in providing the correct regex.

pdftext.txt (210 Bytes)

Hope this will help you

Hi @deepaksvg99

Can you try these steps:

  1. Use the “Read PDF Text” activity to read the text from the PDF file.
  2. Use the “Matches” activity to extract the row that contains the title “Home Health Episodes”. You can use a regular expression to match the title.
  3. Once you have the row, use the “Matches” activity again to extract the start and end dates. You can use regular expressions to match the date format in the row.
  4. You can then compare the start and end dates with another date using the “DateTime.ParseExact” method or other date/time functions in UiPath.
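Step 4 can be sketched roughly as below. This is only an illustration, not part of the attached workflows: the date strings and the reference date are hypothetical, and the "MM/dd/yyyy" format is an assumption about how the PDF prints its dates.

```vbnet
Imports System
Imports System.Globalization

Module CompareDatesSketch
    Sub Main()
        ' Hypothetical values: in a workflow these would come from the regex matches.
        Dim startText As String = "01/15/2024"
        Dim endText As String = "03/14/2024"
        ' Hypothetical reference date to compare against.
        Dim referenceDate As New DateTime(2024, 2, 1)

        ' "MM/dd/yyyy" is an assumption; use "dd/MM/yyyy" if the PDF is day-first.
        Dim startDate As DateTime = DateTime.ParseExact(startText, "MM/dd/yyyy", CultureInfo.InvariantCulture)
        Dim endDate As DateTime = DateTime.ParseExact(endText, "MM/dd/yyyy", CultureInfo.InvariantCulture)

        ' Once parsed, DateTime values can be compared with the usual operators.
        Console.WriteLine(startDate <= referenceDate)
        Console.WriteLine(endDate >= referenceDate)
    End Sub
End Module
```

In UiPath the two ParseExact calls would typically go into Assign activities and the comparisons into an If condition.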

@deepaksvg99 Please check the attached workflow and see whether it helps.

  • Use the expression below in an If activity to check whether Home Health Episodes is present:
System.Text.RegularExpressions.Regex.IsMatch(Input,"Home\s+Health\s+Episodes")
  • Apply the expression below to get the start date and end date listed below Home Health Episodes:
System.Text.RegularExpressions.Regex.Matches(Input,"(?<=Home\s+Health\s+Episodes([\n\r]+.*){2})(\d{2}/\d{2}/\d{4})")

PDFText.zip (3.2 KB)

Hi,

How about the following?

m = System.Text.RegularExpressions.Regex.Match(strPdf,"Home\s+Health\s+Episodes\s*\nStart\s+Date\s+End\s+Date.*\n(?<STARTDATE>\d{2}/\d{2}/\d{4})\s+(?<ENDDATE>\d{2}/\d{2}/\d{4})")

Then

m.Groups("STARTDATE").Value
m.Groups("ENDDATE").Value

Sample20240411-1aL.zip (2.7 KB)

Regards,

Hi Arjunshenoy,
I tried your code. It finds the Home Health Episodes title but not the start date.

Hi Yoichi,
Thanks for your response.
Will this work if there are multiple dates?
I think it will collect only the first start date and end date.

pdftext.txt (350 Bytes)

Hi Usha,
Thanks for your response
If there are multiple dates then will this work?
pdftext.txt (350 Bytes)

Hi,

Can you try the following?

mc = System.Text.RegularExpressions.Regex.Matches(strPdf,"(?<=Home\s+Health\s+Episodes\s*\nStart\s+Date\s+End\s+Date.*\n|\G)(?<STARTDATE>\d{2}/\d{2}/\d{4})\s+(?<ENDDATE>\d{2}/\d{2}/\d{4}).*\n")

Then iterate over mc using For Each. For each item m:

m.Groups("STARTDATE").Value
m.Groups("ENDDATE").Value

mc is of type MatchCollection.
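As a rough sketch of that loop in plain VB.NET: the sample text below is made up for illustration (the real layout depends on what Read PDF Text returns), and only the regex itself is taken from the expression above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IterateMatchesSketch
    Sub Main()
        ' Made-up sample text; the real PDF output may look different.
        Dim strPdf As String = "Home Health Episodes" & vbLf &
            "Start Date End Date" & vbLf &
            "01/01/2024 02/29/2024" & vbLf &
            "03/01/2024 04/29/2024" & vbLf

        ' Same pattern as in the expression above.
        Dim mc As MatchCollection = Regex.Matches(strPdf, "(?<=Home\s+Health\s+Episodes\s*\nStart\s+Date\s+End\s+Date.*\n|\G)(?<STARTDATE>\d{2}/\d{2}/\d{4})\s+(?<ENDDATE>\d{2}/\d{2}/\d{4}).*\n")

        ' Each item in the collection is a Match with the two named groups.
        For Each m As Match In mc
            Console.WriteLine(m.Groups("STARTDATE").Value & " to " & m.Groups("ENDDATE").Value)
        Next
    End Sub
End Module
```

In UiPath, the For Each activity would use mc as the collection with TypeArgument set to System.Text.RegularExpressions.Match.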

Sample20240411-1aLv2.zip (3.2 KB)

Regards,

Hi Yoichi,
Thank you for your response
The code you provided works, but only for one, two, or three rows.
If the number of rows is dynamic, for example more than 5 rows, it does not work because it picks up only the first two dates.
Sometimes there is one row and sometimes there are multiple rows (more than 5), so the bot should consider all values rather than being restricted to one or two.
pdftext.txt (619 Bytes)

pdftextdata.txt (400 Bytes)
In this file the output contains an extra random number, and because of it the bot does not consider the dates below it.
As a result it takes only three dates rather than four. Please refer to the latest pdftextdata file provided.

In such scenarios, is there any way to get past that line and continue picking up the dates?

Hi,

How about the following? This extracts all dates that appear at the beginning of a line. (It ignores the headers. If the headers need to be taken into account, the rule that defines the table range has to be clarified.)

mc = System.Text.RegularExpressions.Regex.Matches(strPdf,"(?m)(?<STARTDATE>^\d{2}/\d{2}/\d{4}) +(?<ENDDATE>\d{2}/\d{2}/\d{4}) *(?<EARLIEST>\d{2}/\d{2}/\d{4})? *(?<LATEST>\d{2}/\d{2}/\d{4})?")
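Because the EARLIEST and LATEST groups are optional, it may help to check Group.Success before reading them. A minimal sketch, with made-up rows in which the second row has only the two mandatory dates:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OptionalGroupsSketch
    Sub Main()
        ' Made-up rows for illustration; the second row has no third or fourth date.
        Dim strPdf As String = "01/01/2024 02/29/2024 01/05/2024 02/20/2024" & vbLf &
            "03/01/2024 04/29/2024" & vbLf

        ' Same pattern as in the expression above; (?m) lets ^ match at each line start.
        Dim pattern As String = "(?m)(?<STARTDATE>^\d{2}/\d{2}/\d{4}) +(?<ENDDATE>\d{2}/\d{2}/\d{4}) *(?<EARLIEST>\d{2}/\d{2}/\d{4})? *(?<LATEST>\d{2}/\d{2}/\d{4})?"

        For Each m As Match In Regex.Matches(strPdf, pattern)
            ' An optional group that did not take part in the match has Success = False.
            Dim earliest As String = If(m.Groups("EARLIEST").Success, m.Groups("EARLIEST").Value, "(none)")
            Console.WriteLine(m.Groups("STARTDATE").Value & " | " & m.Groups("ENDDATE").Value & " | " & earliest)
        Next
    End Sub
End Module
```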

Sample20240411-1aLv3.zip (4.0 KB)

Regards,

Hi Yoichi,
Sorry for the late reply.
The code you shared works well. However, it does not work when the input has the format of the attached file:
pdftext_2.txt (2.1 KB)
In this input file we need to extract only the start dates and end dates under Home Health Episodes and ignore all other details.
Thanks in advance.

Hi,

It is probably better to extract the Home Health Episodes section first, then apply the above regex to that section.
Can you try the following sample?

mcPart = System.Text.RegularExpressions.Regex.Matches(strPdf,"Home Health Episodes\s*\nStart\s+Date\s+End\s+Date.*\n(\d.*\n)+")
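The two-step idea could look roughly like the sketch below. The attached sample may be built differently. The sample text and the "Other Section" heading are made up, and rowPattern is a shortened version of the earlier line-start regex that keeps only the start and end dates.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SectionThenDatesSketch
    Sub Main()
        ' Made-up text with an unrelated table after the section of interest.
        Dim strPdf As String = "Home Health Episodes" & vbLf &
            "Start Date End Date" & vbLf &
            "01/01/2024 02/29/2024" & vbLf &
            "03/01/2024 04/29/2024" & vbLf &
            "Other Section" & vbLf &
            "Start Date End Date" & vbLf &
            "05/01/2023 06/29/2023" & vbLf

        ' Step 1: title line, header line, then every following line that starts with a digit.
        Dim sectionPattern As String = "Home Health Episodes\s*\nStart\s+Date\s+End\s+Date.*\n(\d.*\n)+"
        ' Step 2: the first two dates of each row inside the extracted section.
        Dim rowPattern As String = "(?m)(?<STARTDATE>^\d{2}/\d{2}/\d{4}) +(?<ENDDATE>\d{2}/\d{2}/\d{4})"

        For Each part As Match In Regex.Matches(strPdf, sectionPattern)
            For Each m As Match In Regex.Matches(part.Value, rowPattern)
                Console.WriteLine(m.Groups("STARTDATE").Value & " to " & m.Groups("ENDDATE").Value)
            Next
        Next
    End Sub
End Module
```

With this sample, only the two rows under Home Health Episodes should be picked up, because the section match stops at the first line that does not start with a digit. Note that the section pattern expects every data row to end with a newline, so a last row at the very end of the text without one would be missed.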

Sample20240411-1aLv4.zip (4.9 KB)

Regards,

Thank you Yoichi.
It's working.
