Regular Expressions in UiPath

I’m extracting data from a PDF file and saving it into a String variable, and the input looks like this:

"Labor and Job Works All Amount in INR S.Hrs Labor Disc/ Taxable CGST SGST IGST Total S# Labor Code Description of Service SAC Code Rate Rate Rate Labor UOM Charges Rebate Value (%) Amount (%) Amount (%) Amount Amount 1 40101NAF Tyre thread depth check all wheels 998714 0.40 160.00 160.00 - 0.00 - 0.00 18.00 28.80 188.80 Hrs 2 332281AF Hub end play adjustment of both wheel 998714 1.00 400.00 400.00 - 0.00 - 0.00 18.00 72.00 472.00 with diagnosis Hrs 3 353415AF Rear axle (2) hub play adjustment 998714 2.00 800.00 800.00 - 0.00 - 0.00 18.00 144.00 944.00 Hrs Door check strap on 1st door as rotary 0.20 4 720214AF door,LH replace (with hinged door 998714 Hrs 80.00 80.00 - 0.00 - 0.00 18.00 14.40 94.40 removed) 5 353411AF Rear axle hub assembly both remove/ 998714 4.50 1,800.00 1,800.00 - 0.00 - 0.00 18.00 324.00 2,124.00 install Hrs Sub Total 3,240.00 3,240.00 0.00 0.00 583.20 3,823.20 "

But I want the output in a different format. It should look like this:

{
" 1 40101NAF Tyre thread depth check all wheels 998714 0.40 160.00 160.00 - 0.00 - 0.00 18.00 28.80 188.80",
" Hrs",
" 2 332281AF Hub end play adjustment of both wheel 998714 1.00 400.00 400.00 - 0.00 - 0.00 18.00 72.00 472.00",
" with diagnosis Hrs",
" 3 353415AF Rear axle (2) hub play adjustment 998714 2.00 800.00 800.00 - 0.00 - 0.00 18.00 144.00 944.00",
" Hrs",
" Door check strap on 1st door as rotary 0.20",
" 4 720214AF door,LH replace (with hinged door 998714 Hrs 80.00 80.00 - 0.00 - 0.00 18.00 14.40 94.40",
" removed)",
" 5 353411AF Rear axle hub assembly both remove/ 998714 4.50 1,800.00 1,800.00 - 0.00 - 0.00 18.00 324.00 2,124.00",
" install Hrs",
}

I’m using this expression:
System.Text.RegularExpressions.Regex.Matches(read_pdf_file, "(.\s{10}\d{1,}.\n\s{0,}\d{1,3}\s{3,}\w{0,}\d{1,}\s{3,}.\n.)|(\d{1,3}\s{3,}\w{0,}\d{1,}\s{3,}.\n.)")

read_pdf_file - Input Variable

Please help me to find a solution.

Hi @Ankit_Chauhan

Are you using Read PDF Text or Read PDF with OCR? If the PDF is scanned, use Read PDF with OCR and try the Tesseract OCR engine. If it is not scanned, the Read PDF Text activity should work for you. If possible, please share the PDF file, provided the data is not confidential.

Hope it helps!!


Forget the PDF file. Just imagine I only have the input text in the same format: how can I solve this with Regex?

Hey @Ankit_Chauhan

I would use a VB.NET script. Something like this:

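' Assumes inputString and outputString are String arguments of the Invoke Code activity.
Dim currentLine As String = "{"
' One item: a number, a whitespace character and a word, then everything up to the next "number word" pair or the end of the text.
Dim itemPattern As String = "\d+\s[\w\d]+[\s\S]+?(?=\d+\s[\w\d]+|$)"
Dim matches As System.Text.RegularExpressions.MatchCollection = System.Text.RegularExpressions.Regex.Matches(inputString, itemPattern, System.Text.RegularExpressions.RegexOptions.Singleline)

' Put each match on its own quoted line.
For Each match As System.Text.RegularExpressions.Match In matches
    currentLine += Environment.NewLine & """" & match.Value.Replace(Environment.NewLine, " ").Trim() & ""","
Next

' Drop the comma after the last item.
If currentLine.EndsWith(",") Then
    currentLine = currentLine.Substring(0, currentLine.Length - 1)
End If

currentLine += Environment.NewLine & "}"
outputString = currentLine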

Hi @Ankit_Chauhan

Could you share the input text file?

Regards

I used this solution, but it is not working. You can see the result.

Why do you need an input file?
This is the input, held in a String variable:

"Labor and Job Works All Amount in INR S.Hrs Labor Disc/ Taxable CGST SGST IGST Total S# Labor Code Description of Service SAC Code Rate Rate Rate Labor UOM Charges Rebate Value (%) Amount (%) Amount (%) Amount Amount 1 40101NAF Tyre thread depth check all wheels 998714 0.40 160.00 160.00 - 0.00 - 0.00 18.00 28.80 188.80 Hrs 2 332281AF Hub end play adjustment of both wheel 998714 1.00 400.00 400.00 - 0.00 - 0.00 18.00 72.00 472.00 with diagnosis Hrs 3 353415AF Rear axle (2) hub play adjustment 998714 2.00 800.00 800.00 - 0.00 - 0.00 18.00 144.00 944.00 Hrs Door check strap on 1st door as rotary 0.20 4 720214AF door,LH replace (with hinged door 998714 Hrs 80.00 80.00 - 0.00 - 0.00 18.00 14.40 94.40 removed) 5 353411AF Rear axle hub assembly both remove/ 998714 4.50 1,800.00 1,800.00 - 0.00 - 0.00 18.00 324.00 2,124.00 install Hrs Sub Total 3,240.00 3,240.00 0.00 0.00 583.20 3,823.20 "

@Ankit_Chauhan
I tried once again, but it’s not so easy.
I used this code:

Dim output As New System.Text.StringBuilder()
output.AppendLine("{")

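' Take everything from the first number up to (but not including) "Sub Total".
Dim relevantContent As String = System.Text.RegularExpressions.Regex.Match(inputString, "\d+\s[\s\S]*?(?=Sub Total)").Value

' Split right after every "Hrs", so each chunk ends with the unit.
Dim entries As String() = System.Text.RegularExpressions.Regex.Split(relevantContent, "(?<=Hrs)")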

For i As Integer = 0 To entries.Length - 1
    Dim entry As String = entries(i).Trim()
    If Not String.IsNullOrEmpty(entry) Then
        entry = System.Text.RegularExpressions.Regex.Replace(entry.Trim(), "\s+", " ").Trim()
        entry = entry.Replace("""", "\""") 

        If entry.EndsWith("Hrs") Then
            entry = entry.Substring(0, entry.Length - 3).Trim()
            If i < entries.Length - 1 Then
                output.AppendLine($"    ""{entry}"",")
                output.AppendLine("    ""Hrs"",")
            Else
                output.AppendLine($"    ""{entry}"",")
                output.AppendLine("    ""Hrs""")
            End If
        Else
            If i < entries.Length - 1 Then
                output.AppendLine($"    ""{entry}"",")
            Else
                output.AppendLine($"    ""{entry}""")
            End If
        End If
    End If
Next

output.Append("}")

outputString = output.ToString()

and I got this result:
@"{
""1 40101NAF Tyre thread depth check all wheels 998714 0.40 160.00 160.00 - 0.00 - 0.00 18.00 28.80 188.80"",
""Hrs"",
""2 332281AF Hub end play adjustment of both wheel 998714 1.00 400.00 400.00 - 0.00 - 0.00 18.00 72.00 472.00 with diagnosis"",
""Hrs"",
""3 353415AF Rear axle (2) hub play adjustment 998714 2.00 800.00 800.00 - 0.00 - 0.00 18.00 144.00 944.00"",
""Hrs"",
""Door check strap on 1st door as rotary 0.20 4 720214AF door,LH replace (with hinged door 998714"",
""Hrs"",
""80.00 80.00 - 0.00 - 0.00 18.00 14.40 94.40 removed) 5 353411AF Rear axle hub assembly both remove/ 998714 4.50 1,800.00 1,800.00 - 0.00 - 0.00 18.00 324.00 2,124.00 install"",
""Hrs"",
}"

Is it more or less what you wanted to achieve?

It’s really helpful, but I need to add a break (a new line) after the numbers (1, 2, 3, 4, 5).

Do you have any idea how we can solve that?
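
One possible reading of this request is a line break in front of each numbered item. A minimal sketch of that idea follows. It assumes the text arrives as one flattened line and that every item starts with a 1 to 3 digit serial number followed by a labor code made of digits and then capital letters (as in 40101NAF). The names ItemBreakSketch and BreakBeforeItems are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ItemBreakSketch
    ' Hypothetical helper: replaces the whitespace in front of every
    ' "<serial> <labor code>" pair with a line break.
    Function BreakBeforeItems(text As String) As String
        ' \s+          the whitespace that gets replaced
        ' \d{1,3}\s+   serial number and the whitespace after it
        ' \d+[A-Z]+\b  labor code: digits followed by capital letters (assumed shape)
        Dim itemStart As String = "\s+(?=\d{1,3}\s+\d+[A-Z]+\b)"
        Return Regex.Replace(text, itemStart, Environment.NewLine).Trim()
    End Function

    Sub Main()
        Dim sample As String = "Amount 1 40101NAF Tyre thread depth check 998714 0.40 Hrs 2 332281AF Hub end play 998714 1.00 Hrs"
        Console.WriteLine(BreakBeforeItems(sample))
    End Sub
End Module
```

Inside an Invoke Code activity only the body of BreakBeforeItems would be needed, with the input variable in place of text. A fragment such as "Door check strap on 1st door as rotary 0.20" would stay at the end of item 3, because nothing in the flattened text marks it as part of item 4.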

@Ankit_Chauhan what do you mean by adding a break?
Can you type a one-line example of the output for clarification?

{
" 1 40101NAF Tyre thread depth check all wheels 998714 0.40 160.00 160.00 - 0.00 - 0.00 18.00 28.80 188.80",
" Hrs",
" 2 332281AF Hub end play adjustment of both wheel 998714 1.00 400.00 400.00 - 0.00 - 0.00 18.00 72.00 472.00",
" with diagnosis Hrs",
" 3 353415AF Rear axle (2) hub play adjustment 998714 2.00 800.00 800.00 - 0.00 - 0.00 18.00 144.00 944.00",
" Hrs",
" Door check strap on 1st door as rotary 0.20",
" 4 720214AF door,LH replace (with hinged door 998714 Hrs 80.00 80.00 - 0.00 - 0.00 18.00 14.40 94.40",
" removed)",
" 5 353411AF Rear axle hub assembly both remove/ 998714 4.50 1,800.00 1,800.00 - 0.00 - 0.00 18.00 324.00 2,124.00",
" install Hrs",
}

The output should be the same as this.
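
Because this output follows the line layout of the table itself, one option may be to keep the original lines instead of rebuilding them with a regex. A rough sketch follows. It assumes the String variable still contains the line breaks from the PDF (the \n in the expression from the question suggests it does), that the table body starts at the first line beginning with a serial number and a labor code, and that it ends at the "Sub Total" line. The names TableLinesSketch and ExtractRows are hypothetical.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TableLinesSketch
    ' Hypothetical helper: returns the table rows as quoted lines wrapped in braces.
    Function ExtractRows(pdfText As String) As String
        Dim rows As New List(Of String)
        Dim insideTable As Boolean = False

        ' Split on Windows or Unix line breaks.
        For Each textLine As String In Regex.Split(pdfText, "\r?\n")
            ' Stop at the totals row (assumed to mark the end of the table).
            If textLine.TrimStart().StartsWith("Sub Total") Then Exit For
            ' The body is assumed to start at the first "<serial> <labor code>" line.
            If Not insideTable AndAlso Regex.IsMatch(textLine, "^\s*\d{1,3}\s+\d+[A-Z]+\b") Then
                insideTable = True
            End If
            ' Keep every non-empty line of the body, quoted, with its leading spaces.
            If insideTable AndAlso textLine.Trim().Length > 0 Then
                rows.Add("""" & textLine.TrimEnd() & """")
            End If
        Next

        Return "{" & Environment.NewLine &
               String.Join("," & Environment.NewLine, rows) & Environment.NewLine & "}"
    End Function

    Sub Main()
        Dim sample As String = String.Join(Environment.NewLine, {
            "Labor and Job Works All Amount in INR",
            " 1 40101NAF Tyre thread depth check all wheels 998714 0.40 160.00",
            " Hrs",
            " Sub Total 3,240.00"})
        Console.WriteLine(ExtractRows(sample))
    End Sub
End Module
```

If the variable really holds a single line with no line breaks, this sketch has nothing to split on, and fragments such as "Door check strap on 1st door as rotary 0.20" cannot be told apart from the end of the previous item.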