Saving Extracted Text from PDFs in JSON Format Using UiPath

Hi there,

I am currently working on a UiPath project where I need to extract text from multiple PDF/JPG files, process the extracted text, and then save it in JSON format, with one text file per source file. I’ve already managed to extract the text from the PDFs and save it as individual text files; now I want to convert the extracted text into JSON before saving it. Each line of text should be represented as a separate JSON object in the file.

Could anyone provide guidance on how to achieve this? Specifically, I would like to know how to transform the extracted text into JSON format and how to correctly structure and save the JSON data in separate text files for each PDF. Any insights or sample code would be greatly appreciated!

This is the desired output format per JPG/PDF:

{
  "Location": "Gotham City",
  "Company": "Wayne Enterprises",
  "VAT Number": "123456789",
  "Slip Type": "Type B",
  "Date Time On Slip": "2023-08-16 14:30:00",
  "Applicants Name": "Bruce Wayne",
  "ABC Code": "SS694200",
  "Type Category": "Tourism",
  "Sub Category": "Leisure",
  "AAA Fee": 6.00,
  "ABCD Fee": 9.00,
  "Service Fee": 4.00,
  "SMS Fee": 2.00,
  "VAT": 69.42,
  "Total ABCDE Fees Non Vatable": 69.00,
  "Total ABCDEF Fees Including VAT": 420.69
}

Thanks Again!

Create a variable jsonObj of type Newtonsoft.Json.Linq.JObject and set its default value to New JObject.

For each value you want to add:

jsonObj.Add("Location", "Gotham City")

then jsonObj.ToString will give you your desired output.
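As a rough illustration of the steps above, here is a minimal sketch that builds the object and writes it to one file per source document. The file name `slip1.pdf`, the `ReceiptJson` output folder, and the module name are made up for the example, and the Newtonsoft.Json package is assumed to be available (it is what provides JObject).

```vbnet
Imports System.IO
Imports Newtonsoft.Json.Linq

Module SaveJsonSketch
    Sub Main()
        ' Hypothetical names, for illustration only.
        Dim sourceFile As String = "slip1.pdf"
        Dim outputFolder As String = Path.Combine(Path.GetTempPath(), "ReceiptJson")

        ' One JObject per PDF/JPG; each Add call creates one "key": value pair.
        Dim jsonObj As New JObject()
        jsonObj.Add("Location", "Gotham City")
        jsonObj.Add("Company", "Wayne Enterprises")
        ' Passing a Decimal instead of a String makes the fee a JSON number.
        jsonObj.Add("AAA Fee", 6.0D)

        ' Save the JSON in its own file, named after the source file.
        Directory.CreateDirectory(outputFolder)
        Dim outputPath As String = Path.Combine(outputFolder, Path.GetFileNameWithoutExtension(sourceFile) & ".json")
        File.WriteAllText(outputPath, jsonObj.ToString())
    End Sub
End Module
```

Inside a workflow, only the body of `Main` would go into Invoke Code; alternatively, the text from jsonObj.ToString can be passed to a Write Text File activity inside the loop that iterates over the source files.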

It’s easiest to do this in an Invoke Code activity, with one jsonObj.Add line per value. Make sure the jsonObj argument of the Invoke Code is set to In/Out.

(Actually, I accidentally left jsonObj as an In argument, and it still worked. I suspect this is because JObject is a reference type: the object exists outside the Invoke Code, so .Add still updates that same object.)
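The behaviour described here can be reproduced outside UiPath. The sketch below assumes that an In argument behaves like a ByVal parameter in plain VB.NET (an assumption, not something confirmed from UiPath's documentation); the module and method names are invented for the example.

```vbnet
Imports System
Imports Newtonsoft.Json.Linq

Module ByValReferenceSketch
    ' ByVal passes a copy of the reference, not a copy of the JObject itself.
    Sub AddLocation(ByVal target As JObject)
        target.Add("Location", "Gotham City")
    End Sub

    Sub Main()
        Dim jsonObj As New JObject()
        AddLocation(jsonObj)
        ' The caller's object now holds the property added inside the method.
        Console.WriteLine(jsonObj.Count) ' prints 1
    End Sub
End Module
```

Under that assumption, In/Out would only matter if the code assigned a brand-new object to the argument (for example jsonObj = New JObject) rather than modifying the existing one.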

Hi @Ray_Shadow,

We would also need to know the format of the extracted text. Have you already stored it as key-value pairs, or have you yet to extract the individual values?

If it is already extracted, how is it stored: in a DataTable or a Dictionary?

If you have yet to extract the necessary details, please provide a sample of the extracted text so that we can check how each value can be extracted and which steps are needed to bring it into the required format.

Here is a dummy Receipt example:

Gotham City
Wayne Enterprises
VAT123456789
Type B
2023/08/16 14:30:00
Mr.Bruce Wayne
SS694200
Category: Tourism
Sub Category: Leisure
Sub Type Category: Leisure
AAA Fee: 6.00
ABCD Fee: 9.00
Service Fee: 4.00
SMS Fee: 2.00
VAT: 69.42
Total ABCDE Fees Non Vatable: 69.00
Total ABCDEF Fees Including VAT: 420.69

Hope that helps
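For text laid out like the receipt above, one possible approach is sketched below. It assumes that the first seven lines always appear in the same order without labels, and that every later line has the form "Key: Value"; both are guesses based on this single sample. The names `ReceiptToJson` and `HeaderKeys` are invented for the example. Clean-up such as stripping the "VAT" and "Mr." prefixes, reformatting the date, or renaming "Category" to "Type Category" is left out.

```vbnet
Imports System
Imports System.Globalization
Imports Microsoft.VisualBasic
Imports Newtonsoft.Json.Linq

Module ReceiptToJsonSketch
    ' Assumed layout: the first seven lines carry no labels, so their keys come from position.
    Private ReadOnly HeaderKeys As String() = {
        "Location", "Company", "VAT Number", "Slip Type",
        "Date Time On Slip", "Applicants Name", "ABC Code"}

    Function ReceiptToJson(extractedText As String) As JObject
        Dim result As New JObject()
        Dim lines As String() = extractedText.Split({vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)

        For i As Integer = 0 To lines.Length - 1
            Dim line As String = lines(i).Trim()

            ' Unlabelled header lines: take the key from the line's position.
            If i < HeaderKeys.Length Then
                result(HeaderKeys(i)) = line
                Continue For
            End If

            ' Remaining lines are expected to look like "Key: Value"; skip anything else.
            Dim sep As Integer = line.IndexOf(": ", StringComparison.Ordinal)
            If sep < 0 Then Continue For
            Dim key As String = line.Substring(0, sep).Trim()
            Dim value As String = line.Substring(sep + 2).Trim()

            ' Store values that parse as numbers (the fees) as JSON numbers, the rest as strings.
            Dim number As Decimal
            If Decimal.TryParse(value, NumberStyles.Number, CultureInfo.InvariantCulture, number) Then
                result(key) = number
            Else
                result(key) = value
            End If
        Next

        Return result
    End Function

    Sub Main()
        ' Shortened version of the dummy receipt from this thread.
        Dim sample As String = String.Join(vbLf, {
            "Gotham City", "Wayne Enterprises", "VAT123456789", "Type B",
            "2023/08/16 14:30:00", "Mr.Bruce Wayne", "SS694200",
            "Category: Tourism", "AAA Fee: 6.00", "VAT: 69.42"})
        Console.WriteLine(ReceiptToJson(sample).ToString())
    End Sub
End Module
```

Splitting on a colon followed by a space, rather than on a bare colon, keeps a time such as 14:30:00 from being mistaken for a key-value separator if a date line ever appears further down the receipt.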

Please try this:

' Open the PDF; replace "PDFFilePath" with the actual file path.
Dim pdfDocument As New iText.Kernel.Pdf.PdfDocument(New iText.Kernel.Pdf.PdfReader("PDFFilePath"))
' Get the document's AcroForm, i.e. its fillable form fields.
Dim form As iText.Forms.PdfAcroForm = iText.Forms.PdfAcroForm.GetAcroForm(pdfDocument, True)
Dim fields As IDictionary(Of String, iText.Forms.Fields.PdfFormField) = form.GetFormFields()
' Print each field name together with its current value.
For Each fieldName As String In fields.Keys
    Dim field As iText.Forms.Fields.PdfFormField = fields(fieldName)
    Dim value As String = field.GetValueAsString()
    Console.WriteLine("Field name: " & fieldName & ", value: " & value)
Next
' Close the document to release the file.
pdfDocument.Close()

Cheers

Hello sir, I would like to know how to convert a PDF into JSON format. Please enlighten me on the packages and sequences needed; this is important for me right now. To rephrase my question: how do I convert a PDF (its information) into JSON format?
Thank you!