Read PDF text line by line instead of block by block

Hello,

When using the Read PDF Text activity, the text is sometimes acquired line by line and sometimes in blocks.
In Windows, this behaviour depends on which application is set as the default for PDF files.
If Edge is selected, the text is sometimes recognized in blocks.
If Adobe is selected, the same document that was recognized in blocks is read line by line.

Is it possible to make this kind of setting in UiPath as well, or to select Adobe to open or read the PDF?
I would like to read the PDF text line by line instead of block by block.

Best regards,
Jo

Hi @Joachim ,

Could you let us know what you mean by "block by block", and could you show an example of it? We do observe differences in the output depending on whether the PreserveFormatting property is set to True or False.

If the difference is not caused by the PreserveFormatting option, an example or sample would help us draw further conclusions.

Hello superman,
I have added one screenshot for block by block (with Edge as the default application) and one for line by line (with Adobe).



As far as I remember, I tried the PreserveFormatting option and it did not work, but I will verify this.
Best regards,
Jo

Hello superman,
It does not work with PreserveFormatting.
So where is it defined how a PDF document is opened in UiPath (Edge, Adobe, something else …?)
Or is there another option to read the text line by line?
Best regards, Jo

@Joachim ,

We normally use the Read PDF Text activity to read the text data from a PDF document. There is no need to open the document in Edge or Chrome before reading it, because the read operation runs in the background.

Have you tried using the Read PDF Text activity without PreserveFormatting enabled? It returns the output as a String, which you could write to a text file.

Also, if you could provide a sample PDF similar to your input document, together with the expected output for that document, we would be able to understand the issue clearly and suggest appropriate actions.

Hello superman,
Thanks for your reply.
The normal way to open the PDF document is, as you mentioned, with Read PDF Text and without PreserveFormatting.
Regarding a test document, I need to check this, because all documents with this issue contain personal data that must not be disclosed.
Best regards, Jo

Hello superman,
I have found a document that is read block by block when using Adobe and line by line with Edge (just the other way round).
But maybe it gives you an idea of my problem.
Best regards, Jo
testdocument.pdf (881.1 KB)

@Joachim ,

Could you let us know what the expected output should be in your case when this document is read as text?

Below is the output text when using the Read PDF Text activity (without PreserveFormatting enabled).

Below is the output text when using the Read PDF Text activity (with PreserveFormatting enabled).

Hello superman,
I hand the text to a VBA macro.
The VBA macro splits the text into lines at Chr(10).
PDF documents that are read block by block lack the line break.
Best regards, Jo

@Joachim ,

If the required result is line-by-line output, we can split the text on the newline character and check whether the data ends up in separate lines. Alternatively, a Regex split can be tried.
Check whether either of the expressions below works in your case for splitting the text into lines (a String array):

Split(strInput,Environment.NewLine)
Regex.Split(strInput,"\r?\n")

Here, strInput is the PDF Text output.
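
One detail that may matter here: on Windows, Environment.NewLine is the two-character sequence CR+LF, so the first expression does not split text whose lines end in a bare LF, while the Regex version handles both. The following is only a minimal sketch of that idea, not a confirmed fix for this document. The module name, the SplitLines helper and the sample text are hypothetical, and the final Join assumes the macro splits on Chr(10) as described above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SplitLinesSketch
    ' Hypothetical helper: splits on CR+LF, a bare LF, or a bare CR
    ' and drops lines that are empty or contain only whitespace.
    Function SplitLines(text As String) As String()
        Dim parts As String() = Regex.Split(text, "\r\n|\n|\r")
        Return Array.FindAll(parts, Function(s) s.Trim().Length > 0)
    End Function

    Sub Main()
        ' Made-up sample standing in for the Read PDF Text output;
        ' it mixes CR+LF and bare LF line endings on purpose.
        Dim strInput As String = "Line 1" & vbCrLf & "Line 2" & vbLf & "Line 3"

        Dim lines As String() = SplitLines(strInput)
        For Each item As String In lines
            Console.WriteLine(item)
        Next

        ' Assumption: the macro splits on Chr(10), so the lines are
        ' rejoined with a bare LF into a single String argument.
        Dim normalized As String = String.Join(vbLf, lines)
        Console.WriteLine(normalized.Split(ControlChars.Lf).Length)
    End Sub
End Module
```

In a workflow, only the Regex.Split and String.Join expressions would be needed, for example in Assign activities. Note that this only normalizes line breaks that already exist in the text; it cannot restore breaks that are missing from the extracted block.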


Hello superman,
I think I started with Split(strInput,Environment.NewLine) but then had difficulties handing the string array to the VBA macro as a parameter.
Best regards, Jo

@Joachim ,

I believe there is more to this than just getting the text and splitting it into an array.

Could you give us an overview of what is to be done with the PDF text?

Also, we would like to know what the VBA macro is used for. Is it only for the split?

Let us know what end output you want and in what form you want to receive it, and we can work towards a solution if possible.

Hello superman,
Because we do not have a Document Understanding licence, the VBA macro basically parses the PDF text.
It would be great if you could give me a hint on how to hand a string array (after the split) to the VBA macro.
As I mentioned before, I started with the split but then failed to hand the string array to the VBA macro.
Best regards, Jo

@Joachim ,

Maybe the post below will help you:

Hello superman,
Thanks for the link. It helps with passing the string array to the macro.
The only issue I have now is that only the first 30 elements of the array arrive in the macro.
It would also be great if you could give me a hint on how to hand a string array back from the macro to UiPath.
Best regards,
Jo