I have one PDF with 100 pages that contains data for car companies.
Tata data from page 1 to 7
Maruthi data from page 8 to 22
Tata from page 23 to 26
Kia from page 27 to 99
Maruthi on page 100.
Can I split it into 4 documents with the correct pages?
Doc1 as 7 pages
Doc2 as 15 pages
Doc3 as 77 pages
Doc4 as 1 page
Have you tried the Extract PDF Page Range activity in
We can specify the page ranges to be extracted and convert those pages into a single document.
Could you check it out and let us know if it meets the requirement?
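As a rough illustration of the page ranges involved, here is a minimal Python sketch (not a UiPath workflow) that turns the blocks listed in the question into range strings and page counts. The `blocks` list and the `to_range_strings` helper are hypothetical names, and the "first-last" string form is an assumption about what the activity's range input expects, so please verify it against the activity documentation.

```python
# Blocks as listed in the question: (company, first page, last page), 1-based inclusive.
blocks = [
    ("Tata", 1, 7),
    ("Maruthi", 8, 22),
    ("Tata", 23, 26),
    ("Kia", 27, 99),
    ("Maruthi", 100, 100),
]

def to_range_strings(blocks):
    """Return (company, range_text, page_count) for each block."""
    result = []
    for company, first, last in blocks:
        # A single page is written as "100", a span as "8-22" (assumed format).
        range_text = str(first) if first == last else f"{first}-{last}"
        result.append((company, range_text, last - first + 1))
    return result

rows = to_range_strings(blocks)
for company, range_text, count in rows:
    print(f"{company}: pages {range_text} ({count} pages)")
```

The listed pages form five contiguous blocks of 7, 15, 4, 73, and 1 pages. The 77 pages mentioned for Doc3 correspond to the Tata block (23 to 26) and the Kia block (27 to 99) taken together.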
To split one PDF into multiple PDFs, I am reading data from the PDF, where on the first page it tells
So I need to put 9 pages in one PDF.
On page 10 it will be like 1-4,
so pages 10, 11, 12, 13 need to be in one PDF.
I assume the first page contains the details of the page ranges to be extracted.
In that case, we can read the first page of the PDF, then use regex to extract the page range.
We would need sample data from the PDF, after it is in text form, to help you further.
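To make the regex idea concrete, here is a minimal Python sketch. The marker format is an assumption (either "1 of 9" or "1-4", based on the examples mentioned in this thread), and `PAGE_MARKER` and `find_page_marker` are hypothetical names. The real pattern would need to be tuned against the actual extracted text.

```python
import re

# Assumed marker formats: "<current> of <total>" or "<current>-<total>".
PAGE_MARKER = re.compile(r"\b(\d{1,3})\s*(?:of|-)\s*(\d{1,3})\b", re.IGNORECASE)

def find_page_marker(text):
    """Return (current, total) from the first marker found in text, or None."""
    match = PAGE_MARKER.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(find_page_marker("Invoice\nPage 1 of 9\nDate"))
print(find_page_marker("1-4"))
print(find_page_marker("no marker here"))
```

Note that the hyphen form can also match parts of dates (for example "12-05" in "12-05-2022"), so if the documents use "of", it may be safer to drop the hyphen alternative.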
If I use Microsoft, Tesseract, or OmniPage OCR, the data is not extracted properly. Which OCR is best for extracting data from a PDF?
If the PDF is a scanned PDF, i.e. it contains images, then it would be difficult to extract data if the image quality is very low.
However, give the UiPath OCR a try and let us know. You may need to activate the Enterprise Trial in Orchestrator, and it needs the Document Understanding API key.
Yes, I have Enterprise, and the PDF quality is low. Can I read the PDF with an OCR activity, without using the Document Understanding concept?
@MitheshBolla, we don’t need to use the full DU concept.
We just need to extract the data using OCR.
As you have already mentioned that some of the OCR engines do not give the expected output, the remaining OCR to try is UiPath Document OCR.
Also, when using the other OCR engines, keep the Profile as Scan and try different Scale values to check whether the results improve.
@MitheshBolla Yes. Do give it a try and let us know the outcome.
Yes, this extracted about 90% correctly, but the data is displayed on 3 lines.
The date comes in between "page 1 of 2", as they are both on the next line.
@MitheshBolla Could you provide us with this data in a text file?
Also try other pages where you have to extract the page range, and let us know if it is in the same format.
It changes when the document has another box.
Does the page range always appear in the table?
Also, is it possible to share the PDF file?
I have sent it to your inbox separately.
Apologies for the late reply.
It was possible to match the page ranges on multiple pages of the images. But it also needs to be verified whether the regex used will be able to detect them in other samples as well.
Below is the regex:
You can use the Matches activity to get all the matches. You can then use its output to check the total count or the values that were matched, using the expression below:
The PageRangeMatches variable is the output of
Also, note that the page range does not continue according to the number of PDF pages; rather, each split document's numbering starts again from page 1. We also need to confirm whether this is the case for all the documents that you receive.
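Because each document's numbering restarts at 1, the printed marker has to be mapped back to positions in the whole PDF before extracting. The following Python sketch shows one way to do that, assuming a marker has already been read for every page as a `(current, total)` tuple (or `None` where OCR found nothing). The `split_ranges` name and the rule "a new document starts wherever the current page is 1" are assumptions for illustration, not the logic of the attached workflow.

```python
def split_ranges(markers):
    """Map per-page markers to (start, end) ranges in whole-PDF page numbers.

    markers[i] belongs to PDF page i + 1 and is (current, total) or None.
    A new document is assumed to start wherever current == 1.
    """
    ranges = []
    start = None
    for page_no, marker in enumerate(markers, start=1):
        if marker is not None and marker[0] == 1:
            if start is not None:
                ranges.append((start, page_no - 1))  # close the previous document
            start = page_no
        elif start is None:
            start = page_no  # pages before the first detected "1" form one group
    if start is not None:
        ranges.append((start, len(markers)))
    return ranges

markers = [(1, 2), (2, 2), (1, 3), (2, 3), (3, 3), (1, 1)]
print(split_ranges(markers))
```

With this mapping, the second document is extracted with the range 3 to 5 of the full PDF rather than the "1 of 3" style numbers printed on its pages. Using the printed numbers directly as the range would pull pages from the start of the PDF again.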
Yes, all documents are like that, and with your regex I got the value.
I am using Extract PDF Page Range. The first PDF is extracted correctly, but from the 2nd onward it extracts the wrong pages.
Could you provide the extracted data from the PDF in text files, so that I can confirm from my side that the solution works for all similar cases?
Sometimes it's
“1 of 2”
Apologies for the late reply.
I did manage to make a workflow that splits the PDF pages in the required manner.
But the problem is still the extraction.
The extraction is not fully capable of detecting the page ranges, maybe due to the quality of the images.
In the case of the 4th page of the PDF, it didn’t detect the page range, as the OCR read 1 as I and did not detect "of".
Hence, if we are not able to extract the details properly, we may not be able to perform the splitting properly either.
Below is the extracted text from the 4th page of the PDF. As you can see, the page range is not extracted well.
Below is the workflow developed so far. It currently throws an error, since the page ranges are not extracted properly.
Split_PdfPages_ByPageNo.zip (1.4 MB)
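For the OCR misread mentioned above (1 read as I), one possible mitigation is to accept look-alike characters in the marker and normalize them before converting to numbers. This is only a sketch in Python with hypothetical names (`OCR_DIGIT_FIXES`, `LOOSE_MARKER`, `find_marker_tolerant`), and it assumes the "N of M" format. It cannot recover a marker where "of" itself is missing from the OCR output, as on the 4th page.

```python
import re

# Characters OCR commonly confuses with the digit 1 (assumed, extend as needed).
OCR_DIGIT_FIXES = str.maketrans({"I": "1", "l": "1"})

# Digits or look-alikes, then "of", then digits or look-alikes.
LOOSE_MARKER = re.compile(
    r"(?<![A-Za-z0-9])([0-9Il]{1,3})\s*[oO][fF]\s*([0-9Il]{1,3})(?![A-Za-z0-9])"
)

def find_marker_tolerant(text):
    """Return (current, total), treating 'I' and 'l' as 1, or None if no marker."""
    match = LOOSE_MARKER.search(text)
    if match is None:
        return None
    current, total = (int(group.translate(OCR_DIGIT_FIXES)) for group in match.groups())
    return current, total

print(find_marker_tolerant("Page I of 2"))
print(find_marker_tolerant("Page I 2"))
```

Pages where this still returns nothing would need a better scan, different OCR settings, or a manual check before splitting.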