I have one PDF with 100 pages containing data for several car companies. Can I split it into 4 documents based on page number and company name?

I have one PDF with 100 pages that contains data for several car companies.

Tata data from page 1 to 7
Maruthi data from page 8 to 22
Tata from page 23 to 26
Kia from 27 to 99
Maruthi on page 100.

Can I split it into 4 documents with the correct pages?
Doc1 as 7 pages
Doc2 as 15 pages
Doc3 as 77 pages
Doc4 as 1 page

Hi @MitheshBolla ,

Have you tried the Extract PDF Page Range activity in the UiPath.PDF.Activities package?

We can specify the page ranges to extract, and the activity writes those pages to a single new document.

Could you check it out and let us know whether it meets the requirement?
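
As a rough illustration of the range bookkeeping (plain Python, not UiPath), the spans listed in the question can be turned into range strings and page counts. This is only a sketch: it assumes the activity accepts range strings of the form "1-7" or a single page number, and the helper names are made up for the example.

```python
# Page spans as listed in the question: (company, first page, last page).
sections = [
    ("Tata", 1, 7),
    ("Maruthi", 8, 22),
    ("Tata", 23, 26),
    ("Kia", 27, 99),
    ("Maruthi", 100, 100),
]

def to_range_string(first, last):
    """Format an inclusive page span as '1-7', or '100' for a single page."""
    return str(first) if first == last else f"{first}-{last}"

def page_count(first, last):
    """Number of pages in an inclusive span."""
    return last - first + 1

range_strings = [to_range_string(f, l) for _, f, l in sections]
counts = [page_count(f, l) for _, f, l in sections]

for (company, _, _), rng, n in zip(sections, range_strings, counts):
    print(f"{company}: range {rng} ({n} pages)")
```

Note that the question lists five spans; pages 23-26 and 27-99 together make up the 77 pages it assigns to Doc3.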

To split one PDF into multiple PDFs, I am reading the page range from the PDF itself. The first page says
Page 1-9,
so I need to split those 9 pages into one PDF.
Page 10 then says 1-4,
so pages 10, 11, 12 and 13 need to go into another PDF.

@MitheshBolla ,

I assume the first page contains the details of the page ranges to be extracted.

In that case, we can read the first page of the PDF, then use a regex to extract the page range.

We would need sample data from the PDF, after it has been converted to text, to help you further.

If I use the Microsoft, Tesseract, or OmniPage OCR engines, the data is not extracted properly. Which OCR engine is best for extracting data from a PDF?

@MitheshBolla

If the PDF is a scanned PDF, i.e. it contains images, then it will be difficult to extract data if the image quality is very low.

However, give the UiPath OCR a try and let us know. You may need to activate the Enterprise Trial in Orchestrator, and it needs the Document Understanding API key.

Yes, I have Enterprise, and the PDF quality is low. Can I read the PDF with an OCR activity, without using the Document Understanding framework?

@MitheshBolla , we don’t need to use the full Document Understanding framework.

We just need to extract the data using OCR.

As you have already mentioned that some of the OCR engines do not give the expected output, the remaining one to try is UiPath Document OCR.

Also, when using the other OCR engines, keep the Profile set to Scan and try different Scale values to check whether the results improve.

This is the package, right?

@MitheshBolla Yes. Do give it a try and let us know the outcome.

Yes, about 90% of it is extracted well, but the data is displayed across 3 lines.

The date ends up in the middle of the “Page 1 of 2” text, because both are on the next line.

@MitheshBolla Could you provide this data as a text file?

Also, try other pages where you have to extract the page range and let us know whether they are in the same format.

It changes when the document has another box.


@MitheshBolla ,

Does the page range always appear in the table?

Also, is it possible to share the PDF file?

I have sent it to your inbox separately.

Hi @MitheshBolla ,

Apologies for the late reply.

It was possible to match the page ranges on multiple pages of the images. However, we still need to verify that the regex will detect them in other samples as well.

Below is the regex:

(\d+)\s*of\s*(\d+)

You can use the Matches activity to get all the matches. You can then use its output to check the total count or the matched values, using the expression below:

String.Join(",",PageRangeMatches.Cast(Of Match).Select(Function(x)x.Groups(1).Value.ToString+"-"+x.Groups(2).Value.ToString))

The PageRangeMatches variable is the output of the Matches activity.
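
Outside UiPath, the same match-then-join step can be sketched in Python. This is only an illustration of what the expression above produces; the sample text is made up and the function name is hypothetical.

```python
import re

# Same pattern as above: digits, optional whitespace, "of", optional whitespace, digits.
PAGE_RANGE = re.compile(r"(\d+)\s*of\s*(\d+)")

def join_page_ranges(text):
    """Return every 'N of M' hit as 'N-M', comma-separated."""
    return ",".join(f"{m.group(1)}-{m.group(2)}" for m in PAGE_RANGE.finditer(text))

# Hypothetical OCR text covering three pages.
sample = "Page 1 of 2 ... Page 2 of 2 ... Page 1 of 4"
joined = join_page_ranges(sample)
print(joined)
```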

Also note that the page range does not follow the PDF page numbers; instead, each split document starts again from page 1. We also need to confirm whether this is the case for all the documents that you receive.
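
Because the numbering restarts at 1 for every document, the "N of M" values have to be mapped back to absolute PDF page numbers before splitting. A minimal Python sketch of that mapping, assuming one (page_no, total) marker was read from every PDF page (the function name and input shape are assumptions for the example):

```python
def split_ranges(markers):
    """markers[i] is the (page_no, total) pair read from PDF page i + 1.

    A new document starts on every page whose marker says page 1.
    Returns inclusive (first, last) absolute page spans, one per document.
    Pages before the first 'page 1' marker are ignored.
    """
    starts = [i for i, (page_no, _total) in enumerate(markers, start=1) if page_no == 1]
    # Each document ends one page before the next one starts; the last ends with the PDF.
    ends = [s - 1 for s in starts[1:]] + [len(markers)]
    return list(zip(starts, ends))

# A 9-page document followed by a 4-page document, as in the earlier example.
markers = [(k, 9) for k in range(1, 10)] + [(k, 4) for k in range(1, 5)]
spans = split_ranges(markers)
print(spans)
```

Each resulting span can then be passed as an absolute range (for example 10-13) to the page-range extraction step.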

Yes, all documents are like that, and with your regex I got the values.
I am using Extract PDF Page Range. The first PDF is extracted correctly, but from the second one onwards the extraction is wrong.

Hi @MitheshBolla ,

Could you provide the data extracted from the PDF as text files, so that I can confirm on my side that the solution works for all similar cases?

(\d)\sof\s(\d)|(\d)of(\d)|(\d)\sof(\d)|(\d)of\s(\d)

Sometimes it is
“1 of 2”
“1of2”
“1 of2”
“1of 2”
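
For what it is worth, the single pattern (\d+)\s*of\s*(\d+) posted earlier in the thread appears to cover all four spellings without alternation, because \s* allows zero or more whitespace characters on either side of "of". A quick Python check, as a sketch only (the "10 of 12" case is an added example to show multi-digit page numbers):

```python
import re

# \s* matches zero or more whitespace characters; \d+ allows multi-digit numbers.
pattern = re.compile(r"(\d+)\s*of\s*(\d+)")

variants = ["1 of 2", "1of2", "1 of2", "1of 2", "10 of 12"]
parsed = [pattern.fullmatch(v).groups() for v in variants]

for text, (page_no, total) in zip(variants, parsed):
    print(f"{text!r} -> page {page_no} of {total}")
```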

Hi @MitheshBolla ,

Apologies for the late reply.

I did manage to build a workflow that splits the PDF pages in the required manner.

But the problem still lies in the extraction.

The OCR is not able to detect all the page ranges, possibly because of the image quality.

On the 4th page of the PDF, it did not detect the page range: it read the 1 as I, and the word "of" was not detected at all.

Hence, if we cannot extract these details reliably, we will not be able to split the PDF correctly either.
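
One possible mitigation, which is not part of the attached workflow and is only a guess at what might help, is to infer a missing marker from the page before it: if the previous page read "k of n" with k smaller than n, the unreadable page is probably "k + 1 of n". A Python sketch under that assumption (names are hypothetical):

```python
def fill_missing(markers):
    """markers[i] is (page_no, total), or None when OCR found no page range.

    A None that directly follows (k, n) with k < n is assumed to be (k + 1, n).
    A None at the start of a new document cannot be inferred and stays None.
    """
    filled = []
    previous = None
    for marker in markers:
        if marker is None and previous is not None and previous[0] < previous[1]:
            marker = (previous[0] + 1, previous[1])
        filled.append(marker)
        previous = marker
    return filled

# Hypothetical OCR results: the 2nd page of a 3-page document was unreadable.
ocr_markers = [(1, 3), None, (3, 3), (1, 2), (2, 2)]
repaired = fill_missing(ocr_markers)
print(repaired)
```

This only helps when the unreadable page is in the middle or at the end of a document; if the first page of a document is misread, the marker still has to come from a better OCR result.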

Below is the text extracted from the 4th page of the PDF. As you can see, the page range is not extracted well.

Below is the workflow developed so far. It currently throws an error, since the page ranges are not extracted properly.
Split_PdfPages_ByPageNo.zip (1.4 MB)
