Split a large text file into chunks

I have a large text file which I convert to CSV. This time the file is really too large. Is it possible to break it into smaller chunks before converting to CSV?


How would you like to create the chunks? Is there a specific string up to which each chunk should be created? Please specify, and share a sample of the large text file.


This zip file contains the sample file and the current workflow used to convert the text to CSV.
The workflow was built by someone else.

The sample file contains only 18,451 lines, but the main file I am using contains 380,581 lines.
I wouldn't be able to send the main file because it is too large.
The workflow reads the text file and removes unnecessary lines and data from the text.
Then it uses a Generate Data Table activity to build a datatable before writing it back to a CSV.
For the 18,451-line sample file it runs successfully; however, for the 380,581-line file it breaks at the Generate Data Table activity.
The error is “Job stopped due to …”
The reason is that there are too many lines.
I am wondering if there is any way we could count the number of lines on reaching the Generate Data Table activity and say: if the line count is greater than some value, split the text file into that many chunks,

then generate the tables separately,

and maybe merge and write to one CSV,

or write to separate CSVs.

I just want the best logic.

thanks a lot
Sample20200617-1v4 (2).zip (43.5 KB)
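The chunking idea described above can be sketched outside UiPath. Here is a minimal Python illustration (the line data and the 50,000 chunk size are stand-ins, not the actual workflow):

```python
# Split a large list of text lines into fixed-size chunks,
# so each chunk can be turned into a table separately.

def split_into_chunks(lines, chunk_size):
    """Yield successive chunks of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

# Stand-in for File.ReadAllLines on the big text file.
lines = ["row%d" % i for i in range(380_210)]
chunks = list(split_into_chunks(lines, 50_000))

print(len(chunks))      # number of chunks to process separately
print(len(chunks[-1]))  # the remainder chunk is smaller
```

Each chunk can then be fed to Generate Data Table on its own, and the results merged or written as separate CSVs.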

@MasterOfLogic Maybe we can use Merge Data Table to merge each 50,000 lines into a resultant datatable and then write the datatable once the data is fully retrieved. Have you tried it?

1 Like

Actually this data gets reduced to only 7,096 lines, and we can find this count before providing it as an argument to Generate Data Table.

Can you check what is the total number of lines present in the text ?

You can use this Split method to check the total number of lines it got reduced to, before the Generate Data Table activity:


I just want to know the count so that we can tell whether the problem is the huge data or something else :sweat_smile:
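The Split-based line count can be sketched like this (the cleaned text here is made up purely for illustration; in the workflow it would be the output of the cleaning steps):

```python
# Count how many lines remain after cleaning, before building the table.
cleaned_text = "a,1\nb,2\n\nc,3\n"  # made-up sample; blank line mimics removed data

# Split on newlines and drop empty entries, like
# Split(...) with StringSplitOptions.RemoveEmptyEntries.
line_count = len([ln for ln in cleaned_text.split("\n") if ln.strip()])
print(line_count)
```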

1 Like

It's 380,210 :smiley:

@MasterOfLogic Ok, it might be a problem of huge data :sweat_smile:. We might be able to split the lines into chunks of 50,000 or 20,000 each, merge them into the resultant datatable, and then write the datatable. Have you tried this?

1 Like

@supermanPunch, nope I haven't tried it… how do I do it please… lol

@MasterOfLogic Check this workflow :

I have used Split to convert the cleaned text to an array of strings, so that we can take chunks of the data and merge them, then write the result at the end.

I have commented out an Assign which I think is not needed, and have modified/added activities to get the output. I have used a chunk size of 2,000; you can use 20,000 or 50,000.
But first test whether this workflow works properly, and then use it on the actual data.
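As a rough illustration of that chunk-and-merge flow (Python stand-ins for Generate Data Table and Merge Data Table; the comma-split parsing and sample data are deliberately naive):

```python
import csv
import io

def generate_table(chunk_lines):
    """Stand-in for Generate Data Table: parse comma-separated lines into rows."""
    return [line.split(",") for line in chunk_lines]

def chunk_and_merge(lines, chunk_size):
    """Build a table per chunk and merge into one result, like Merge Data Table."""
    result = []  # stand-in for the resultant datatable
    for start in range(0, len(lines), chunk_size):
        result.extend(generate_table(lines[start:start + chunk_size]))
    return result

lines = ["%d,value%d" % (i, i) for i in range(7_096)]
rows = chunk_and_merge(lines, 2_000)

# Write the merged result once at the end, as with Write CSV.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(len(rows))
```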

Sample20200617-1v4 (2).zip (43.5 KB)

1 Like

Hello, the workflow you sent is the older one :smiley: (my workflow)… I think you made a mistake while attaching the files. Please resend.

@MasterOfLogic Yes. Sorry for that :sweat_smile: . Check this one below :
Sample20200617-1v4.zip (44.2 KB)

1 Like

@supermanPunch Hello, it's breaking here

@MasterOfLogic Ok, I understood why it's happening :sweat_smile:. We have to move those two activities after the Assign of textArray; I should have thought of that. Check the workflow, it should get you the right output:
Sample20200617-1v4.zip (44.2 KB)



It breaks here now, at Output Data Table.

@MasterOfLogic Ok. Can you remove those two activities and try writing the result datatable using the Write CSV activity?

1 Like

@supermanPunch Okay, I just did that and it wrote successfully, but the data is incomplete… it has only 210 rows in the CSV instead of 380,210 rows.

@MasterOfLogic Can you check resultDT.Rows.Count before writing the CSV file?
It should have written at least 7,096 rows :sweat_smile:

@supermanPunch Ohh yes, resultDT had 380,209 rows. My mistake was that I was writing dt to the CSV instead of resultDT.
But now, on writing resultDT to the CSV, it broke at Write CSV and was only able to write 304,616 rows successfully.
It appears about 75,593 rows are now missing.

In addition, my other concern is that I learnt datatables take a maximum of 16,777,216 rows,
and a CSV opened in Excel can only show 1,048,576 rows… what if the row count gets larger?

@MasterOfLogic Then maybe instead of Excel or CSV files you may need to use a database. If you think the datatable row count might go even higher, keep checking that the row count stays well below the limit; when it gets high, write the datatable to the database and then reset the datatable to an empty one. Basically you'll need to handle all these cases :sweat_smile:

Also, you can try writing the datatable to an .xlsx file and check whether all rows are present in it.
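One way to sketch that "flush to a database and reset" idea (SQLite is used here purely as an example store; the table name, columns, and the 100,000-row threshold are all made up for illustration):

```python
import sqlite3

ROW_LIMIT = 100_000  # hypothetical safety threshold, well under the DataTable max

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (id INTEGER, value TEXT)")

buffer = []                          # stand-in for the in-memory datatable
for i in range(250_000):             # simulated incoming rows
    buffer.append((i, "value%d" % i))
    if len(buffer) >= ROW_LIMIT:     # flush before the table grows too large
        conn.executemany("INSERT INTO rows VALUES (?, ?)", buffer)
        buffer = []                  # reset to an empty "datatable"
if buffer:                           # flush whatever is left at the end
    conn.executemany("INSERT INTO rows VALUES (?, ?)", buffer)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM rows").fetchone()[0]
print(total)
```

No single in-memory table ever exceeds the threshold, yet the database ends up with every row.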

1 Like

@supermanPunch Ohh cool, are you saying to use an OleDB connection to Excel? Please help me… lool… I usually use something like this to read from Excel using a database connection: "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + ExcelFilePath + ";Extended Properties=""Excel 12.0 Xml;HDR=YES""" — could I use this for writing too?