Hi Team, I have 5 files:
1. Repo file: 1-2 lakh rows; each row represents an individual's complete information.
2. Input file: 10K-15K rows; each row represents the information of an individual to update.
3. Updated file: blank initially.
4. Reference file: blank initially; only headers are present.
5. Non-Updated file: blank initially.
Now, for each input row, I have to check whether that unique individual is present in the repo file, which I am doing with the DataTable.Select method. If not, I need to add the row to the Non-Updated file and the Reference file, which I am doing by building a DataTable and appending it to the CSV.
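One thing that may help regardless of the rest of the design: instead of calling a Select/filter against the repo table once per input row, the repo can be read once into a dictionary keyed by the unique-ID column, so each existence check is a constant-time lookup. A minimal sketch in Python, assuming a hypothetical `"ID"` column as the unique key (substitute your real key column):

```python
# Hypothetical unique-key column -- replace with your real one.
KEY = "ID"

def build_index(repo_rows):
    """Map each unique ID to its row, so each input row costs one
    dictionary lookup instead of a full-table Select/scan."""
    return {row[KEY]: row for row in repo_rows}

# Small in-memory stand-in for the repo CSV (1-2 lakh rows in practice).
repo_rows = [
    {"ID": "101", "Name": "Asha", "City": "Pune"},
    {"ID": "102", "Name": "Ravi", "City": "Delhi"},
]
index = build_index(repo_rows)

print("101" in index)  # membership test replaces the per-row Select
```

The same idea works with a .NET `Dictionary(Of String, DataRow)` built once before the input loop.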
If the unique individual is present in the repo file, I need to update that repo row based on the input file. (Note: the repo and input files have different numbers of columns; only 5-6 columns are common, and the information in only 2 columns needs updating.) I also need to take that updated repo row and append it to the Updated file and the Reference file.
I am doing this by selecting the row from the repo file, updating its values, and appending the row to the Updated file and the Reference file. But this does not update the row inside the repo file itself. So, to update it, I am looping through every row of the repo file (1-2 lakh rows), updating the row whose individual information matches, and then writing all the rows back to the CSV file. That means I am currently reading and writing 1-2 lakh rows 10K-15K times, which is very time-consuming.
Can anyone please help me with an efficient way to do this?
I thought writing the 1-2 lakh rows in a single go would save a lot of time, but in that scenario I would first need to remove a row from the Reference file, which is again time-consuming.
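To show what I mean by the single-go approach, here is a rough sketch of the whole flow: read the repo once, update matched rows in memory while splitting the input into updated / non-updated sets, and only then write each output file exactly once. This is illustrative Python, not the actual workflow; the `"ID"` key and the `UPDATE_COLS` names are assumptions standing in for the real common columns:

```python
import csv

KEY = "ID"                       # hypothetical unique-key column
UPDATE_COLS = ("City", "Phone")  # stand-ins for the 2 columns to update

def process(repo_rows, input_rows):
    """Single pass over the input: mutate matched repo rows in memory and
    collect the Updated / Non-Updated / Reference row sets, so each CSV
    is written once at the end instead of once per input row."""
    index = {row[KEY]: row for row in repo_rows}
    updated, non_updated, reference = [], [], []
    for inp in input_rows:
        repo_row = index.get(inp[KEY])
        if repo_row is None:
            non_updated.append(inp)
            reference.append(inp)
        else:
            for col in UPDATE_COLS:
                if col in inp:
                    repo_row[col] = inp[col]  # updates the repo in memory
            updated.append(repo_row)
            reference.append(repo_row)
    return updated, non_updated, reference

def write_csv(path, fieldnames, rows):
    """One write per file; extra columns in a row are simply ignored."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        w.writeheader()
        w.writerows(rows)

# Tiny stand-ins for the real files.
repo = [
    {"ID": "101", "Name": "Asha", "City": "Pune", "Phone": "111"},
    {"ID": "102", "Name": "Ravi", "City": "Delhi", "Phone": "222"},
]
inputs = [
    {"ID": "101", "City": "Mumbai", "Phone": "333"},  # present -> update
    {"ID": "999", "City": "Goa", "Phone": "444"},     # absent -> non-updated
]
updated, non_updated, reference = process(repo, inputs)
# `repo` already holds the updated rows, so writing it back is one
# write_csv("repo.csv", ...) call after the loop, not one per input row.
```

With this shape, the repo CSV is read once and written once for the whole run, and nothing ever has to be removed from the Reference file afterwards, because rows are only appended to the in-memory lists once their final state is known.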