We have a large Excel file that we need to process (over 4 million records; a small sample is attached below). The aim of our automation is to search through all the records that have a negative value, and for each one of these:
Take its corresponding TRX_DATE, ITEM_ID & ORGANIZATION_ID, then search all records from the PREVIOUS THREE (3) DAYS for another record with the same ORGANIZATION_ID, same ITEM_ID & same PRIMARY_QUANTITY (but positive this time).
If no such record is found, add the original record (the one with the negative value) to a new data table.
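To make the intent concrete, here is a minimal Python sketch of the matching logic described above (Python rather than UiPath/VB.NET, purely for illustration). The column names are taken from the post; the date format (ISO `YYYY-MM-DD`) and the exact reading of "previous three days" (the three calendar days strictly before the negative record's date) are assumptions:

```python
import csv
from collections import defaultdict
from datetime import datetime, timedelta

DATE_FMT = "%Y-%m-%d"  # assumption: TRX_DATE is ISO-formatted

def find_unmatched_negatives(path):
    rows = []
    with open(path, newline="") as f:
        for r in csv.DictReader(f):
            r["TRX_DATE"] = datetime.strptime(r["TRX_DATE"], DATE_FMT).date()
            r["PRIMARY_QUANTITY"] = float(r["PRIMARY_QUANTITY"])
            rows.append(r)

    # Index positive records by (org, item, quantity, date) for O(1) lookups,
    # instead of re-scanning all rows for every negative record.
    positives = defaultdict(int)
    for r in rows:
        q = r["PRIMARY_QUANTITY"]
        if q > 0:
            positives[(r["ORGANIZATION_ID"], r["ITEM_ID"], q, r["TRX_DATE"])] += 1

    unmatched = []
    for r in rows:
        q = r["PRIMARY_QUANTITY"]
        if q >= 0:
            continue
        # Look for a positive record with the same org/item and |quantity|
        # on any of the three days before this negative record's date.
        found = any(
            positives[(r["ORGANIZATION_ID"], r["ITEM_ID"], -q,
                       r["TRX_DATE"] - timedelta(days=d))] > 0
            for d in range(1, 4)
        )
        if not found:
            unmatched.append(r)
    return unmatched
```

Building the index once makes the whole pass roughly linear in the number of rows, which matters at 4 million records.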
And here is a small sample: small_sample.xlsx (13.1 KB)
(The original file is a csv, but the forum didn’t allow me to upload it as csv, so I converted it to xlsx)
What do you think is the right approach, taking into consideration the 4 million records that we have?
Ok there are times when I do envy the forum members for the fantastic challenges they get to work on, like this one!
Nonetheless, here are two things I would evaluate before using Excel as the input source for an automation.
1. Data binning
Even before we start, I would always advise that any table have a unique reference column, for example an auto-incremented ID column. Another advantage is the ability to containerize your data. This technique is called data binning: using it, you can think of your 4 million data points as, say, bins of 100,000, with the same logic repeated 40 times.
When UiPath reads a table, it is stored in memory. To ensure your robot PC does not end up thrashing the disk to compensate for limited RAM, data binning is the way to go for large datasets. It does not make the code faster, but it improves the stability of your automation. You could also combine this approach with your LINQ expression.
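As an illustration of the binning idea, here is a minimal Python sketch (Python purely for brevity; in UiPath this would be a loop over row ranges). The file is streamed one fixed-size bin at a time, so only one bin sits in memory, and each row gets the auto-incremented ID suggested above. `BIN_SIZE` and the per-bin `process` call are hypothetical names:

```python
import csv
from itertools import islice

BIN_SIZE = 100_000  # ~40 bins for 4 million rows

def binned_rows(path, bin_size=BIN_SIZE):
    """Yield the CSV one bin at a time so only bin_size rows sit in RAM."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        row_id = 0
        while True:
            bin_ = list(islice(reader, bin_size))
            if not bin_:
                break
            # Attach a unique auto-incremented ID to each row.
            for r in bin_:
                row_id += 1
                r["ID"] = row_id
            yield bin_

# Hypothetical usage:
# for bin_ in binned_rows("transactions.csv"):
#     process(bin_)  # your per-bin matching logic
```

One caveat for the original question: because the lookback window is three days, adjacent bins would need to overlap by three days of records (or the split should fall on date boundaries), otherwise matches near a bin edge are missed.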
2. Input excel data into a SQL table
Another alternative is to load the Excel file’s contents into a SQL table and perform the same operations in a SQL stored procedure. SQL performance will usually be superior to Excel/UiPath. In the end, the cleaned data can be retrieved by the robot for further processing (the robot calls the stored procedure).
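A rough sketch of that idea, using in-memory SQLite from Python as a stand-in for a real database server and stored procedure. The table and column names are illustrative, dates are assumed to be ISO-formatted strings, and the anti-join (`NOT EXISTS`) expresses "no matching positive record in the previous three days":

```python
import sqlite3

def unmatched_negatives_sql(rows):
    """rows: iterable of (trx_date, item_id, organization_id, primary_quantity)."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE trx (
        trx_date TEXT, item_id TEXT, organization_id TEXT, primary_quantity REAL)""")
    con.executemany("INSERT INTO trx VALUES (?, ?, ?, ?)", rows)
    # Index the lookup columns so the anti-join does not scan the whole table.
    con.execute("""CREATE INDEX ix_trx
        ON trx (organization_id, item_id, primary_quantity, trx_date)""")
    cur = con.execute("""
        SELECT n.* FROM trx n
        WHERE n.primary_quantity < 0
          AND NOT EXISTS (
            SELECT 1 FROM trx p
            WHERE p.organization_id  = n.organization_id
              AND p.item_id          = n.item_id
              AND p.primary_quantity = -n.primary_quantity
              AND p.trx_date >= date(n.trx_date, '-3 day')
              AND p.trx_date <  n.trx_date)""")
    return cur.fetchall()
```

On a real server the `SELECT` would live in a stored procedure and the robot would only fetch the cleaned result set.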
I would still go with the approach mentioned by @jeevith:
Instead of using an Excel file with such a large dataset as the input source, we can look for ways to hold that data in a DATABASE.
It will be much easier to fetch and manipulate data in a SQL database.
But if we don’t have that option, we can try other approaches.
Have a look at these examples with CSV and DataTables; maybe they will help:
@jeevith Thank you for your thoughtful answer, will definitely use your advice.
@Palaniyappan Thank you as well for the additional methods and info!
For future reference:
Adding to my original method, we also reduced the load on the robot by splitting the original CSV into multiple CSVs (each containing the records for a specific day).
This lets us simplify the LINQ expression by removing the date condition, which is much faster!
The results were quite good (iterating 4,000 records against 1.5 million others took 24 minutes). BUT it can be optimized further with the following syntax:
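That per-day split can be sketched as follows (again in Python for brevity). This assumes ISO-formatted `TRX_DATE` values, which are safe to use directly as file names; `out_dir` is an illustrative parameter:

```python
import csv
import os

def split_by_day(path, out_dir="daily"):
    """Write one CSV per TRX_DATE so each file can be processed without a date filter."""
    os.makedirs(out_dir, exist_ok=True)
    writers, handles = {}, {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for r in reader:
            day = r["TRX_DATE"]  # assumption: ISO date, e.g. 2023-01-03
            if day not in writers:
                fh = open(os.path.join(out_dir, f"{day}.csv"), "w", newline="")
                handles[day] = fh
                w = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[day] = w
            writers[day].writerow(r)
    for fh in handles.values():
        fh.close()
```

Each negative record in day D then only needs to be checked against the files for days D-1 through D-3, which is what removes the date condition from the per-record lookup.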