Benchmarking Extract Transform Load (ETL) pipelines in UiPath


Target audience: CoE Leads, RPA Solution Architects and RPA Developers


Reasons to perform such benchmarks in UiPath

  1. The UiPath forum gets a lot of queries about using Linq queries to manipulate datatables. Many developers have fallen for the myth that Linq queries are more performant than the standard Filter Data Table activity. I started off investigating those two techniques, but I also want to share my experience with other tools that can outperform any standard UiPath datatable activity or Linq query. I thank @postwick for taking the lead and starting this discussion back in 2021, and @ppr for helping us bust the myth.
  2. We are developers; if we do not adapt and learn new technologies, we become redundant. If you or someone you know is looking for the best-performing ETL (Extract, Transform, Load) flow, you have to evaluate new tools to improve efficiency.
  3. I started learning Rust and came across the Polars crate and how good a tool it can be for datatable manipulation. The developer of Polars (Ritchie Vink) wrote about his benchmarking results in this blog, and I was sold. I had to try it out.
  4. @StefanSchnell 's integration tutorial also motivated me to show how new tools can be integrated with UiPath without any prebuilt libraries or official integrations. Through this post, I can contribute two more integrations (DuckDB and Rust) in addition to the ones mentioned in his tutorial.
  5. Even if just one of you readers gets motivated and uses some of the tools mentioned here, it might someday help reduce a large amount of robot processing time. There are large potential time savings (without compromising maintainability or robustness) in using some of these newer tools in your automations.

Disclaimers

  1. I am starting small. The first version of this benchmark was created during my vacation, on an old laptop with an Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz and a mere 4 GB of RAM. I therefore chose to test only a filter transformation in this first benchmark. There is no doubt these tests would run much faster on current multi-core processors, especially with the tools that take advantage of high core count CPU architectures.
  2. I am not advocating against using Linq queries or the Filter Data Table activity in UiPath Studio, but I do want the reader to know about newer techniques and be able to make an informed decision. Filter Data Table and Linq queries are still the easiest to start with, as they are fully supported by UiPath, but if you are focused purely on performance, they are not good enough in my view.
  3. In hindsight, I could have used the Testing Project template, but I started with a Windows Standard project in UiPath and did not wish to change it. However, this has not affected the time measurements made in this benchmark.
  4. Although the benchmark files to be included (a few weeks after publishing this tutorial) are well thought through, do not use them in any serious automation; they do not contain any error handling. I will wait for community feedback before publishing the benchmark xaml files. Maybe I can even publish them on UiPath Marketplace and write additional tutorial posts on how to integrate each tool with UiPath.
  5. A csv file is not the newest or best input format for such benchmarks, but since UiPath has out-of-the-box support for reading and writing csv files, I chose to test on csv files. The Parquet format is the current go-to format in the data science community.
  6. No PC performance monitoring was carried out while the tests ran, so I do not have data on processor and RAM utilization. Maybe a Grafana server polling the PC performance counters would have helped with that analysis.

Creating the input data

To imitate realistic datatables, I used the Faker library in Python. Faker is a handy module that creates fake data quickly and easily. In this case, I iterate over a list of required row counts and create a csv file for each in a for loop. To ensure easy integration with UiPath, I chose the csv file format.

Python Code - FakeDataGenerator.py

from faker import Faker
import pandas as pd

fake = Faker()
NumRows = [10000, 100000, 250000, 500000, 1000000, 2000000] 
for num in NumRows:
    profileData = [fake.profile() for i in range(num)]
    df = pd.DataFrame(profileData)
    df.to_csv("{0}_input.csv".format(num), index=False)
    print("Saved the file {0}_input.csv".format(num))

print("All dummy data created")

The above code generates the input csv files with row counts of 10000, 100000, 250000, 500000, 1000000 and 2000000. Below is a sample of the input data used in this benchmark.
Input Data Shape (RowCount * 13 columns)

| job | company | ssn | residence | current_location | blood_group | website | username | name | sex | address | mail | birthdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Print production planner | Klein-Caldwell | 424-79-6551 | 252 Davis Forks Apt. 948, Kelleyhaven, NC 18661 | (Decimal('-86.2965425'), Decimal('-41.336330')) | B+ | ['http://www.bird.com/', 'https://www.miller.net/'] | watsonsteven | Kevin Ball | M | 34641 Glenn Oval Suite 696, Lake Shellyhaven, ID 68870 | erin57@hotmail.com | 2011-08-28 |
| Research officer, political party | Benton, Thompson and Powell | 172-32-9022 | 5401 Lewis Ville Apt. 542, South Brett, MD 72875 | (Decimal('9.974042'), Decimal('-179.607076')) | AB- | ['https://fuller.com/', 'http://kaiser-parker.com/', 'https://www.garcia.info/', 'http://hill-vasquez.net/'] | tiffanyleonard | Linda Jones | F | 83606 Richardson Spring, Scottborough, FM 89592 | tammywashington@gmail.com | 1973-07-06 |
| Statistician | Thomas, Carr and Medina | 384-30-9782 | 4213 Connor Street, Petersport, DC 54921 | (Decimal('-60.1181895'), Decimal('32.682063')) | O- | ['https://phillips.net/'] | jimmy18 | Manuel Boone | M | 14996 Brad Village, South Jessicaport, MI 91834 | shellyfrancis@yahoo.com | 1981-07-14 |

The filter benchmark

The filter condition used in all approaches translates to the same SQL logic: return the rows where the column job has the value 'Social worker'.

SELECT * FROM  TESTDATATABLE
WHERE job='Social worker';
  • The Start Timer and Stop Timer activities were used to time only the unit under test (ingesting the csv data, transforming it and saving the result to a new file). This keeps the comparison unbiased and objective; in short, an apples-to-apples comparison.

  • The execution was performed via UiRobot.exe rather than from Studio, to avoid any debugging lag.

    C:\Users\%USER%\AppData\Local\Programs\UiPath\Studio\UiRobot.exe  execute --file "C:\Users\%USER%\Documents\UiPath\_Nugets\BenchmarkingETLApproachesinUiPath.1.0.3.nupkg"  --input "{'in_InputDataFolder':'C:\\Users\\%USER%\\Documents\\UiPath\\InputData'}" --entry "FilterBenchmarker.xaml"
    
  • To test repeatability, 3 runs were executed. The lower (and greener) the bar, the better that approach performs at filtering a datatable.

In total the filter benchmark tests 7 different approaches / tools.

  1. UiPath Filter Datatable activity
    (screenshot of the Filter Data Table activity configuration)

  2. Linq Query in an Assign activity
    InputData.AsEnumerable.Where(Function(r) r.Item("job").ToString.Equals("Social worker")).CopyToDataTable

  3. Running a Python script using the Pandas library

    import pandas as pd
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
    parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
    args = parser.parse_args()
    
    # Read csv
    df = pd.read_csv(args.InputCSVPath)
    # Apply filter
    filtered_df = df[df.job == "Social worker"]
    # save output csv
    filtered_df.to_csv(args.OutputCSVPath)
    
  4. Running a Python script using the Polars library

    import polars as pl
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
    parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
    args = parser.parse_args()
    
    # Read csv
    df = pl.read_csv(file=args.InputCSVPath) 
    # Apply filter
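    # Note: str.contains does a substring match; pl.col("job") == "Social worker" would match exactly like the SQL condition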
    filtered_df = df.filter(pl.col("job").str.contains("Social worker"))
    # save output csv
    filtered_df.write_csv(file=args.OutputCSVPath)
    
    
  5. Running a PowerShell script

    Import-Csv -Path 'INPUTFILE' | Where { $_.job -eq 'Social worker'} | Export-Csv -Path 'OUTPUTFILE' -NoTypeInformation
    
  6. Running the DuckDB executor with a SQL command
    Read more about why DuckDB is great for analytic transformations

    duckdb.exe -c "CREATE OR REPLACE TABLE TABLENAME AS SELECT * FROM read_csv_auto('FULLFILEPATH'); COPY (SELECT * FROM TABLENAME WHERE job='Social worker') TO 'OUTPUTPATH' (HEADER, DELIMITER ',');"
    

    The -c flag tells duckdb.exe to run the command directly from the CLI (against a transient in-memory database) and exit, without opening an interactive session.

  7. Running a Rust program using the Polars crate

    use polars::{prelude::*, lazy::dsl::col};
    use std::env;
    
    fn main() {
        // Setting arguments for input and output csv
        let args: Vec<String> = env::args().collect();
        let csv_path = &args[1];
        let result_path = &args[2];
    
        // Reading csv files into a polars dataframe
        let df = CsvReader::from_path(csv_path).unwrap().finish().unwrap();
    
        // use the predicate to filter
        let predicate = col("job").eq(lit("Social worker"));
       
        let filtered_df = df
                        .clone()
                        .lazy()
                        .filter(predicate )
                        .collect()
                        .unwrap();
    
        let mut output = filtered_df;
        
        // Writing to a csv file
        let mut file = std::fs::File::create(result_path).unwrap();
        CsvWriter::new(&mut file).finish(&mut output).unwrap();
    }
    

    Example command to be used in Start-Process. Remember to compile the project with cargo build --release (or cargo run --release); the compiled executable (.exe) will then be located in the target\release folder (see the documentation here).

    C:\Users\%USER%\Documents\RustProjects\data_processer\target\release\data_processer.exe  'C:\Users\%USER%\Documents\UiPath\InputData\2000000_input.csv'  'OUTPUT.csv'
    

PC Specifications

Specification Value
Name Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
NumberOfCores 2
NumberOfLogicalProcessors 4
Architecture x64
CurrentClockSpeed 1700 MHz
Availability RunningOrFullPower
OsName Microsoft Windows 10 Home Single Language
CsPhyicallyInstalledMemory 4 GB

Results for a single filter operation

This benchmarking project produces an Excel file containing all the runs with the different tools. To analyze and plot the results, I used Python.

Python Code - BenchmarkPlotting.py

import polars as pl
import pandas as pd
import altair as alt

df = pl.read_excel("FilterBenchmarkResults.xlsx", sheet_name="Results")
# Transforming data and creating new columns
df = df.with_columns([
    (pl.col("FileName").str.split(by="_").arr.get(0).cast(pl.Int64)).alias("RowCount"),
    (pl.col("Approach").str.split(by="FilterUsing").arr.get(1)).alias("Tool"),    
])
df = df.with_columns([
  (pl.col("ExecutionTimeSeconds").rank(method="max").over(["RowCount"]).alias("Rank")),  
])
# Sorting the data for better readability
df = df.sort(["RowCount","ExecutionTimeSeconds"])

# as altair does not support Polars dataframes converting to pandas dataframe
final_df= df.to_pandas()

# Plotting the results in a altair bar chart
chart = alt.Chart( ).transform_calculate(
        ToolTip="datum.Tool+'='+datum.ExecutionTimeSeconds").mark_bar().encode( 
        x=alt.X('Tool:N', axis=alt.Axis(labelAngle=-45, title=None)),
        y=alt.Y('ExecutionTimeSeconds:Q'),
        color=alt.Color('Rank:Q',scale=alt.Scale(scheme="blues", reverse=True)),
        tooltip="ToolTip:N",
        order='Rank:Q',
    )
    
text = chart.mark_text(
        color = 'black',
        dy= -5
    ).encode(
        text = alt.Text(
            'ExecutionTimeSeconds:Q',
            format = ',.0f')
    )
    
fig = alt.layer(chart, text, data=final_df).facet(
    column=alt.Column(
        'RowCount:Q', 
        )
    ).configure_facet(
    spacing=20
).configure_axis(
        grid=False
    ).configure_view(
        strokeWidth=0.4
    ).interactive()

fig.save('BenchmarkFilterDatatable.html')

Run 1 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.863817 184 1
Rust 100000 0.992763 184 2
PythonPolars 100000 1.62365 184 3
FilterDataTable 100000 2.28531 184 4
Linq 100000 2.60637 184 5
PowerShell 100000 6.74261 184 6
PythonPandas 100000 7.11996 184 7
DuckDB 250000 1.37164 413 1
Rust 250000 1.64754 413 2
PythonPolars 250000 2.01796 413 3
PythonPandas 250000 4.97285 413 4
FilterDataTable 250000 7.63237 413 5
Linq 250000 8.04975 413 6
PowerShell 250000 14.5997 413 7
Rust 500000 1.99611 762 1
PythonPolars 500000 2.02914 762 2
DuckDB 500000 2.10029 762 3
PythonPandas 500000 7.38287 762 4
Linq 500000 11.5408 762 5
FilterDataTable 500000 11.6786 762 6
PowerShell 500000 27.767 762 7
Rust 1000000 2.04737 1511 1
PythonPolars 1000000 3.09737 1511 2
DuckDB 1000000 3.74764 1511 3
PythonPandas 1000000 13.0274 1511 4
FilterDataTable 1000000 23.8323 1511 5
Linq 1000000 23.9049 1511 6
PowerShell 1000000 55.9658 1511 7
Rust 2000000 4.59196 3126 1
PythonPolars 2000000 5.40843 3126 2
DuckDB 2000000 7.18215 3126 3
PythonPandas 2000000 25.0529 3126 4
FilterDataTable 2000000 49.568 3126 5
Linq 2000000 57.2253 3126 6
PowerShell 2000000 109.237 3126 7

Run 2 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.933689 184 1
Rust 100000 1.17638 184 2
PythonPolars 100000 1.32587 184 3
FilterDataTable 100000 2.0938 184 4
Linq 100000 2.871 184 5
PythonPandas 100000 4.74992 184 6
PowerShell 100000 6.53177 184 7
Rust 250000 1.26209 413 1
PythonPolars 250000 1.70902 413 2
DuckDB 250000 1.73422 413 3
PythonPandas 250000 4.72671 413 4
Linq 250000 5.84544 413 5
FilterDataTable 250000 6.75782 413 6
PowerShell 250000 13.9554 413 7
Rust 500000 1.62974 762 1
PythonPolars 500000 1.92046 762 2
DuckDB 500000 2.87362 762 3
PythonPandas 500000 6.79275 762 4
Linq 500000 10.0704 762 5
FilterDataTable 500000 10.0898 762 6
PowerShell 500000 27.0774 762 7
Rust 1000000 2.7926 1511 1
PythonPolars 1000000 2.90596 1511 2
DuckDB 1000000 4.03603 1511 3
PythonPandas 1000000 12.0829 1511 4
Linq 1000000 21.4446 1511 5
FilterDataTable 1000000 21.8539 1511 6
PowerShell 1000000 55.4838 1511 7
PythonPolars 2000000 4.68916 3126 1
Rust 2000000 5.62342 3126 2
DuckDB 2000000 7.57531 3126 3
PythonPandas 2000000 22.729 3126 4
FilterDataTable 2000000 51.2262 3126 5
Linq 2000000 54.8478 3126 6
PowerShell 2000000 106.559 3126 7

Run 3 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.718989 184 1
Rust 100000 1.10113 184 2
PythonPolars 100000 1.37914 184 3
Linq 100000 1.99643 184 4
FilterDataTable 100000 2.00937 184 5
PythonPandas 100000 4.08225 184 6
PowerShell 100000 6.5047 184 7
Rust 250000 1.21478 413 1
DuckDB 250000 1.2367 413 2
PythonPolars 250000 1.74341 413 3
PythonPandas 250000 5.12249 413 4
FilterDataTable 250000 6.97901 413 5
Linq 250000 7.85241 413 6
PowerShell 250000 14.1751 413 7
Rust 500000 1.55787 762 1
PythonPolars 500000 1.88452 762 2
DuckDB 500000 1.88637 762 3
PythonPandas 500000 6.73607 762 4
Linq 500000 9.94437 762 5
FilterDataTable 500000 10.909 762 6
PowerShell 500000 27.1955 762 7
Rust 1000000 2.60658 1511 1
PythonPolars 1000000 3.02078 1511 2
DuckDB 1000000 3.59157 1511 3
PythonPandas 1000000 12.406 1511 4
Linq 1000000 20.9774 1511 5
FilterDataTable 1000000 21.4426 1511 6
PowerShell 1000000 53.9456 1511 7
PythonPolars 2000000 4.94405 3126 1
Rust 2000000 5.35049 3126 2
DuckDB 2000000 8.59179 3126 3
PythonPandas 2000000 26.4206 3126 4
FilterDataTable 2000000 50.3969 3126 5
Linq 2000000 77.3886 3126 6
PowerShell 2000000 107.122 3126 7

Potential gains

The results should also show the potential gains from comparing the best and the worst performing tool in the benchmark. To do this, I use Pandas groupby to aggregate the maximum and minimum execution times per row count and compute the percentage difference.
Python Code - CalculatePotentialGains.py continued from BenchmarkPlotting.py

# Selecting only certain columns
output_table = df.select(["Tool","RowCount", "ExecutionTimeSeconds", "RowsKept", "Rank"]).to_pandas()

# Potential gains to be made when comparing to min and max values per group
pct_group = output_table.groupby(['RowCount'])['ExecutionTimeSeconds'].agg(['max','min'])
pct_group['PotentialGainsinPercent'] = ((pct_group['max']-pct_group['min'])/pct_group['max'])*100
pct_group = pct_group.reset_index(level=0)

# Saving as markdown files to attach with tutorial
output_table.to_markdown('BenchmarkFilterDatatable.md', index=False)
pct_group.to_markdown('PotentialGainsinPercent.md')

# Plotting the results in a altair bar chart
chart = alt.Chart(pct_group).mark_bar().encode(
    x=alt.X('PotentialGainsinPercent:Q', axis=alt.Axis(
        labelAngle=-45, title="Percentage")),
    y=alt.Y('RowCount:N'),
    color=alt.Color('PotentialGainsinPercent:Q',
                    scale=alt.Scale(scheme="blues", reverse=False)),
    tooltip="PotentialGainsinPercent:Q",
)

text = chart.mark_text(
    color='black',
    dx=10
).encode(
    text=alt.Text(
        'PotentialGainsinPercent:Q',
        format=',.0f')
)


fig = (chart + text).configure_axis(
    grid=False
).configure_scale(
    bandPaddingInner=0.5
).configure_view(
    strokeWidth=0.4
).interactive()

fig.save('PotentialGainsChart.html')


Run 1 - Potential Gains in Percent (max and min execution times in seconds)
RowCount max min PotentialGainsinPercent
0 100000 7.11996 0.863817 87.8677
1 250000 14.5997 1.37164 90.605
2 500000 27.767 1.99611 92.8112
3 1000000 55.9658 2.04737 96.3417
4 2000000 109.237 4.59196 95.7963


Run 2 - Potential Gains in Percent (max and min execution times in seconds)
RowCount max min PotentialGainsinPercent
0 100000 6.53177 0.933689 85.7054
1 250000 13.9554 1.26209 90.9562
2 500000 27.0774 1.62974 93.9812
3 1000000 55.4838 2.7926 94.9668
4 2000000 106.559 4.68916 95.5995


Run 3 - Potential Gains in Percent (max and min execution times in seconds)
RowCount max min PotentialGainsinPercent
0 100000 6.5047 0.718989 88.9466
1 250000 14.1751 1.21478 91.4302
2 500000 27.1955 1.55787 94.2716
3 1000000 53.9456 2.60658 95.1681
4 2000000 107.122 4.94405 95.3846

Observations

From the results, the three best performing tools (in terms of execution time) are

  1. Compiled Rust executor
  2. DuckDB
  3. Python script using Polars

The Rust executor was plainly superior
I did notice that, when integrated with UiPath, the Rust executor was slower than when the same executable was run directly from PowerShell. For example, when I ran it manually from PowerShell, the execution times for filtering 2000000 rows were 4.26, 4.51 and 3.80 seconds across 3 runs, lower than what it achieved when launched via Start-Process in UiPath. A possible reason is that the Start-Process activity first opens CMD to start PowerShell, which then runs the Rust executor. This is probably why the Rust executor did not dwarf the other approaches on every dataset.
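One way to test this hypothesis would be to time the executable when launched directly versus through a PowerShell wrapper. Below is a minimal sketch of such a check (not part of this benchmark; the paths are placeholders for your own executable and csv files):

import subprocess
import time

# Placeholder paths: point these at your own compiled executor and input data.
exe = r"C:\RustProjects\data_processer\target\release\data_processer.exe"
in_csv = r"C:\InputData\2000000_input.csv"
out_csv = r"C:\InputData\filtered.csv"

commands = {
    "direct": [exe, in_csv, out_csv],
    "via PowerShell": ["powershell.exe", "-Command", f"& '{exe}' '{in_csv}' '{out_csv}'"],
}

for label, cmd in commands.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{label}: {time.perf_counter() - start:.2f} seconds")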

Another challenge is that Rust, as a language, is not as easy to get started with as, say, Python. Creating your own custom Rust programs with the Polars crate also adds development time. Imagine translating all the available datatable activities in UiPath Studio into Rust programs: runtime performance would be great, but at the cost of additional development time. Then again, once created, you can reuse them in all your projects.

On the contrary, DuckDB ran equally well stand-alone and when integrated with UiPath. Seeing Polars in Python perform so well was surprising, to say the least. The Linq query and the Filter Data Table activity were much slower than most of the external tools.

Again, as found in the myth-busting post from @postwick, it is clear from this benchmark that using Linq queries instead of standard UiPath datatable activities does not drastically improve performance. In fact, on large datatables the standard Filter Data Table activity performed better than the Linq query! For smaller datatables, the two are very close in execution time.

In all three runs, PowerShell performed the worst of all the approaches. Do not use PowerShell scripts for any datatable transformations; it is slow, very slow!

For large datatables, the best performing approach cut execution time by roughly 96% compared to the slowest approach. This is impactful in the RPA domain, where there are millions of robot transactions and each datatable transformation could potentially save over 80% of its execution time (assuming these results also hold for other kinds of transformations).
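As a quick sanity check of that figure, here is the potential-gains formula from CalculatePotentialGains.py applied to the Run 1 numbers for the 2000000-row file (PowerShell was the slowest approach, the Rust executor the fastest):

# Potential gain = (max - min) / max * 100, using the Run 1 results above
slowest = 109.237   # PowerShell, seconds (Run 1, 2000000 rows)
fastest = 4.59196   # Rust, seconds (Run 1, 2000000 rows)
gain = (slowest - fastest) / slowest * 100
print(f"{gain:.1f}% of the execution time could be saved")  # prints 95.8%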

Tasting my own medicine
If I had to convince my team to change the filtering logic in our automations, I would choose DuckDB, for the following reasons:

  1. It is the easiest of the three to integrate with UiPath (no compiling or virtual environments to worry about)
  2. We could use the SQL expert on our team to write optimized queries and drastically improve our transformations, both documentation-wise and performance-wise
  3. The DuckDB executable is tiny, just 24 megabytes, and does not need to be packed into the process; it can be downloaded at robot runtime
  4. It can easily be updated to a new version by the robot without republishing the process
  5. It provides the option to save the database or query results in different formats (Parquet as well) on a shared drive if needed (see the sketch below)
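On points 1 and 5: if the team later prefers to stay in Python rather than shell out to duckdb.exe, the same SQL can also be run in-process through DuckDB's Python package. This is only a sketch and was not part of the benchmark; the file names are illustrative:

import duckdb

# Same filter as the CLI command, executed against a transient in-memory database.
duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('2000000_input.csv')
          WHERE job = 'Social worker')
    TO 'filtered.csv' (HEADER, DELIMITER ',');
""")

# The filtered result can just as easily be written as Parquet (point 5 above).
duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('2000000_input.csv')
          WHERE job = 'Social worker')
    TO 'filtered.parquet' (FORMAT PARQUET);
""")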

Your feedback

Thank you for reading this post.

I would like to call on CoE leads, Solution Architects and Developers to analyze their robot utilization. If a large chunk of your robot utilization goes into processing datatables, then you should definitely evaluate all three approaches on your own use cases: the Rust executor, DuckDB and a Python script using the Polars module.

Feel free to provide feedback to improve this benchmark. If there are things I overlooked, I can fix them before I publish the xaml files, either here or on GitHub. My wish is to run this benchmark on a PC with better specifications and update the results in this post.
