Benchmarking Extract Transform Load (ETL) pipelines in UiPath


Target audience: CoE Leads, RPA Solution Architects and RPA Developers


Reasons to perform such benchmarks in UiPath

  1. The UiPath forum gets a lot of queries regarding the use of Linq queries to mainpulate datatables. I see many developers have fallen for the myth that Linq queries are more performant than the standard Filter Datatable activity. I started of investigating the two aforementioned techniques, but I did want to share my experience with other tools which can outperform any standard UiPath datatable activity or Linq query. I thank @postwick for taking the lead and starting this discussion way back in 2021, and @ppr for helping us bust the myth.
  2. We are developers. If we do not adapt and learn new technologies we are redundant. If you or someone you know is looking for the best performing ETL (Extract, Transform, Load) flow, then they have to evaluate new tools to improve their efficiency.
  3. I started learning Rust and came across Polars crate and how good of a tool it can be for datatable manipulations. The developer of Polars ( Ritchie Vink) wrote about his benchmarking results in this blog and I was sold. I had to try it out.
  4. @StefanSchnell 's integration tutorial also motivated me to show how new tools can be integrated with UiPath without any prebuild libraries or official integrations. Through this post, I can contribute two more integrations (DuckDB and Rust) in addition to the ones mentioned in his tutorial
  5. Even if one of you readers gets motivated and uses some of these tools mentioned here, someday it might help reduce a large amount of robot processing time. There is a lot of potential time savings (without compromising maintainability and robustness) by using some of the newer tools in your automations.

Disclaimers

Disclaimers
  1. I am starting small. The first version of this benchmark was created during my vacation and on an old laptop on an Intel(R) Coreβ„’ i5-4210U CPU @ 1.70GHz with a mere 4GB RAM. Therefore, I have chosen the first benchmark to only test a filter transformation. There is no doubt, this test results would be blazing fast on current multi-core processors, specially the tools which take advantage of new high core count CPU architectures.
  2. I am not advocating against using Linq queries or Filter Datatable activity in UiPath Studio, but I do what the reader to know about newer techniques and allow them to make an informed decision. They (Filter Datatable and Linq queries) are still the easiest to start with as they are fully supported by UiPath, but if you are clearly focused only on performance, then they are not good enough in my view.
  3. In hindsight, I could have used the Testing Project template, but I started with a Windows Standard project in UiPath and did not wish to change it. However, this has not affected the time measurements made in this benchmark.
  4. Although the benchmark files to be included (after some weeks of publishing this tutorial) are well thought through, do not use them in any serious automation. They do not contain any error handling capabilities. I will wait for the community feedback before publishing the benchmark xaml files. May be I can even publish them on UiPath Marketplace and make additional tutorials posts on how to integrate each of them in UiPath.
  5. Csv file as input file is not the newest and best format to test such benchmarks. But since UiPath has out of the box support to read and write csv file, I chose to test on Csv files. Parquet file format is the current go to format in the data science community.
  6. No PC performance monitoring was carried out when the test ran. So I do not have data on the processor usage and ram utilization. May be a grafana server polling the PC performance may have helped in analyzing this.

Creating the input data

To imitate realistic datatables, I used the Faker library in Python. Faker is a handy module which can create fictive data in a fast and easy way. In this case, I iterate over a list of required rows in a datatable and create them in a for loop. To ensure easy integration with UiPath I chose the csv file format.

Python Code - FakeDataGenerator.py

from faker import Faker
import pandas as pd

fake = Faker()
NumRows = [10000, 100000, 250000, 500000, 1000000, 2000000] 
for num in NumRows:
    profileData = [fake.profile() for i in range(num)]
    df = pd.DataFrame(profileData)
    df.to_csv("{0}_input.csv".format(num), index=False)
    print("Saved the file {0}_input.csv".format(num))

print("All dummy data created")

The above code generates the input datatables with row counts of 10000, 100000, 250000, 500000, 1000000, and 2000000. Below is the sample input data used in this benchmark.
Input Data Shape (RowCount * 13 columns)

job company ssn residence current_location blood_group website username name sex address mail birthdate
Print production planner Klein-Caldwell 424-79-6551 252 Davis Forks Apt. 948
Kelleyhaven, NC 18661 (Decimal(β€˜-86.2965425’), Decimal(β€˜-41.336330’)) B+ [β€˜http://www.bird.com/’, β€˜https://www.miller.net/’] watsonsteven Kevin Ball M 34641 Glenn Oval Suite 696
Lake Shellyhaven, ID 68870 erin57@hotmail.com 2011-08-28
Research officer, political party Benton, Thompson and Powell 172-32-9022 5401 Lewis Ville Apt. 542
South Brett, MD 72875 (Decimal(β€˜9.974042’), Decimal(β€˜-179.607076’)) AB- [β€˜https://fuller.com/’, β€˜http://kaiser-parker.com/’, β€˜https://www.garcia.info/’, β€˜http://hill-vasquez.net/’] tiffanyleonard Linda Jones F 83606 Richardson Spring
Scottborough, FM 89592 tammywashington@gmail.com 1973-07-06
Statistician Thomas, Carr and Medina 384-30-9782 4213 Connor Street
Petersport, DC 54921 (Decimal(β€˜-60.1181895’), Decimal(β€˜32.682063’)) O- [β€˜https://phillips.net/’] jimmy18 Manuel Boone M 14996 Brad Village
South Jessicaport, MI 91834 shellyfrancis@yahoo.com 1981-07-14

The filter benchmark

The filter condition used in all approaches translates to the same SQL logic. Return rows where the input table column job has the value 'Social worker'.

SELECT * FROM  TESTDATATABLE
WHERE job='Social worker';
  • The Start and Stop timer activities were used to time only the unit under test (ingesting Csv data, transforming it and saving the result to a new file). This way we remain unbiased and objective while testing these approaches. In short, apples to apples comparison.

  • The execution was performed via the UiPath UiRobot.exe and not run from Studio, this was done to avoid any debugging lags.

    C:\Users\%USER%\AppData\Local\Programs\UiPath\Studio\UiRobot.exe  execute --file "C:\Users\%USER%\Documents\UiPath\_Nugets\BenchmarkingETLApproachesinUiPath.1.0.3.nupkg"  --input "{'in_InputDataFolder':'C:\\Users\\%USER%\\Documents\\UiPath\\InputData'}" --entry "FilterBenchmarker.xaml"
    
  • To test repeatability, 3 runs were executed. Lower the bar (greener the bar), the better performant is the approach to filter a datatable.

In total the filter benchmark tests 7 different approaches / tools.

  1. UiPath Filter Datatable activity
    image

  2. Linq Query in an Assign activity
    InputData.AsEnumerable.Where(Function(r) r.Item("job").ToString.Equals("Social worker")).CopyToDataTable

  3. Running Python script using Pandas library

    import pandas as pd
    import argparse
    
    parser = parser = argparse.ArgumentParser()
    parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
    parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
    args = parser.parse_args()
    
    # Read csv
    df = pd.read_csv(args.InputCSVPath)
    # Apply filter
    filtered_df = df[df.job == "Social worker"]
    # save output csv
    filtered_df.to_csv(args.OutputCSVPath)
    
  4. Running Python script using Polars library

    import polars as pl
    import argparse
    
    parser = parser = argparse.ArgumentParser()
    parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
    parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
    args = parser.parse_args()
    
    # Read csv
    df = pl.read_csv(file=args.InputCSVPath) 
    # Apply filter
    filtered_df = df.filter(pl.col("job").str.contains("Social worker"))
    # save output csv
    filtered_df.write_csv(file=args.OutputCSVPath)
    
    
  5. Running a PowerShell script

    Import-Csv -Path 'INPUTFILE' | Where { $_.job -eq 'Social worker'} | Export-Csv -Path 'OUTPUTFILE' -NoTypeInformation
    
  6. Running the DuckDB executor with using SQL command
    Read more about why DuckDB is great for analytic transformations

    duckdb.exe -c "CREATE OR REPLACE TABLE TABLENAME AS SELECT * FROM read_csv_auto('FULLFILEPATH'); COPY (SELECT * FROM TABLENAME WHERE job='Social worker') TO 'OUTPUTPATH' (HEADER, DELIMITER ',');"
    

    -c tag tells duckdb.exe to run command directly in CLI without opening a new in-memory database

  7. Running a Rust program using Polars crate

    use polars::{prelude::*, lazy::dsl::col};
    use std::env;
    
    fn main() {
        // Setting arguments for input and output csv
        let args: Vec<String> = env::args().collect();
        let csv_path = &args[1];
        let result_path = &args[2];
    
        // Reading csv files into a polars dataframe
        let df = CsvReader::from_path(csv_path).unwrap().finish().unwrap();
    
        // use the predicate to filter
        let predicate = col("job").eq(lit("Social worker"));
       
        let filtered_df = df
                        .clone()
                        .lazy()
                        .filter(predicate )
                        .collect()
                        .unwrap();
    
        let mut output = filtered_df;
        
        // Writing to a csv file
        let mut file = std::fs::File::create(result_path).unwrap();
        CsvWriter::new(&mut file).finish(&mut output).unwrap();
    }
    

    Example command to be used in Start-Process. Remember to use cargo build --release or cargo run --release arguments this will compile the project and then use the release folder to locate the executor (.exe) file see documentation here.

    C:\Users\%USER%\Documents\RustProjects\data_processer\target\release\data_processer.exe  'C:\Users\%USER%\Documents\UiPath\InputData\2000000_input.csv'  'OUTPUT.csv'
    

PC Specifications

Specification Value
Name Intel(R) Coreβ„’ i5-4210U CPU @ 1.70GHz
NumberOfCores 2
NumberOfLogicalProcessors 4
Architecture x64
CurrentClockSpeed 1700
Availability RunningOrFullPower
OsName Microsoft Windows 10 Home Single Language
CsPhyicallyInstalledMemory 4 GB

Results for a single filter operation

This benchmarking project results in an excel file contain all the runs with different tools. To analyze and plot the results I used Python.

Python Code - BenchmarkPlotting.py

import polars as pl
import pandas as pd
import altair as alt

df = pl.read_excel("FilterBenchmarkResults.xlsx", sheet_name="Results")
# Transforming data and creating new columns
df = df.with_columns([
    (pl.col("FileName").str.split(by="_").arr.get(0).cast(pl.Int64)).alias("RowCount"),
    (pl.col("Approach").str.split(by="FilterUsing").arr.get(1)).alias("Tool"),    
])
df = df.with_columns([
  (pl.col("ExecutionTimeSeconds").rank(method="max").over(["RowCount"]).alias("Rank")),  
])
# Sorting the data for better readability
df = df.sort(["RowCount","ExecutionTimeSeconds"])

# as altair does not support Polars dataframes converting to pandas dataframe
final_df= df.to_pandas()

# Plotting the results in a altair bar chart
chart = alt.Chart( ).transform_calculate(
        ToolTip="datum.Tool+'='+datum.ExecutionTimeSeconds").mark_bar().encode( 
        x=alt.X('Tool:N', axis=alt.Axis(labelAngle=-45, title=None)),
        y=alt.Y('ExecutionTimeSeconds:Q'),
        color=alt.Color('Rank:Q',scale=alt.Scale(scheme="blues", reverse=True)),
        tooltip="ToolTip:N",
        order='Rank:Q',
    )
    
text = chart.mark_text(
        color = 'black',
        dy= -5
    ).encode(
        text = alt.Text(
            'ExecutionTimeSeconds:Q',
            format = ',.0f')
    )
    
fig = alt.layer(chart, text, data=final_df).facet(
    column=alt.Column(
        'RowCount:Q', 
        )
    ).configure_facet(
    spacing=20
).configure_axis(
        grid=False
    ).configure_view(
        strokeWidth=0.4
    ).interactive()

fig.save('BenchmarkFilterDatatable.html')

Run 1 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.863817 184 1
Rust 100000 0.992763 184 2
PythonPolars 100000 1.62365 184 3
FilterDataTable 100000 2.28531 184 4
Linq 100000 2.60637 184 5
PowerShell 100000 6.74261 184 6
PythonPandas 100000 7.11996 184 7
DuckDB 250000 1.37164 413 1
Rust 250000 1.64754 413 2
PythonPolars 250000 2.01796 413 3
PythonPandas 250000 4.97285 413 4
FilterDataTable 250000 7.63237 413 5
Linq 250000 8.04975 413 6
PowerShell 250000 14.5997 413 7
Rust 500000 1.99611 762 1
PythonPolars 500000 2.02914 762 2
DuckDB 500000 2.10029 762 3
PythonPandas 500000 7.38287 762 4
Linq 500000 11.5408 762 5
FilterDataTable 500000 11.6786 762 6
PowerShell 500000 27.767 762 7
Rust 1000000 2.04737 1511 1
PythonPolars 1000000 3.09737 1511 2
DuckDB 1000000 3.74764 1511 3
PythonPandas 1000000 13.0274 1511 4
FilterDataTable 1000000 23.8323 1511 5
Linq 1000000 23.9049 1511 6
PowerShell 1000000 55.9658 1511 7
Rust 2000000 4.59196 3126 1
PythonPolars 2000000 5.40843 3126 2
DuckDB 2000000 7.18215 3126 3
PythonPandas 2000000 25.0529 3126 4
FilterDataTable 2000000 49.568 3126 5
Linq 2000000 57.2253 3126 6
PowerShell 2000000 109.237 3126 7

Run 2 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.933689 184 1
Rust 100000 1.17638 184 2
PythonPolars 100000 1.32587 184 3
FilterDataTable 100000 2.0938 184 4
Linq 100000 2.871 184 5
PythonPandas 100000 4.74992 184 6
PowerShell 100000 6.53177 184 7
Rust 250000 1.26209 413 1
PythonPolars 250000 1.70902 413 2
DuckDB 250000 1.73422 413 3
PythonPandas 250000 4.72671 413 4
Linq 250000 5.84544 413 5
FilterDataTable 250000 6.75782 413 6
PowerShell 250000 13.9554 413 7
Rust 500000 1.62974 762 1
PythonPolars 500000 1.92046 762 2
DuckDB 500000 2.87362 762 3
PythonPandas 500000 6.79275 762 4
Linq 500000 10.0704 762 5
FilterDataTable 500000 10.0898 762 6
PowerShell 500000 27.0774 762 7
Rust 1000000 2.7926 1511 1
PythonPolars 1000000 2.90596 1511 2
DuckDB 1000000 4.03603 1511 3
PythonPandas 1000000 12.0829 1511 4
Linq 1000000 21.4446 1511 5
FilterDataTable 1000000 21.8539 1511 6
PowerShell 1000000 55.4838 1511 7
PythonPolars 2000000 4.68916 3126 1
Rust 2000000 5.62342 3126 2
DuckDB 2000000 7.57531 3126 3
PythonPandas 2000000 22.729 3126 4
FilterDataTable 2000000 51.2262 3126 5
Linq 2000000 54.8478 3126 6
PowerShell 2000000 106.559 3126 7

Run 3 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.718989 184 1
Rust 100000 1.10113 184 2
PythonPolars 100000 1.37914 184 3
Linq 100000 1.99643 184 4
FilterDataTable 100000 2.00937 184 5
PythonPandas 100000 4.08225 184 6
PowerShell 100000 6.5047 184 7
Rust 250000 1.21478 413 1
DuckDB 250000 1.2367 413 2
PythonPolars 250000 1.74341 413 3
PythonPandas 250000 5.12249 413 4
FilterDataTable 250000 6.97901 413 5
Linq 250000 7.85241 413 6
PowerShell 250000 14.1751 413 7
Rust 500000 1.55787 762 1
PythonPolars 500000 1.88452 762 2
DuckDB 500000 1.88637 762 3
PythonPandas 500000 6.73607 762 4
Linq 500000 9.94437 762 5
FilterDataTable 500000 10.909 762 6
PowerShell 500000 27.1955 762 7
Rust 1000000 2.60658 1511 1
PythonPolars 1000000 3.02078 1511 2
DuckDB 1000000 3.59157 1511 3
PythonPandas 1000000 12.406 1511 4
Linq 1000000 20.9774 1511 5
FilterDataTable 1000000 21.4426 1511 6
PowerShell 1000000 53.9456 1511 7
PythonPolars 2000000 4.94405 3126 1
Rust 2000000 5.35049 3126 2
DuckDB 2000000 8.59179 3126 3
PythonPandas 2000000 26.4206 3126 4
FilterDataTable 2000000 50.3969 3126 5
Linq 2000000 77.3886 3126 6
PowerShell 2000000 107.122 3126 7

Potential gains

The results need to also show us the potential gains when comparing the best and the worst performing tool in the benchmark. To do this I use the Pandas groupby and pct_group methods.
Python Code - CalculatePotentialGains.py continued from BenchmarkPlotting.py

# Selecting only certain columns
output_table = df.select(["Tool","RowCount", "ExecutionTimeSeconds", "RowsKept", "Rank"]).to_pandas()

# Potential gains to be made when comparing to min and max values per group
pct_group = output_table.groupby(['RowCount'])['ExecutionTimeSeconds'].agg(['max','min'])
pct_group['PotentialGainsinPercent'] = ((pct_group['max']-pct_group['min'])/pct_group['max'])*100
pct_group = pct_group.reset_index(level=0)

# Saving as markdown files to attach with tutorial
output_table.to_markdown('BenchmarkFilterDatatable.md', index=False)
pct_group.to_markdown('PotentialGainsinPercent.md')

# Plotting the results in a altair bar chart
chart = alt.Chart(pct_group).mark_bar().encode(
    x=alt.X('PotentialGainsinPercent:Q', axis=alt.Axis(
        labelAngle=-45, title="Percentage")),
    y=alt.Y('RowCount:N'),
    color=alt.Color('PotentialGainsinPercent:Q',
                    scale=alt.Scale(scheme=color, reverse=False)),
    tooltip="PotentialGainsinPercent:Q",
)

text = chart.mark_text(
    color='black',
    dx=10
).encode(
    text=alt.Text(
        'PotentialGainsinPercent:Q',
        format=',.0f')
)


fig = chart + text
fig.configure_axis(
    grid=False
).configure_scale(
    bandPaddingInner=0.5
).configure_view(
    strokeWidth=0.4
).interactive()

fig.save('PotentialGainsChart.html')

Run1_PotentialGains

Run 1 - Potential Gains in Percent
RowCount max min PotentialGainsinPercent
0 100000 7.11996 0.863817 87.8677
1 250000 14.5997 1.37164 90.605
2 500000 27.767 1.99611 92.8112
3 1000000 55.9658 2.04737 96.3417
4 2000000 109.237 4.59196 95.7963

Run2_PotentialGains

Run 2 - Potential Gains in Percent
RowCount max min PotentialGainsinPercent
0 100000 6.53177 0.933689 85.7054
1 250000 13.9554 1.26209 90.9562
2 500000 27.0774 1.62974 93.9812
3 1000000 55.4838 2.7926 94.9668
4 2000000 106.559 4.68916 95.5995

Run3_PotentialGains

Run 3 - Potential Gains in Percent
RowCount max min PotentialGainsinPercent
0 100000 6.5047 0.718989 88.9466
1 250000 14.1751 1.21478 91.4302
2 500000 27.1955 1.55787 94.2716
3 1000000 53.9456 2.60658 95.1681
4 2000000 107.122 4.94405 95.3846

Observations

From the results, the three good performing tools (when considering execution time) are

  1. Complied Rust executor
  2. DuckDB
  3. Python script using Polars

Rust executor was plain superior
I did notice that when integrated with UiPath the Rust executor was slower than when the same executor was run via PowerShell. For example, when I run manually from PowerShell the execution time to filter for 2000000 rows was 4.26, 4.51 and 3.80 seconds for 3 runs, which is lower than the results it achieved when run from Start-Process in UiPath. A possible reason for this could be because Start-Process activity first opens CMD to start PowerShell and then runs the rust executor. This is probably why Rust executor did not dwarf other approaches on all datasets.

Another challenge is that Rust as a language is not as easy to get started compared to say Python. Creating your own custom Rust programs with Polars crate will also add to development time. Imagine translating all the available Datatable activities in UiPath Studio into Rust programs. Performance will be great under runtime, but at the cost of additional development time. Although once created you can use them in all your projects.

On the contrary, DuckDB ran equally well both in stand-alone and when integrated with UiPath. To see Polars in Python performing so well was surprising to say the least. Linq query and FilterDatatable activity were much slower than most of the external tools.

Again, as found in the Myth busting post from @postwick, it is clear from this benchmark that using Linq queries over standard UiPath datatable activities does not improve performance drastically. In fact, on large datatables, standard Filter Datatable activity performed better than Linq query! For smaller datatables, they both are very close when it comes to execution performance.

In all three runs, PowerShell performed the worst out of all the approaches. Do not use PowerShell scripts for any datatable transformations, its slow, very slow!

For large datatables, the best performing approach improved performance by 96% when compared to the slowest approach. This is impactful in the RPA domain where there are millions of robot transactions and each datatable transformation could potentially save over 80% of execution time (given that these results will also replicate on other kinds of transformations).

Tasting my own medicine
If I had to convince my team to change our filtering logic in our automations, I would choose DuckDB for the following reasons

  1. The easiest of the three to integrate with UiPath (no compiling or virtual environments to worry about)
  2. We could use our SQL expert in the team to write optimized queries and drastically improve our transformations both documentation wise and performance wise
  3. The DuckDB executor is tiny, just 24 megabytes and does not need to be packed in the process. It can be downloaded during robot runtime.
  4. It can be easily updated (new version) by the robot without the need to update the published process
  5. It provides an option to save database in different formats (parquet as well) on a shared drive if we needed to

Your feedback

Thank for reading this post.

I would like to call on CoE leads, Solution Architects and Developers to analyze their robot utilization. If a large chuck of your robot utilization is due to processing of datatables, then you should definitely evaluate all three approaches on your use cases; Rust executor, DuckDB and Python script using Polars module.

Feel free to provide feedback to improve this benchmark. If there are things I overlooked, I can fix them before I publish the xaml files either here or in GitHub. My wish is to run this benchmark on a pc with better specifications and update the results in this post.

3 Likes

Hello all,

After some discussions with the community members, I want to re-emphasize that this benchmark is just a benchmark. Do not use the results from this benchmark to push down on native activities and/or queries in UiPath Studio.

Use the approach that you and your team deem fit-for-use and fit-for-purpose. There are always trade-offs when it comes to optimizations.

The aim of this post as I said before is to perform an objective analysis of different ETL approaches within UiPath Studio and look for alternative performant/newer tools and not to pick favourites.

I repeated the above experiment on one of my other PC’s (a relatively newer PC) to retest the performance and wanted to share the results here.


PC Specifications

Specification Value
Name AMD Ryzen 5 5600G with Radeon Graphics
NumberOfCores 6
NumberOfLogicalProcessors 12
Architecture x64
CurrentClockSpeed 3901
Availability RunningOrFullPower
OsName Microsoft Windows 11 Pro
CsPhyicallyInstalledMemory 32 GB

Results for a single filter operation

Run 1 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.718989 184 1
Rust 100000 0.7309 184 2
PythonPolars 100000 0.836004 184 3
Linq 100000 1.05674 184 4
FilterDataTable 100000 1.24326 184 5
PythonPandas 100000 1.88014 184 6
PowerShell 100000 3.0373 184 7
Rust 250000 0.851104 413 1
DuckDB 250000 0.953046 413 2
PythonPolars 250000 0.996677 413 3
PythonPandas 250000 2.54604 413 4
FilterDataTable 250000 2.64703 413 5
Linq 250000 3.32452 413 6
PowerShell 250000 6.44875 413 7
Rust 500000 0.868724 762 1
PythonPolars 500000 0.986469 762 2
DuckDB 500000 1.32038 762 3
PythonPandas 500000 4.07691 762 4
Linq 500000 5.21964 762 5
FilterDataTable 500000 5.58746 762 6
PowerShell 500000 12.2488 762 7
Rust 1000000 1.08966 1511 1
PythonPolars 1000000 1.18384 1511 2
DuckDB 1000000 2.18962 1511 3
PythonPandas 1000000 7.30841 1511 4
FilterDataTable 1000000 11.1194 1511 5
Linq 1000000 11.1598 1511 6
PowerShell 1000000 24.2141 1511 7
Rust 2000000 1.59035 3126 1
PythonPolars 2000000 1.63263 3126 2
DuckDB 2000000 3.72812 3126 3
PythonPandas 2000000 12.7466 3126 4
Linq 2000000 23.022 3126 5
FilterDataTable 2000000 23.8702 3126 6
PowerShell 2000000 46.3586 3126 7

Run 2 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.71398 184 1
Rust 100000 0.718518 184 2
PythonPolars 100000 0.81498 184 3
FilterDataTable 100000 1.05043 184 4
Linq 100000 1.26911 184 5
PythonPandas 100000 1.82941 184 6
PowerShell 100000 3.0858 184 7
Rust 250000 0.8698 413 1
DuckDB 250000 0.929645 413 2
PythonPolars 250000 0.938026 413 3
Linq 250000 2.5261 413 4
PythonPandas 250000 2.67638 413 5
FilterDataTable 250000 2.81213 413 6
PowerShell 250000 6.45576 413 7
Rust 500000 0.945229 762 1
PythonPolars 500000 1.02525 762 2
DuckDB 500000 1.27723 762 3
PythonPandas 500000 4.01737 762 4
Linq 500000 5.88147 762 5
FilterDataTable 500000 5.90982 762 6
PowerShell 500000 12.6026 762 7
Rust 1000000 0.995003 1511 1
PythonPolars 1000000 1.20503 1511 2
DuckDB 1000000 2.3153 1511 3
PythonPandas 1000000 7.08859 1511 4
Linq 1000000 11.0784 1511 5
FilterDataTable 1000000 11.8625 1511 6
PowerShell 1000000 24.1542 1511 7
Rust 2000000 1.30956 3126 1
PythonPolars 2000000 1.57842 3126 2
DuckDB 2000000 3.88727 3126 3
PythonPandas 2000000 13.2473 3126 4
FilterDataTable 2000000 22.4328 3126 5
Linq 2000000 22.791 3126 6
PowerShell 2000000 46.2426 3126 7

Run 3 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.696087 184 1
Rust 100000 0.730319 184 2
PythonPolars 100000 0.824356 184 3
Linq 100000 1.09894 184 4
FilterDataTable 100000 1.30967 184 5
PythonPandas 100000 1.68164 184 6
PowerShell 100000 3.1136 184 7
PythonPolars 250000 0.931645 413 1
DuckDB 250000 0.943917 413 2
Rust 250000 0.986826 413 3
Linq 250000 2.57431 413 4
FilterDataTable 250000 2.59124 413 5
PythonPandas 250000 2.60687 413 6
PowerShell 250000 6.55377 413 7
Rust 500000 0.844374 762 1
PythonPolars 500000 1.0048 762 2
DuckDB 500000 1.36568 762 3
PythonPandas 500000 3.89163 762 4
Linq 500000 6.05201 762 5
FilterDataTable 500000 6.18218 762 6
PowerShell 500000 12.1815 762 7
Rust 1000000 1.03838 1511 1
PythonPolars 1000000 1.22731 1511 2
DuckDB 1000000 2.19648 1511 3
PythonPandas 1000000 6.6133 1511 4
Linq 1000000 11.4868 1511 5
FilterDataTable 1000000 11.5377 1511 6
PowerShell 1000000 23.6053 1511 7
Rust 2000000 1.3449 3126 1
PythonPolars 2000000 1.53425 3126 2
DuckDB 2000000 3.71976 3126 3
PythonPandas 2000000 12.2057 3126 4
Linq 2000000 22.2361 3126 5
FilterDataTable 2000000 24.9076 3126 6
PowerShell 2000000 47.0444 3126 7

Potential gains

Run1_PotentialGains

Run 1 - Potential Gains in Percent
RowCount max min PotentialGainsinPercent
0 100000 3.0373 0.718989 76.328
1 250000 6.44875 0.851104 86.802
2 500000 12.2488 0.868724 92.9077
3 1e+06 24.2141 1.08966 95.4999
4 2e+06 46.3586 1.59035 96.5695

Run2_PotentialGains

Run 2 - Potential Gains in Percent
RowCount max min PotentialGainsinPercent
0 100000 3.0858 0.71398 76.8624
1 250000 6.45576 0.8698 86.5268
2 500000 12.6026 0.945229 92.4997
3 1e+06 24.1542 0.995003 95.8806
4 2e+06 46.2426 1.30956 97.1681

Run3_PotentialGains

Run 3 - Potential Gains in Percent
RowCount max min PotentialGainsinPercent
0 100000 3.1136 0.696087 77.6436
1 250000 6.55377 0.931645 85.7846
2 500000 12.1815 0.844374 93.0684
3 1e+06 23.6053 1.03838 95.6011
4 2e+06 47.0444 1.3449 97.1412

Are you interested in contributing to the benchmark?

To be transparent with the implementation, I am sharing the workflow files of this benchmark on GitHub.

You will have to recreate dummy input data for different sizes using the Faker module as shown in the original post.

If anyone wants to extend this benchmark with other transformations like groupby, joins etc., feel free to send me a pull request in GitHub. Thank you for your contributions.

1 Like