Benchmarking Extract Transform Load (ETL) pipelines in UiPath


Target audience: CoE Leads, RPA Solution Architects and RPA Developers


Reasons to perform such benchmarks in UiPath

  1. The UiPath forum gets a lot of queries about using Linq queries to manipulate datatables. Many developers have fallen for the myth that Linq queries are more performant than the standard Filter Data Table activity. I started off investigating those two techniques, but I also want to share my experience with other tools that can outperform any standard UiPath datatable activity or Linq query. I thank @postwick for taking the lead and starting this discussion back in 2021, and @ppr for helping us bust the myth.
  2. We are developers; if we do not adapt and learn new technologies, we become redundant. If you or someone you know is looking for the best-performing ETL (Extract, Transform, Load) flow, you have to evaluate new tools to improve efficiency.
  3. I started learning Rust and came across the Polars crate and how good a tool it can be for datatable manipulation. The developer of Polars (Ritchie Vink) wrote about his benchmarking results in this blog, and I was sold. I had to try it out.
  4. @StefanSchnell 's integration tutorial also motivated me to show how new tools can be integrated with UiPath without any prebuilt libraries or official integrations. Through this post, I can contribute two more integrations (DuckDB and Rust) in addition to the ones mentioned in his tutorial.
  5. Even if just one of you readers gets motivated and uses some of the tools mentioned here, it might someday help reduce a large amount of robot processing time. There are large potential time savings (without compromising maintainability or robustness) in using some of these newer tools in your automations.

Disclaimers

  1. I am starting small. The first version of this benchmark was created during my vacation, on an old laptop with an Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz and a mere 4 GB of RAM. I therefore chose to test only a filter transformation in this first benchmark. There is no doubt these tests would run much faster on current multi-core processors, especially with the tools that take advantage of high core count CPU architectures.
  2. I am not advocating against using Linq queries or the Filter Data Table activity in UiPath Studio, but I do want the reader to know about newer techniques and be able to make an informed decision. Filter Data Table and Linq queries are still the easiest to start with, as they are fully supported by UiPath, but if you are focused purely on performance, they are not good enough in my view.
  3. In hindsight, I could have used the Testing Project template, but I started with a Windows Standard project in UiPath and did not wish to change it. However, this has not affected the time measurements made in this benchmark.
  4. Although the benchmark files to be included (a few weeks after publishing this tutorial) are well thought through, do not use them in any serious automation; they do not contain any error handling. I will wait for community feedback before publishing the benchmark xaml files. Maybe I can even publish them on UiPath Marketplace and write additional tutorial posts on how to integrate each tool with UiPath.
  5. A csv file is not the newest or best input format for such benchmarks, but since UiPath has out-of-the-box support for reading and writing csv files, I chose to test on csv files. The Parquet format is the current go-to format in the data science community.
  6. No PC performance monitoring was carried out while the tests ran, so I do not have data on processor and RAM utilization. Maybe a Grafana server polling the PC performance counters would have helped with that analysis.

Creating the input data

To imitate realistic datatables, I used the Faker library in Python. Faker is a handy module that creates fake data quickly and easily. In this case, I iterate over a list of required row counts and create a csv file for each in a for loop. To ensure easy integration with UiPath, I chose the csv file format.

Python Code - FakeDataGenerator.py

from faker import Faker
import pandas as pd

fake = Faker()
NumRows = [10000, 100000, 250000, 500000, 1000000, 2000000] 
for num in NumRows:
    profileData = [fake.profile() for i in range(num)]
    df = pd.DataFrame(profileData)
    df.to_csv("{0}_input.csv".format(num), index=False)
    print("Saved the file {0}_input.csv".format(num))

print("All dummy data created")

The above code generates the input csv files with row counts of 10000, 100000, 250000, 500000, 1000000 and 2000000. Below is a sample of the input data used in this benchmark.
Input Data Shape (RowCount * 13 columns)

| job | company | ssn | residence | current_location | blood_group | website | username | name | sex | address | mail | birthdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Print production planner | Klein-Caldwell | 424-79-6551 | 252 Davis Forks Apt. 948, Kelleyhaven, NC 18661 | (Decimal('-86.2965425'), Decimal('-41.336330')) | B+ | ['http://www.bird.com/', 'https://www.miller.net/'] | watsonsteven | Kevin Ball | M | 34641 Glenn Oval Suite 696, Lake Shellyhaven, ID 68870 | erin57@hotmail.com | 2011-08-28 |
| Research officer, political party | Benton, Thompson and Powell | 172-32-9022 | 5401 Lewis Ville Apt. 542, South Brett, MD 72875 | (Decimal('9.974042'), Decimal('-179.607076')) | AB- | ['https://fuller.com/', 'http://kaiser-parker.com/', 'https://www.garcia.info/', 'http://hill-vasquez.net/'] | tiffanyleonard | Linda Jones | F | 83606 Richardson Spring, Scottborough, FM 89592 | tammywashington@gmail.com | 1973-07-06 |
| Statistician | Thomas, Carr and Medina | 384-30-9782 | 4213 Connor Street, Petersport, DC 54921 | (Decimal('-60.1181895'), Decimal('32.682063')) | O- | ['https://phillips.net/'] | jimmy18 | Manuel Boone | M | 14996 Brad Village, South Jessicaport, MI 91834 | shellyfrancis@yahoo.com | 1981-07-14 |

The filter benchmark

The filter condition used in all approaches translates to the same SQL logic: return the rows where the column job has the value 'Social worker'.

SELECT * FROM  TESTDATATABLE
WHERE job='Social worker';
  • The Start Timer and Stop Timer activities were used to time only the unit under test (ingesting the csv data, transforming it and saving the result to a new file). This keeps the comparison unbiased and objective; in short, an apples-to-apples comparison.

  • The execution was performed via UiRobot.exe rather than from Studio, to avoid any debugging lag.

    C:\Users\%USER%\AppData\Local\Programs\UiPath\Studio\UiRobot.exe  execute --file "C:\Users\%USER%\Documents\UiPath\_Nugets\BenchmarkingETLApproachesinUiPath.1.0.3.nupkg"  --input "{'in_InputDataFolder':'C:\\Users\\%USER%\\Documents\\UiPath\\InputData'}" --entry "FilterBenchmarker.xaml"
    
  • To test repeatability, 3 runs were executed. The lower (and greener) the bar, the better that approach performs at filtering a datatable.

In total the filter benchmark tests 7 different approaches / tools.

  1. UiPath Filter Datatable activity
    (screenshot of the Filter Data Table activity configuration)

  2. Linq Query in an Assign activity
    InputData.AsEnumerable.Where(Function(r) r.Item("job").ToString.Equals("Social worker")).CopyToDataTable

  3. Running a Python script using the Pandas library

    import pandas as pd
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
    parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
    args = parser.parse_args()
    
    # Read csv
    df = pd.read_csv(args.InputCSVPath)
    # Apply filter
    filtered_df = df[df.job == "Social worker"]
    # save output csv
    filtered_df.to_csv(args.OutputCSVPath)
    
  4. Running a Python script using the Polars library

    import polars as pl
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
    parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
    args = parser.parse_args()
    
    # Read csv
    df = pl.read_csv(file=args.InputCSVPath) 
    # Apply filter
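    # Note: str.contains does a substring match; pl.col("job") == "Social worker" would match exactly like the SQL condition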
    filtered_df = df.filter(pl.col("job").str.contains("Social worker"))
    # save output csv
    filtered_df.write_csv(file=args.OutputCSVPath)
    
    
  5. Running a PowerShell script

    Import-Csv -Path 'INPUTFILE' | Where { $_.job -eq 'Social worker'} | Export-Csv -Path 'OUTPUTFILE' -NoTypeInformation
    
  6. Running the DuckDB executor with a SQL command
    Read more about why DuckDB is great for analytic transformations

    duckdb.exe -c "CREATE OR REPLACE TABLE TABLENAME AS SELECT * FROM read_csv_auto('FULLFILEPATH'); COPY (SELECT * FROM TABLENAME WHERE job='Social worker') TO 'OUTPUTPATH' (HEADER, DELIMITER ',');"
    

    The -c flag tells duckdb.exe to run the command directly from the CLI (against a transient in-memory database) and exit, without opening an interactive session.

  7. Running a Rust program using the Polars crate

    use polars::{prelude::*, lazy::dsl::col};
    use std::env;
    
    fn main() {
        // Setting arguments for input and output csv
        let args: Vec<String> = env::args().collect();
        let csv_path = &args[1];
        let result_path = &args[2];
    
        // Reading csv files into a polars dataframe
        let df = CsvReader::from_path(csv_path).unwrap().finish().unwrap();
    
        // use the predicate to filter
        let predicate = col("job").eq(lit("Social worker"));
       
        let filtered_df = df
                        .clone()
                        .lazy()
                        .filter(predicate )
                        .collect()
                        .unwrap();
    
        let mut output = filtered_df;
        
        // Writing to a csv file
        let mut file = std::fs::File::create(result_path).unwrap();
        CsvWriter::new(&mut file).finish(&mut output).unwrap();
    }
    

    Example command to be used in Start-Process. Remember to compile the project with cargo build --release (or cargo run --release); the compiled executable (.exe) will then be located in the target\release folder (see the documentation here).

    C:\Users\%USER%\Documents\RustProjects\data_processer\target\release\data_processer.exe  'C:\Users\%USER%\Documents\UiPath\InputData\2000000_input.csv'  'OUTPUT.csv'
    

PC Specifications

Specification Value
Name Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
NumberOfCores 2
NumberOfLogicalProcessors 4
Architecture x64
CurrentClockSpeed 1700 MHz
Availability RunningOrFullPower
OsName Microsoft Windows 10 Home Single Language
CsPhyicallyInstalledMemory 4 GB

Results for a single filter operation

This benchmarking project produces an Excel file containing all the runs with the different tools. To analyze and plot the results, I used Python.

Python Code - BenchmarkPlotting.py

import polars as pl
import pandas as pd
import altair as alt

df = pl.read_excel("FilterBenchmarkResults.xlsx", sheet_name="Results")
# Transforming data and creating new columns
df = df.with_columns([
    (pl.col("FileName").str.split(by="_").arr.get(0).cast(pl.Int64)).alias("RowCount"),
    (pl.col("Approach").str.split(by="FilterUsing").arr.get(1)).alias("Tool"),    
])
df = df.with_columns([
  (pl.col("ExecutionTimeSeconds").rank(method="max").over(["RowCount"]).alias("Rank")),  
])
# Sorting the data for better readability
df = df.sort(["RowCount","ExecutionTimeSeconds"])

# as altair does not support Polars dataframes converting to pandas dataframe
final_df= df.to_pandas()

# Plotting the results in a altair bar chart
chart = alt.Chart( ).transform_calculate(
        ToolTip="datum.Tool+'='+datum.ExecutionTimeSeconds").mark_bar().encode( 
        x=alt.X('Tool:N', axis=alt.Axis(labelAngle=-45, title=None)),
        y=alt.Y('ExecutionTimeSeconds:Q'),
        color=alt.Color('Rank:Q',scale=alt.Scale(scheme="blues", reverse=True)),
        tooltip="ToolTip:N",
        order='Rank:Q',
    )
    
text = chart.mark_text(
        color = 'black',
        dy= -5
    ).encode(
        text = alt.Text(
            'ExecutionTimeSeconds:Q',
            format = ',.0f')
    )
    
fig = alt.layer(chart, text, data=final_df).facet(
    column=alt.Column(
        'RowCount:Q', 
        )
    ).configure_facet(
    spacing=20
).configure_axis(
        grid=False
    ).configure_view(
        strokeWidth=0.4
    ).interactive()

fig.save('BenchmarkFilterDatatable.html')

Run 1 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.863817 184 1
Rust 100000 0.992763 184 2
PythonPolars 100000 1.62365 184 3
FilterDataTable 100000 2.28531 184 4
Linq 100000 2.60637 184 5
PowerShell 100000 6.74261 184 6
PythonPandas 100000 7.11996 184 7
DuckDB 250000 1.37164 413 1
Rust 250000 1.64754 413 2
PythonPolars 250000 2.01796 413 3
PythonPandas 250000 4.97285 413 4
FilterDataTable 250000 7.63237 413 5
Linq 250000 8.04975 413 6
PowerShell 250000 14.5997 413 7
Rust 500000 1.99611 762 1
PythonPolars 500000 2.02914 762 2
DuckDB 500000 2.10029 762 3
PythonPandas 500000 7.38287 762 4
Linq 500000 11.5408 762 5
FilterDataTable 500000 11.6786 762 6
PowerShell 500000 27.767 762 7
Rust 1000000 2.04737 1511 1
PythonPolars 1000000 3.09737 1511 2
DuckDB 1000000 3.74764 1511 3
PythonPandas 1000000 13.0274 1511 4
FilterDataTable 1000000 23.8323 1511 5
Linq 1000000 23.9049 1511 6
PowerShell 1000000 55.9658 1511 7
Rust 2000000 4.59196 3126 1
PythonPolars 2000000 5.40843 3126 2
DuckDB 2000000 7.18215 3126 3
PythonPandas 2000000 25.0529 3126 4
FilterDataTable 2000000 49.568 3126 5
Linq 2000000 57.2253 3126 6
PowerShell 2000000 109.237 3126 7

Run 2 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.933689 184 1
Rust 100000 1.17638 184 2
PythonPolars 100000 1.32587 184 3
FilterDataTable 100000 2.0938 184 4
Linq 100000 2.871 184 5
PythonPandas 100000 4.74992 184 6
PowerShell 100000 6.53177 184 7
Rust 250000 1.26209 413 1
PythonPolars 250000 1.70902 413 2
DuckDB 250000 1.73422 413 3
PythonPandas 250000 4.72671 413 4
Linq 250000 5.84544 413 5
FilterDataTable 250000 6.75782 413 6
PowerShell 250000 13.9554 413 7
Rust 500000 1.62974 762 1
PythonPolars 500000 1.92046 762 2
DuckDB 500000 2.87362 762 3
PythonPandas 500000 6.79275 762 4
Linq 500000 10.0704 762 5
FilterDataTable 500000 10.0898 762 6
PowerShell 500000 27.0774 762 7
Rust 1000000 2.7926 1511 1
PythonPolars 1000000 2.90596 1511 2
DuckDB 1000000 4.03603 1511 3
PythonPandas 1000000 12.0829 1511 4
Linq 1000000 21.4446 1511 5
FilterDataTable 1000000 21.8539 1511 6
PowerShell 1000000 55.4838 1511 7
PythonPolars 2000000 4.68916 3126 1
Rust 2000000 5.62342 3126 2
DuckDB 2000000 7.57531 3126 3
PythonPandas 2000000 22.729 3126 4
FilterDataTable 2000000 51.2262 3126 5
Linq 2000000 54.8478 3126 6
PowerShell 2000000 106.559 3126 7

Run 3 - benchmark results table
Tool RowCount ExecutionTimeSeconds RowsKept Rank
DuckDB 100000 0.718989 184 1
Rust 100000 1.10113 184 2
PythonPolars 100000 1.37914 184 3
Linq 100000 1.99643 184 4
FilterDataTable 100000 2.00937 184 5
PythonPandas 100000 4.08225 184 6
PowerShell 100000 6.5047 184 7
Rust 250000 1.21478 413 1
DuckDB 250000 1.2367 413 2
PythonPolars 250000 1.74341 413 3
PythonPandas 250000 5.12249 413 4
FilterDataTable 250000 6.97901 413 5
Linq 250000 7.85241 413 6
PowerShell 250000 14.1751 413 7
Rust 500000 1.55787 762 1
PythonPolars 500000 1.88452 762 2
DuckDB 500000 1.88637 762 3
PythonPandas 500000 6.73607 762 4
Linq 500000 9.94437 762 5
FilterDataTable 500000 10.909 762 6
PowerShell 500000 27.1955 762 7
Rust 1000000 2.60658 1511 1
PythonPolars 1000000 3.02078 1511 2
DuckDB 1000000 3.59157 1511 3
PythonPandas 1000000 12.406 1511 4
Linq 1000000 20.9774 1511 5
FilterDataTable 1000000 21.4426 1511 6
PowerShell 1000000 53.9456 1511 7
PythonPolars 2000000 4.94405 3126 1
Rust 2000000 5.35049 3126 2
DuckDB 2000000 8.59179 3126 3
PythonPandas 2000000 26.4206 3126 4
FilterDataTable 2000000 50.3969 3126 5
Linq 2000000 77.3886 3126 6
PowerShell 2000000 107.122 3126 7

Potential gains

The results should also show the potential gains from comparing the best and the worst performing tool in the benchmark. To do this, I use Pandas groupby to aggregate the maximum and minimum execution times per row count and compute the percentage difference.
Python Code - CalculatePotentialGains.py continued from BenchmarkPlotting.py

# Selecting only certain columns
output_table = df.select(["Tool","RowCount", "ExecutionTimeSeconds", "RowsKept", "Rank"]).to_pandas()

# Potential gains to be made when comparing to min and max values per group
pct_group = output_table.groupby(['RowCount'])['ExecutionTimeSeconds'].agg(['max','min'])
pct_group['PotentialGainsinPercent'] = ((pct_group['max']-pct_group['min'])/pct_group['max'])*100
pct_group = pct_group.reset_index(level=0)

# Saving as markdown files to attach with tutorial
output_table.to_markdown('BenchmarkFilterDatatable.md', index=False)
pct_group.to_markdown('PotentialGainsinPercent.md')

# Plotting the results in a altair bar chart
chart = alt.Chart(pct_group).mark_bar().encode(
    x=alt.X('PotentialGainsinPercent:Q', axis=alt.Axis(
        labelAngle=-45, title="Percentage")),
    y=alt.Y('RowCount:N'),
    color=alt.Color('PotentialGainsinPercent:Q',
                    scale=alt.Scale(scheme="blues", reverse=False)),
    tooltip="PotentialGainsinPercent:Q",
)

text = chart.mark_text(
    color='black',
    dx=10
).encode(
    text=alt.Text(
        'PotentialGainsinPercent:Q',
        format=',.0f')
)


fig = (chart + text).configure_axis(
    grid=False
).configure_scale(
    bandPaddingInner=0.5
).configure_view(
    strokeWidth=0.4
).interactive()

fig.save('PotentialGainsChart.html')


Run 1 - Potential Gains in Percent (max and min execution times in seconds)
RowCount max min PotentialGainsinPercent
0 100000 7.11996 0.863817 87.8677
1 250000 14.5997 1.37164 90.605
2 500000 27.767 1.99611 92.8112
3 1000000 55.9658 2.04737 96.3417
4 2000000 109.237 4.59196 95.7963


Run 2 - Potential Gains in Percent (max and min execution times in seconds)
RowCount max min PotentialGainsinPercent
0 100000 6.53177 0.933689 85.7054
1 250000 13.9554 1.26209 90.9562
2 500000 27.0774 1.62974 93.9812
3 1000000 55.4838 2.7926 94.9668
4 2000000 106.559 4.68916 95.5995


Run 3 - Potential Gains in Percent (max and min execution times in seconds)
RowCount max min PotentialGainsinPercent
0 100000 6.5047 0.718989 88.9466
1 250000 14.1751 1.21478 91.4302
2 500000 27.1955 1.55787 94.2716
3 1000000 53.9456 2.60658 95.1681
4 2000000 107.122 4.94405 95.3846

Observations

From the results, the three best performing tools (in terms of execution time) are

  1. Compiled Rust executor
  2. DuckDB
  3. Python script using Polars

The Rust executor was plainly superior
I did notice that, when integrated with UiPath, the Rust executor was slower than when the same executable was run directly from PowerShell. For example, when I ran it manually from PowerShell, the execution times for filtering 2000000 rows were 4.26, 4.51 and 3.80 seconds across 3 runs, lower than what it achieved when launched via Start-Process in UiPath. A possible reason is that the Start-Process activity first opens CMD to start PowerShell, which then runs the Rust executor. This is probably why the Rust executor did not dwarf the other approaches on every dataset.
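One way to test this hypothesis would be to time the executable when launched directly versus through a PowerShell wrapper. Below is a minimal sketch of such a check (not part of this benchmark; the paths are placeholders for your own executable and csv files):

import subprocess
import time

# Placeholder paths: point these at your own compiled executor and input data.
exe = r"C:\RustProjects\data_processer\target\release\data_processer.exe"
in_csv = r"C:\InputData\2000000_input.csv"
out_csv = r"C:\InputData\filtered.csv"

commands = {
    "direct": [exe, in_csv, out_csv],
    "via PowerShell": ["powershell.exe", "-Command", f"& '{exe}' '{in_csv}' '{out_csv}'"],
}

for label, cmd in commands.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{label}: {time.perf_counter() - start:.2f} seconds")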

Another challenge is that Rust, as a language, is not as easy to get started with as, say, Python. Creating your own custom Rust programs with the Polars crate also adds development time. Imagine translating all the available datatable activities in UiPath Studio into Rust programs: runtime performance would be great, but at the cost of additional development time. Then again, once created, you can reuse them in all your projects.

On the contrary, DuckDB ran equally well stand-alone and when integrated with UiPath. Seeing Polars in Python perform so well was surprising, to say the least. The Linq query and the Filter Data Table activity were much slower than most of the external tools.

Again, as found in the myth-busting post from @postwick, it is clear from this benchmark that using Linq queries instead of standard UiPath datatable activities does not drastically improve performance. In fact, on large datatables the standard Filter Data Table activity performed better than the Linq query! For smaller datatables, the two are very close in execution time.

In all three runs, PowerShell performed the worst of all the approaches. Do not use PowerShell scripts for any datatable transformations; it is slow, very slow!

For large datatables, the best performing approach cut execution time by roughly 96% compared to the slowest approach. This is impactful in the RPA domain, where there are millions of robot transactions and each datatable transformation could potentially save over 80% of its execution time (assuming these results also hold for other kinds of transformations).
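As a quick sanity check of that figure, here is the potential-gains formula from CalculatePotentialGains.py applied to the Run 1 numbers for the 2000000-row file (PowerShell was the slowest approach, the Rust executor the fastest):

# Potential gain = (max - min) / max * 100, using the Run 1 results above
slowest = 109.237   # PowerShell, seconds (Run 1, 2000000 rows)
fastest = 4.59196   # Rust, seconds (Run 1, 2000000 rows)
gain = (slowest - fastest) / slowest * 100
print(f"{gain:.1f}% of the execution time could be saved")  # prints 95.8%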

Tasting my own medicine
If I had to convince my team to change the filtering logic in our automations, I would choose DuckDB, for the following reasons:

  1. It is the easiest of the three to integrate with UiPath (no compiling or virtual environments to worry about)
  2. We could use the SQL expert on our team to write optimized queries and drastically improve our transformations, both documentation-wise and performance-wise
  3. The DuckDB executable is tiny, just 24 megabytes, and does not need to be packed into the process; it can be downloaded at robot runtime
  4. It can easily be updated to a new version by the robot without republishing the process
  5. It provides the option to save the database or query results in different formats (Parquet as well) on a shared drive if needed (see the sketch below)
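On points 1 and 5: if the team later prefers to stay in Python rather than shell out to duckdb.exe, the same SQL can also be run in-process through DuckDB's Python package. This is only a sketch and was not part of the benchmark; the file names are illustrative:

import duckdb

# Same filter as the CLI command, executed against a transient in-memory database.
duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('2000000_input.csv')
          WHERE job = 'Social worker')
    TO 'filtered.csv' (HEADER, DELIMITER ',');
""")

# The filtered result can just as easily be written as Parquet (point 5 above).
duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('2000000_input.csv')
          WHERE job = 'Social worker')
    TO 'filtered.parquet' (FORMAT PARQUET);
""")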

Your feedback

Thank you for reading this post.

I would like to call on CoE leads, Solution Architects and Developers to analyze their robot utilization. If a large chunk of your robot utilization goes into processing datatables, then you should definitely evaluate all three approaches on your own use cases: the Rust executor, DuckDB and a Python script using the Polars module.

Feel free to provide feedback to improve this benchmark. If there are things I overlooked, I can fix them before I publish the xaml files, either here or on GitHub. My wish is to run this benchmark on a PC with better specifications and update the results in this post.
