[Benchmarking] - Filtering a datatable using non-native approaches in UiPath Studio

This post benchmarks four non-native datatable filtering approaches in UiPath Studio. It aims to improve upon a previously reported benchmarking exercise.

To avoid duplication, details regarding input data generation, the generic filtering condition, and the hardware specifications of the host machine can be found in the first post of this performance series.

Integrations in UiPath Studio

Any non-native approach requires additional effort to integrate with UiPath Studio. In this case, the Start Process activity in UiPath is used to invoke the Rust, DuckDB, and Python scripts. Note that the "Filter Execution Time" is strictly limited to the filter operation and does not include additional steps such as extraction, loading, or writing. This ensures that the non-native approaches can be compared with the native filtering approaches previously reported in this performance series.
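To illustrate the integration pattern, the minimal Python sketch below mimics what the invocation does for the pandas script: the filter time printed to standard output is captured and parsed by the caller. This is illustrative only, not the actual workflow; the script name and file paths are hypothetical. (Each script also writes the timer value to a text file, which a workflow can read instead of parsing standard output.)

import subprocess

# Invoke the external filter script (hypothetical script name and paths)
result = subprocess.run(
    ["python", "filter_pandas.py", "input.csv", "output.csv"],
    capture_output=True,
    text=True,
    check=True,
)

# The pandas script prints e.g. "Filter operation took 0.001992 seconds";
# the fourth token is the elapsed time
filter_seconds = float(result.stdout.split()[3])
print(f"Filter execution time: {filter_seconds:.6f} s")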

CPU processing

Note that all the external approaches below were executed on the CPU, not on a GPU (although the majority of them support GPU processing). Again, this was done so that the native and non-native approaches can later be compared objectively; an apples-to-apples comparison, so to speak.

Below are the four non-native datatable filtering approaches and their corresponding scripts. We would like to thank the Perplexity LLM for help in fixing bugs in some of the scripts below.

Rust (v1.27.1)
use polars::{prelude::*, lazy::dsl::col};
use std::env;
use std::time::Instant;
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Read the input and output CSV paths from the command-line arguments
    let args: Vec<String> = env::args().collect();
    let csv_path = &args[1];
    let result_path = &args[2];

    // Read the CSV file into a Polars dataframe
    let df = CsvReader::from_path(csv_path).unwrap().finish().unwrap();

    // Define the filter predicate (job == "Social worker")
    let predicate = col("job").eq(lit("Social worker"));

    // Start timing the filter operation
    let start = Instant::now();

    let filtered_df = df
                    .clone()
                    .lazy()
                    .filter(predicate)
                    .collect()
                    .unwrap();

    // End timing and calculate duration
    let duration_seconds = start.elapsed().as_secs_f64();

    // Print the filter time so the calling process can capture it
    println!("{:.6}", duration_seconds);
    let mut output = filtered_df;
    
    // Writing to a csv file
    let mut file = File::create(result_path)?;
    CsvWriter::new(&mut file).finish(&mut output).unwrap();

    // Save the timer value to a text file
    let mut timer_file = File::create("timer_value.txt")?;
    writeln!(timer_file, "{:.6}", duration_seconds)?;

    Ok(())
}
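
The compiled binary takes the input and output CSV paths as its two positional arguments; the measured filter time is printed to standard output and also written to timer_value.txt.
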
DuckDB (v1.1.3)
-c "
-- Load the input CSV into a table (outside the timed region)
CREATE TABLE TABLENAME AS SELECT * FROM read_csv_auto('FULLFILEPATH');
-- Profile only the filter statement and write the timing as JSON
PRAGMA enable_profiling='json';
PRAGMA profiling_output='timer_value_duckdb.json';
CREATE TABLE FILTERED_TABLE AS SELECT * FROM TABLENAME WHERE job='Social worker';
PRAGMA disable_profiling;
SELECT COUNT(*) FROM FILTERED_TABLE;
-- Write the filtered rows to the output CSV and clean up
COPY FILTERED_TABLE TO 'OUTPUTPATH' (HEADER, DELIMITER ',');
DROP TABLE TABLENAME;
DROP TABLE FILTERED_TABLE;
"
Python-Pandas (v2.2.3)
import pandas as pd
import argparse
import time

parser = argparse.ArgumentParser()
parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
args = parser.parse_args()

# Read csv
df = pd.read_csv(args.InputCSVPath)

# Time the filter operation
start_time = time.time()
filtered_df = df[df.job == "Social worker"]
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Save output csv
filtered_df.to_csv(args.OutputCSVPath, index=False)

# Print and save the elapsed time
print(f"Filter operation took {elapsed_time:.6f} seconds")
with open("timer_value_pandas.txt", "w") as f:
    f.write(f"{elapsed_time:.6f}")
Python-Polars (v1.15.0)
import polars as pl
import argparse
import time

parser = argparse.ArgumentParser()
parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
args = parser.parse_args()

# Read csv
df = pl.read_csv(args.InputCSVPath)

# Time the filter operation; collect() executes the lazy query so the
# filter actually runs inside the timed region (str.contains is a
# substring match, which here selects the same rows as exact equality)
start_time = time.time()
filtered_df = df.lazy().filter(pl.col("job").str.contains("Social worker")).collect()
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Save output csv
filtered_df.write_csv(file=args.OutputCSVPath)

# Print and save the elapsed time
print(f"Filter operation took {elapsed_time:.6f} seconds")
with open("timer_value_polars.txt", "w") as f:
    f.write(f"{elapsed_time:.6f}")

Observations

The observations from three independent runs were post-processed in Python to create visualizations of the filter execution time (in seconds) and the potential gains (in percent).

The potential gains chart shows the gains achievable by moving from the bottom-ranked (slowest) approach to the top-ranked (fastest) approach for a given run and input data variation, computed as (max - min) / max * 100 over the filter execution times. Note that the 100% figure for the 10,000-row case in the first run is an artifact by construction: the pandas filter at that size was recorded as 0 seconds, below the resolution of time.time().
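
As a sanity check, the gains can be reproduced from the benchmarking tables with a few lines of pandas. The sketch below is illustrative only; the DataFrame is built inline from the 40,000-row group of the first run rather than read from the actual result files.

import pandas as pd

# Filter execution times for the 40,000-row group of the first run
df = pd.DataFrame({
    "Tool": ["Rust", "DuckDB", "PythonPandas", "PythonPolars"],
    "RowCount": [40000] * 4,
    "FilterExecutionTimeSeconds": [0.000546, 0.0006904, 0.001992, 0.002965],
})

# Potential gains per row count: (max - min) / max * 100
gains = (
    df.groupby("RowCount")["FilterExecutionTimeSeconds"]
    .agg(["max", "min"])
    .assign(
        PotentialGainsInPercent=lambda g: (g["max"] - g["min"]) / g["max"] * 100
    )
)
print(gains)  # RowCount 40000 -> 81.5852, matching the first-run gains table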

The charts and the corresponding observed data are as follows.

First run

Benchmarking table
Tool RowCount FilterExecutionTimeSeconds TotalETLExecutionTimeSeconds RowsKept Rank
PythonPandas 10000 0 1.59072 16 1
DuckDB 10000 0.0003136 0.23235 16 2
Rust 10000 0.000824 1.29801 16 3
PythonPolars 10000 0.003963 1.39 16 4
Rust 40000 0.000546 0.782205 71 1
DuckDB 40000 0.0006904 0.254279 71 2
PythonPandas 40000 0.001992 1.57421 71 3
PythonPolars 40000 0.002965 1.11135 71 4
Rust 160000 0.000891 0.858275 267 1
DuckDB 160000 0.0036428 0.429171 267 2
PythonPolars 160000 0.005001 1.16149 267 3
PythonPandas 160000 0.007999 2.2861 267 4
Rust 640000 0.002142 0.97948 941 1
PythonPolars 640000 0.008004 1.34265 941 2
DuckDB 640000 0.0168354 0.886077 941 3
PythonPandas 640000 0.02652 4.91786 941 4
Rust 2560000 0.009169 1.62762 4073 1
PythonPolars 2560000 0.023038 1.77006 4073 2
PythonPandas 2560000 0.095999 15.0998 4073 3
DuckDB 2560000 0.11907 2.70548 4073 4
Rust 5120000 0.013728 2.62723 7906 1
PythonPolars 5120000 0.030956 2.3495 7906 2
PythonPandas 5120000 0.186999 29.0901 7906 3
DuckDB 5120000 0.190211 4.18416 7906 4
Rust 10240000 0.026845 4.60357 16320 1
PythonPolars 10240000 0.070962 3.63995 16320 2
PythonPandas 10240000 0.380968 56.5589 16320 3
DuckDB 10240000 0.383349 8.3484 16320 4

Potential gains

Potential gains table
RowCount MaxFilterTimeSeconds MinFilterTimeSeconds PotentialGainsInPercent
10000 0.003963 0 100
40000 0.002965 0.000546 81.5852
160000 0.007999 0.000891 88.8611
640000 0.02652 0.002142 91.9231
2560000 0.119069 0.009169 92.2995
5120000 0.190211 0.013728 92.7828
10240000 0.383348 0.026845 92.9972

Second run

Benchmarking table
Tool RowCount FilterExecutionTimeSeconds TotalETLExecutionTimeSeconds RowsKept Rank
DuckDB 10000 0.0003131 0.190182 16 1
Rust 10000 0.000437 0.926346 16 2
PythonPandas 10000 0.000996 1.55787 16 3
PythonPolars 10000 0.003999 1.21663 16 4
Rust 40000 0.00061 0.752303 71 1
DuckDB 40000 0.0007761 0.248306 71 2
PythonPandas 40000 0.001998 1.57128 71 3
PythonPolars 40000 0.003996 1.10915 71 4
Rust 160000 0.000857 0.880076 267 1
DuckDB 160000 0.0032095 0.436043 267 2
PythonPolars 160000 0.006031 1.15024 267 3
PythonPandas 160000 0.006996 2.28893 267 4
Rust 640000 0.002069 1.05303 941 1
PythonPolars 640000 0.008003 1.32438 941 2
DuckDB 640000 0.018133 0.891197 941 3
PythonPandas 640000 0.025001 4.9596 941 4
Rust 2560000 0.00735 1.6418 4073 1
PythonPolars 2560000 0.017002 1.78237 4073 2
DuckDB 2560000 0.0936088 2.47015 4073 3
PythonPandas 2560000 0.095999 15.308 4073 4
Rust 5120000 0.013554 2.48379 7906 1
PythonPolars 5120000 0.044001 2.50702 7906 2
PythonPandas 5120000 0.195 28.8966 7906 3
DuckDB 5120000 0.220705 4.43871 7906 4
Rust 10240000 0.026198 4.66129 16320 1
PythonPolars 10240000 0.066998 3.60019 16320 2
PythonPandas 10240000 0.388001 56.7389 16320 3
DuckDB 10240000 0.405525 8.224 16320 4

Potential gains

Potential gains table
RowCount MaxFilterTimeSeconds MinFilterTimeSeconds PotentialGainsInPercent
10000 0.003999 0.0003131 92.1705
40000 0.003996 0.00061 84.7347
160000 0.006996 0.000857 87.7501
640000 0.025001 0.002069 91.7243
2560000 0.095999 0.00735 92.3437
5120000 0.220705 0.013554 93.8588
10240000 0.405525 0.026198 93.5397

Third run

Benchmarking table
Tool RowCount FilterExecutionTimeSeconds TotalETLExecutionTimeSeconds RowsKept Rank
DuckDB 10000 0.000285 0.17654 16 1
Rust 10000 0.000608 0.945347 16 2
PythonPandas 10000 0.000996 1.58242 16 3
PythonPolars 10000 0.002948 1.22628 16 4
Rust 40000 0.000573 0.870539 71 1
DuckDB 40000 0.0006818 0.262743 71 2
PythonPandas 40000 0.002001 1.54979 71 3
PythonPolars 40000 0.004032 1.11109 71 4
Rust 160000 0.000843 0.929579 267 1
DuckDB 160000 0.0037436 0.465458 267 2
PythonPolars 160000 0.003963 1.21471 267 3
PythonPandas 160000 0.008001 2.24316 267 4
Rust 640000 0.003373 1.05661 941 1
PythonPolars 640000 0.005959 1.32258 941 2
DuckDB 640000 0.0181432 0.937698 941 3
PythonPandas 640000 0.026 4.96609 941 4
Rust 2560000 0.008221 1.68359 4073 1
PythonPolars 2560000 0.017973 1.76549 4073 2
PythonPandas 2560000 0.097 15.1252 4073 3
DuckDB 2560000 0.124572 2.38221 4073 4
Rust 5120000 0.014459 2.48305 7906 1
PythonPolars 5120000 0.030001 2.37348 7906 2
PythonPandas 5120000 0.201997 29.1571 7906 3
DuckDB 5120000 0.214994 4.43892 7906 4
Rust 10240000 0.03147 4.73056 16320 1
PythonPolars 10240000 0.053993 3.62662 16320 2
PythonPandas 10240000 0.388552 58.1226 16320 3
DuckDB 10240000 0.463832 8.5168 16320 4

Potential gains

Potential gains table
RowCount MaxFilterTimeSeconds MinFilterTimeSeconds PotentialGainsInPercent
10000 0.002948 0.000285 90.3324
40000 0.004032 0.000573 85.7887
160000 0.008001 0.000843 89.4638
640000 0.026 0.003373 87.0269
2560000 0.124572 0.008221 93.4006
5120000 0.214994 0.014459 93.2747
10240000 0.463832 0.03147 93.2152

Questions

For questions on your specific filtering case, kindly open a new topic to get individual support.
