This post describes a datatable filtering benchmark of four non-native datatable filtering approaches in UiPath Studio. It aims to improve upon a previously reported benchmarking exercise.
To avoid duplication, details on input data generation, the generic filtering condition, and the hardware specifications of the host machine can be found in the first post of this performance series.
Integrations in UiPath Studio
Any non-native approach requires additional effort to integrate with UiPath Studio. In this case, the Start Process activity in UiPath is used to invoke the Rust, DuckDB, and Python scripts. Note that the "Filter Execution Time" is strictly limited to the filter operation and does not include additional steps such as extraction, loading, or writing. This ensures that the non-native approaches can be compared with the native filtering approaches previously reported in this performance series.
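As a rough illustration of this integration (not the exact workflow configuration), the Start Process activity essentially spawns the external executable with the input and output file paths as arguments. The sketch below shows the equivalent call for the pandas script; the script name and file paths are hypothetical.
import subprocess
# Hypothetical paths; in the workflow these come from UiPath variables
input_csv = r"C:\data\input_10000.csv"
output_csv = r"C:\data\filtered_output.csv"
# Equivalent of the Start Process activity invoking the pandas script
subprocess.run(["python", r"C:\scripts\filter_pandas.py", input_csv, output_csv], check=True)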
CPU processing
Note that all of the external approaches below were run on the CPU and not on a GPU (although most of them support GPU processing). Again, this was done so that the native and non-native approaches can later be compared objectively; an apples-to-apples comparison, so to speak.
Below are the four non-native datatable filtering approaches and their corresponding scripts. We would like to thank the Perplexity LLM for its help in fixing bugs in some of the scripts below.
Rust (v1.27.1)
use polars::{prelude::*, lazy::dsl::col};
use std::env;
use std::time::Instant;
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Input and output CSV paths are passed as positional arguments
    let args: Vec<String> = env::args().collect();
    let csv_path = &args[1];
    let result_path = &args[2];

    // Read the CSV file into a polars DataFrame
    let df = CsvReader::from_path(csv_path).unwrap().finish().unwrap();

    // Predicate used for filtering
    let predicate = col("job").eq(lit("Social worker"));

    // Start timing the filter operation
    let start = Instant::now();
    let filtered_df = df
        .clone()
        .lazy()
        .filter(predicate)
        .collect()
        .unwrap();
    // End timing and calculate the duration
    let duration_seconds = start.elapsed().as_secs_f64();
    println!("{:.6}", duration_seconds);

    let mut output = filtered_df;

    // Write the filtered rows to a CSV file
    let mut file = File::create(result_path)?;
    CsvWriter::new(&mut file).finish(&mut output).unwrap();

    // Save the timer value to a text file
    let mut timer_file = File::create("timer_value.txt")?;
    writeln!(timer_file, "{:.6}", duration_seconds)?;
    Ok(())
}
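The compiled binary takes the input CSV path and the result CSV path as its two positional arguments; it prints the elapsed filter time to standard output and also writes it to timer_value.txt so the workflow can pick it up.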
DuckDB (v1.1.13)
-c "
CREATE TABLE TABLENAME AS SELECT * FROM read_csv_auto('FULLFILEPATH');
PRAGMA enable_profiling='json';
PRAGMA profiling_output='timer_value_duckdb.json';
CREATE TABLE FILTERED_TABLE AS SELECT * FROM TABLENAME WHERE job='Social worker';
PRAGMA disable_profiling;
SELECT COUNT(*) FROM FILTERED_TABLE;
COPY FILTERED_TABLE TO 'OUTPUTPATH' (HEADER, DELIMITER ',');
DROP TABLE TABLENAME;
DROP TABLE FILTERED_TABLE;
"
Python-Pandas (v2.2.3)
import pandas as pd
import argparse
import time
parser = argparse.ArgumentParser()
parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
args = parser.parse_args()
# Read csv
df = pd.read_csv(args.InputCSVPath)
# Time the filter operation
start_time = time.time()
filtered_df = df[df.job == "Social worker"]
end_time = time.time()
# Calculate elapsed time
elapsed_time = end_time - start_time
# Save output csv
filtered_df.to_csv(args.OutputCSVPath, index=False)
# Print and save the elapsed time
print(f"Filter operation took {elapsed_time:.6f} seconds")
with open("timer_value_pandas.txt", "w") as f:
f.write(f"{elapsed_time:.6f}")
Python-Polars (v1.15.0)
import polars as pl
import argparse
import time
parser = argparse.ArgumentParser()
parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
args = parser.parse_args()
# Read csv
df = pl.read_csv(args.InputCSVPath)
# Time the filter operation
start_time = time.time()
filtered_df = df.lazy().filter(pl.col("job").str.contains("Social worker")).collect()
end_time = time.time()
# Calculate elapsed time
elapsed_time = end_time - start_time
# Save output csv
filtered_df.write_csv(file=args.OutputCSVPath)
# Print and save the elapsed time
print(f"Filter operation took {elapsed_time:.6f} seconds")
with open("timer_value_polars.txt", "w") as f:
f.write(f"{elapsed_time:.6f}")
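Note that with polars the .collect() call is what actually executes the lazy query; without it, only the query plan is built and the timed block would not measure the filter itself.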
Observations
The observations from three independent runs were post-processed in Python to create visualizations of the filter execution time in seconds and the potential gains in percent.
The potential gains chart shows how much filter time could be saved by using the top-ranked approach instead of the bottom-ranked approach for a given run and input data size; a minimal sketch of this computation is shown below.
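A minimal pandas sketch of this post-processing step, assuming a run's benchmarking table has been loaded into a DataFrame run_df with the column names used in the tables below (the function name is our own):
import pandas as pd

def potential_gains(run_df: pd.DataFrame) -> pd.DataFrame:
    """Per input size, compare the slowest and fastest filter execution times."""
    gains = (
        run_df.groupby("RowCount")["FilterExecutionTimeSeconds"]
        .agg(MaxFilterTimeSeconds="max", MinFilterTimeSeconds="min")
        .reset_index()
    )
    # Relative saving of the top-ranked approach versus the bottom-ranked one
    gains["PotentialGainsInPercent"] = (
        (gains["MaxFilterTimeSeconds"] - gains["MinFilterTimeSeconds"])
        / gains["MaxFilterTimeSeconds"] * 100
    )
    return gains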
The charts and the corresponding observed data are as follows.
First run
Benchmarking table
Tool | RowCount | FilterExecutionTimeSeconds | TotalETLExecutionTimeSeconds | RowsKept | Rank |
---|---|---|---|---|---|
PythonPandas | 10000 | 0 | 1.59072 | 16 | 1 |
DuckDB | 10000 | 0.0003136 | 0.23235 | 16 | 2 |
Rust | 10000 | 0.000824 | 1.29801 | 16 | 3 |
PythonPolars | 10000 | 0.003963 | 1.39 | 16 | 4 |
Rust | 40000 | 0.000546 | 0.782205 | 71 | 1 |
DuckDB | 40000 | 0.0006904 | 0.254279 | 71 | 2 |
PythonPandas | 40000 | 0.001992 | 1.57421 | 71 | 3 |
PythonPolars | 40000 | 0.002965 | 1.11135 | 71 | 4 |
Rust | 160000 | 0.000891 | 0.858275 | 267 | 1 |
DuckDB | 160000 | 0.0036428 | 0.429171 | 267 | 2 |
PythonPolars | 160000 | 0.005001 | 1.16149 | 267 | 3 |
PythonPandas | 160000 | 0.007999 | 2.2861 | 267 | 4 |
Rust | 640000 | 0.002142 | 0.97948 | 941 | 1 |
PythonPolars | 640000 | 0.008004 | 1.34265 | 941 | 2 |
DuckDB | 640000 | 0.0168354 | 0.886077 | 941 | 3 |
PythonPandas | 640000 | 0.02652 | 4.91786 | 941 | 4 |
Rust | 2560000 | 0.009169 | 1.62762 | 4073 | 1 |
PythonPolars | 2560000 | 0.023038 | 1.77006 | 4073 | 2 |
PythonPandas | 2560000 | 0.095999 | 15.0998 | 4073 | 3 |
DuckDB | 2560000 | 0.11907 | 2.70548 | 4073 | 4 |
Rust | 5120000 | 0.013728 | 2.62723 | 7906 | 1 |
PythonPolars | 5120000 | 0.030956 | 2.3495 | 7906 | 2 |
PythonPandas | 5120000 | 0.186999 | 29.0901 | 7906 | 3 |
DuckDB | 5120000 | 0.190211 | 4.18416 | 7906 | 4 |
Rust | 10240000 | 0.026845 | 4.60357 | 16320 | 1 |
PythonPolars | 10240000 | 0.070962 | 3.63995 | 16320 | 2 |
PythonPandas | 10240000 | 0.380968 | 56.5589 | 16320 | 3 |
DuckDB | 10240000 | 0.383349 | 8.3484 | 16320 | 4 |
Potential gains
Potential gains table
RowCount | MaxFilterTimeSeconds | MinFilterTimeSeconds | PotentialGainsInPercent |
---|---|---|---|
10000 | 0.003963 | 0 | 100 |
40000 | 0.002965 | 0.000546 | 81.5852 |
160000 | 0.007999 | 0.000891 | 88.8611 |
640000 | 0.02652 | 0.002142 | 91.9231 |
2560000 | 0.119069 | 0.009169 | 92.2995 |
5120000 | 0.190211 | 0.013728 | 92.7828 |
10240000 | 0.383348 | 0.026845 | 92.9972 |
Second run
Benchmarking table
Tool | RowCount | FilterExecutionTimeSeconds | TotalETLExecutionTimeSeconds | RowsKept | Rank |
---|---|---|---|---|---|
DuckDB | 10000 | 0.0003131 | 0.190182 | 16 | 1 |
Rust | 10000 | 0.000437 | 0.926346 | 16 | 2 |
PythonPandas | 10000 | 0.000996 | 1.55787 | 16 | 3 |
PythonPolars | 10000 | 0.003999 | 1.21663 | 16 | 4 |
Rust | 40000 | 0.00061 | 0.752303 | 71 | 1 |
DuckDB | 40000 | 0.0007761 | 0.248306 | 71 | 2 |
PythonPandas | 40000 | 0.001998 | 1.57128 | 71 | 3 |
PythonPolars | 40000 | 0.003996 | 1.10915 | 71 | 4 |
Rust | 160000 | 0.000857 | 0.880076 | 267 | 1 |
DuckDB | 160000 | 0.0032095 | 0.436043 | 267 | 2 |
PythonPolars | 160000 | 0.006031 | 1.15024 | 267 | 3 |
PythonPandas | 160000 | 0.006996 | 2.28893 | 267 | 4 |
Rust | 640000 | 0.002069 | 1.05303 | 941 | 1 |
PythonPolars | 640000 | 0.008003 | 1.32438 | 941 | 2 |
DuckDB | 640000 | 0.018133 | 0.891197 | 941 | 3 |
PythonPandas | 640000 | 0.025001 | 4.9596 | 941 | 4 |
Rust | 2560000 | 0.00735 | 1.6418 | 4073 | 1 |
PythonPolars | 2560000 | 0.017002 | 1.78237 | 4073 | 2 |
DuckDB | 2560000 | 0.0936088 | 2.47015 | 4073 | 3 |
PythonPandas | 2560000 | 0.095999 | 15.308 | 4073 | 4 |
Rust | 5120000 | 0.013554 | 2.48379 | 7906 | 1 |
PythonPolars | 5120000 | 0.044001 | 2.50702 | 7906 | 2 |
PythonPandas | 5120000 | 0.195 | 28.8966 | 7906 | 3 |
DuckDB | 5120000 | 0.220705 | 4.43871 | 7906 | 4 |
Rust | 10240000 | 0.026198 | 4.66129 | 16320 | 1 |
PythonPolars | 10240000 | 0.066998 | 3.60019 | 16320 | 2 |
PythonPandas | 10240000 | 0.388001 | 56.7389 | 16320 | 3 |
DuckDB | 10240000 | 0.405525 | 8.224 | 16320 | 4 |
Potential gains
Potential gains table
RowCount | MaxFilterTimeSeconds | MinFilterTimeSeconds | PotentialGainsInPercent |
---|---|---|---|
10000 | 0.003999 | 0.0003131 | 92.1705 |
40000 | 0.003996 | 0.00061 | 84.7347 |
160000 | 0.006996 | 0.000857 | 87.7501 |
640000 | 0.025001 | 0.002069 | 91.7243 |
2560000 | 0.095999 | 0.00735 | 92.3437 |
5120000 | 0.220705 | 0.013554 | 93.8588 |
10240000 | 0.405525 | 0.026198 | 93.5397 |
Third run
Benchmarking table
Tool | RowCount | FilterExecutionTimeSeconds | TotalETLExecutionTimeSeconds | RowsKept | Rank |
---|---|---|---|---|---|
DuckDB | 10000 | 0.000285 | 0.17654 | 16 | 1 |
Rust | 10000 | 0.000608 | 0.945347 | 16 | 2 |
PythonPandas | 10000 | 0.000996 | 1.58242 | 16 | 3 |
PythonPolars | 10000 | 0.002948 | 1.22628 | 16 | 4 |
Rust | 40000 | 0.000573 | 0.870539 | 71 | 1 |
DuckDB | 40000 | 0.0006818 | 0.262743 | 71 | 2 |
PythonPandas | 40000 | 0.002001 | 1.54979 | 71 | 3 |
PythonPolars | 40000 | 0.004032 | 1.11109 | 71 | 4 |
Rust | 160000 | 0.000843 | 0.929579 | 267 | 1 |
DuckDB | 160000 | 0.0037436 | 0.465458 | 267 | 2 |
PythonPolars | 160000 | 0.003963 | 1.21471 | 267 | 3 |
PythonPandas | 160000 | 0.008001 | 2.24316 | 267 | 4 |
Rust | 640000 | 0.003373 | 1.05661 | 941 | 1 |
PythonPolars | 640000 | 0.005959 | 1.32258 | 941 | 2 |
DuckDB | 640000 | 0.0181432 | 0.937698 | 941 | 3 |
PythonPandas | 640000 | 0.026 | 4.96609 | 941 | 4 |
Rust | 2560000 | 0.008221 | 1.68359 | 4073 | 1 |
PythonPolars | 2560000 | 0.017973 | 1.76549 | 4073 | 2 |
PythonPandas | 2560000 | 0.097 | 15.1252 | 4073 | 3 |
DuckDB | 2560000 | 0.124572 | 2.38221 | 4073 | 4 |
Rust | 5120000 | 0.014459 | 2.48305 | 7906 | 1 |
PythonPolars | 5120000 | 0.030001 | 2.37348 | 7906 | 2 |
PythonPandas | 5120000 | 0.201997 | 29.1571 | 7906 | 3 |
DuckDB | 5120000 | 0.214994 | 4.43892 | 7906 | 4 |
Rust | 10240000 | 0.03147 | 4.73056 | 16320 | 1 |
PythonPolars | 10240000 | 0.053993 | 3.62662 | 16320 | 2 |
PythonPandas | 10240000 | 0.388552 | 58.1226 | 16320 | 3 |
DuckDB | 10240000 | 0.463832 | 8.5168 | 16320 | 4 |
Potential gains
Potential gains table
RowCount | MaxFilterTimeSeconds | MinFilterTimeSeconds | PotentialGainsInPercent |
---|---|---|---|
10000 | 0.002948 | 0.000285 | 90.3324 |
40000 | 0.004032 | 0.000573 | 85.7887 |
160000 | 0.008001 | 0.000843 | 89.4638 |
640000 | 0.026 | 0.003373 | 87.0269 |
2560000 | 0.124572 | 0.008221 | 93.4006 |
5120000 | 0.214994 | 0.014459 | 93.2747 |
10240000 | 0.463832 | 0.03147 | 93.2152 |
Questions
For questions on your specific filtering case, kindly open a new topic and get individual support.