[Benchmarking] - Filtering a datatable using non-native approaches in UiPath Studio

This post benchmarks four non-native datatable filtering approaches in UiPath Studio. It aims to improve upon a previously reported benchmarking exercise.

To avoid duplication, details regarding input data generation, the generic filtering condition, and the hardware specifications of the host machine can be found in the first post of this performance series.

Integrations in UiPath Studio

Any non-native approach requires additional effort to integrate with UiPath Studio. In this case, the Start Process activity in UiPath is used to invoke the Rust, DuckDB, and Python scripts. Note that the "Filter Execution Time" is strictly limited to the filter operation and does not include additional steps such as extraction, loading, or writing. This ensures that the non-native approaches can be compared with the native filtering approaches previously reported in this performance series.
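To illustrate the integration pattern, the minimal Python sketch below mimics what the invocation does for the pandas script: the filter time printed to standard output is captured and parsed by the caller. This is illustrative only, not the actual workflow; the script name and file paths are hypothetical. (Each script also writes the timer value to a text file, which a workflow can read instead of parsing standard output.)

import subprocess

# Invoke the external filter script (hypothetical script name and paths)
result = subprocess.run(
    ["python", "filter_pandas.py", "input.csv", "output.csv"],
    capture_output=True,
    text=True,
    check=True,
)

# The pandas script prints e.g. "Filter operation took 0.001992 seconds";
# the fourth token is the elapsed time
filter_seconds = float(result.stdout.split()[3])
print(f"Filter execution time: {filter_seconds:.6f} s")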

CPU processing

Note that all the external approaches below were executed on the CPU, not on a GPU (although the majority of them support GPU processing). Again, this was done so that the native and non-native approaches can later be compared objectively; an apples-to-apples comparison, so to speak.

Below are the four non-native datatable filtering approaches and their corresponding scripts. We would like to thank the Perplexity LLM for help in fixing bugs in some of the scripts below.

Rust (v1.27.1)
use polars::{prelude::*, lazy::dsl::col};
use std::env;
use std::time::Instant;
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Read the input and output CSV paths from the command-line arguments
    let args: Vec<String> = env::args().collect();
    let csv_path = &args[1];
    let result_path = &args[2];

    // Read the CSV file into a Polars dataframe
    let df = CsvReader::from_path(csv_path).unwrap().finish().unwrap();

    // Define the filter predicate (job == "Social worker")
    let predicate = col("job").eq(lit("Social worker"));

    // Start timing the filter operation
    let start = Instant::now();

    let filtered_df = df
                    .clone()
                    .lazy()
                    .filter(predicate)
                    .collect()
                    .unwrap();

    // End timing and calculate duration
    let duration_seconds = start.elapsed().as_secs_f64();

    // Print the filter time so the calling process can capture it
    println!("{:.6}", duration_seconds);
    let mut output = filtered_df;
    
    // Writing to a csv file
    let mut file = File::create(result_path)?;
    CsvWriter::new(&mut file).finish(&mut output).unwrap();

    // Save the timer value to a text file
    let mut timer_file = File::create("timer_value.txt")?;
    writeln!(timer_file, "{:.6}", duration_seconds)?;

    Ok(())
}
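
The compiled binary takes the input and output CSV paths as its two positional arguments; the measured filter time is printed to standard output and also written to timer_value.txt.
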
DuckDB (v1.1.3)
-c "
-- Load the input CSV into a table (outside the timed region)
CREATE TABLE TABLENAME AS SELECT * FROM read_csv_auto('FULLFILEPATH');
-- Profile only the filter statement and write the timing as JSON
PRAGMA enable_profiling='json';
PRAGMA profiling_output='timer_value_duckdb.json';
CREATE TABLE FILTERED_TABLE AS SELECT * FROM TABLENAME WHERE job='Social worker';
PRAGMA disable_profiling;
SELECT COUNT(*) FROM FILTERED_TABLE;
-- Write the filtered rows to the output CSV and clean up
COPY FILTERED_TABLE TO 'OUTPUTPATH' (HEADER, DELIMITER ',');
DROP TABLE TABLENAME;
DROP TABLE FILTERED_TABLE;
"
Python-Pandas (v2.2.3)
import pandas as pd
import argparse
import time

parser = argparse.ArgumentParser()
parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
args = parser.parse_args()

# Read csv
df = pd.read_csv(args.InputCSVPath)

# Time the filter operation
start_time = time.time()
filtered_df = df[df.job == "Social worker"]
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Save output csv
filtered_df.to_csv(args.OutputCSVPath, index=False)

# Print and save the elapsed time
print(f"Filter operation took {elapsed_time:.6f} seconds")
with open("timer_value_pandas.txt", "w") as f:
    f.write(f"{elapsed_time:.6f}")
Python-Polars (v1.15.0)
import polars as pl
import argparse
import time

parser = argparse.ArgumentParser()
parser.add_argument("InputCSVPath", help="The path to the input csv file", type=str)
parser.add_argument("OutputCSVPath", help="The path to the output csv file", type=str)
args = parser.parse_args()

# Read csv
df = pl.read_csv(args.InputCSVPath)

# Time the filter operation; collect() executes the lazy query so the
# filter actually runs inside the timed region (str.contains is a
# substring match, which here selects the same rows as exact equality)
start_time = time.time()
filtered_df = df.lazy().filter(pl.col("job").str.contains("Social worker")).collect()
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Save output csv
filtered_df.write_csv(file=args.OutputCSVPath)

# Print and save the elapsed time
print(f"Filter operation took {elapsed_time:.6f} seconds")
with open("timer_value_polars.txt", "w") as f:
    f.write(f"{elapsed_time:.6f}")

Observations

The observations from three independent runs were post-processed in Python to create visualizations of the filter execution time (in seconds) and the potential gains (in percent).

The potential gains chart shows the gains achievable by moving from the bottom-ranked (slowest) approach to the top-ranked (fastest) approach for a given run and input data variation, computed as (max - min) / max * 100 over the filter execution times. Note that the 100% figure for the 10,000-row case in the first run is an artifact by construction: the pandas filter at that size was recorded as 0 seconds, below the resolution of time.time().
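
As a sanity check, the gains can be reproduced from the benchmarking tables with a few lines of pandas. The sketch below is illustrative only; the DataFrame is built inline from the 40,000-row group of the first run rather than read from the actual result files.

import pandas as pd

# Filter execution times for the 40,000-row group of the first run
df = pd.DataFrame({
    "Tool": ["Rust", "DuckDB", "PythonPandas", "PythonPolars"],
    "RowCount": [40000] * 4,
    "FilterExecutionTimeSeconds": [0.000546, 0.0006904, 0.001992, 0.002965],
})

# Potential gains per row count: (max - min) / max * 100
gains = (
    df.groupby("RowCount")["FilterExecutionTimeSeconds"]
    .agg(["max", "min"])
    .assign(
        PotentialGainsInPercent=lambda g: (g["max"] - g["min"]) / g["max"] * 100
    )
)
print(gains)  # RowCount 40000 -> 81.5852, matching the first-run gains table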

The charts and the corresponding observed data are as follows.

First run

Benchmarking table
Tool RowCount FilterExecutionTimeSeconds TotalETLExecutionTimeSeconds RowsKept Rank
PythonPandas 10000 0 1.59072 16 1
DuckDB 10000 0.0003136 0.23235 16 2
Rust 10000 0.000824 1.29801 16 3
PythonPolars 10000 0.003963 1.39 16 4
Rust 40000 0.000546 0.782205 71 1
DuckDB 40000 0.0006904 0.254279 71 2
PythonPandas 40000 0.001992 1.57421 71 3
PythonPolars 40000 0.002965 1.11135 71 4
Rust 160000 0.000891 0.858275 267 1
DuckDB 160000 0.0036428 0.429171 267 2
PythonPolars 160000 0.005001 1.16149 267 3
PythonPandas 160000 0.007999 2.2861 267 4
Rust 640000 0.002142 0.97948 941 1
PythonPolars 640000 0.008004 1.34265 941 2
DuckDB 640000 0.0168354 0.886077 941 3
PythonPandas 640000 0.02652 4.91786 941 4
Rust 2560000 0.009169 1.62762 4073 1
PythonPolars 2560000 0.023038 1.77006 4073 2
PythonPandas 2560000 0.095999 15.0998 4073 3
DuckDB 2560000 0.11907 2.70548 4073 4
Rust 5120000 0.013728 2.62723 7906 1
PythonPolars 5120000 0.030956 2.3495 7906 2
PythonPandas 5120000 0.186999 29.0901 7906 3
DuckDB 5120000 0.190211 4.18416 7906 4
Rust 10240000 0.026845 4.60357 16320 1
PythonPolars 10240000 0.070962 3.63995 16320 2
PythonPandas 10240000 0.380968 56.5589 16320 3
DuckDB 10240000 0.383349 8.3484 16320 4

Potential gains

Potential gains table
RowCount MaxFilterTimeSeconds MinFilterTimeSeconds PotentialGainsInPercent
10000 0.003963 0 100
40000 0.002965 0.000546 81.5852
160000 0.007999 0.000891 88.8611
640000 0.02652 0.002142 91.9231
2560000 0.119069 0.009169 92.2995
5120000 0.190211 0.013728 92.7828
10240000 0.383348 0.026845 92.9972

Second run

Benchmarking table
Tool RowCount FilterExecutionTimeSeconds TotalETLExecutionTimeSeconds RowsKept Rank
DuckDB 10000 0.0003131 0.190182 16 1
Rust 10000 0.000437 0.926346 16 2
PythonPandas 10000 0.000996 1.55787 16 3
PythonPolars 10000 0.003999 1.21663 16 4
Rust 40000 0.00061 0.752303 71 1
DuckDB 40000 0.0007761 0.248306 71 2
PythonPandas 40000 0.001998 1.57128 71 3
PythonPolars 40000 0.003996 1.10915 71 4
Rust 160000 0.000857 0.880076 267 1
DuckDB 160000 0.0032095 0.436043 267 2
PythonPolars 160000 0.006031 1.15024 267 3
PythonPandas 160000 0.006996 2.28893 267 4
Rust 640000 0.002069 1.05303 941 1
PythonPolars 640000 0.008003 1.32438 941 2
DuckDB 640000 0.018133 0.891197 941 3
PythonPandas 640000 0.025001 4.9596 941 4
Rust 2560000 0.00735 1.6418 4073 1
PythonPolars 2560000 0.017002 1.78237 4073 2
DuckDB 2560000 0.0936088 2.47015 4073 3
PythonPandas 2560000 0.095999 15.308 4073 4
Rust 5120000 0.013554 2.48379 7906 1
PythonPolars 5120000 0.044001 2.50702 7906 2
PythonPandas 5120000 0.195 28.8966 7906 3
DuckDB 5120000 0.220705 4.43871 7906 4
Rust 10240000 0.026198 4.66129 16320 1
PythonPolars 10240000 0.066998 3.60019 16320 2
PythonPandas 10240000 0.388001 56.7389 16320 3
DuckDB 10240000 0.405525 8.224 16320 4

Potential gains

Potential gains table
RowCount MaxFilterTimeSeconds MinFilterTimeSeconds PotentialGainsInPercent
10000 0.003999 0.0003131 92.1705
40000 0.003996 0.00061 84.7347
160000 0.006996 0.000857 87.7501
640000 0.025001 0.002069 91.7243
2560000 0.095999 0.00735 92.3437
5120000 0.220705 0.013554 93.8588
10240000 0.405525 0.026198 93.5397

Third run

Benchmarking table
Tool RowCount FilterExecutionTimeSeconds TotalETLExecutionTimeSeconds RowsKept Rank
DuckDB 10000 0.000285 0.17654 16 1
Rust 10000 0.000608 0.945347 16 2
PythonPandas 10000 0.000996 1.58242 16 3
PythonPolars 10000 0.002948 1.22628 16 4
Rust 40000 0.000573 0.870539 71 1
DuckDB 40000 0.0006818 0.262743 71 2
PythonPandas 40000 0.002001 1.54979 71 3
PythonPolars 40000 0.004032 1.11109 71 4
Rust 160000 0.000843 0.929579 267 1
DuckDB 160000 0.0037436 0.465458 267 2
PythonPolars 160000 0.003963 1.21471 267 3
PythonPandas 160000 0.008001 2.24316 267 4
Rust 640000 0.003373 1.05661 941 1
PythonPolars 640000 0.005959 1.32258 941 2
DuckDB 640000 0.0181432 0.937698 941 3
PythonPandas 640000 0.026 4.96609 941 4
Rust 2560000 0.008221 1.68359 4073 1
PythonPolars 2560000 0.017973 1.76549 4073 2
PythonPandas 2560000 0.097 15.1252 4073 3
DuckDB 2560000 0.124572 2.38221 4073 4
Rust 5120000 0.014459 2.48305 7906 1
PythonPolars 5120000 0.030001 2.37348 7906 2
PythonPandas 5120000 0.201997 29.1571 7906 3
DuckDB 5120000 0.214994 4.43892 7906 4
Rust 10240000 0.03147 4.73056 16320 1
PythonPolars 10240000 0.053993 3.62662 16320 2
PythonPandas 10240000 0.388552 58.1226 16320 3
DuckDB 10240000 0.463832 8.5168 16320 4

Potential gains

Potential gains table
RowCount MaxFilterTimeSeconds MinFilterTimeSeconds PotentialGainsInPercent
10000 0.002948 0.000285 90.3324
40000 0.004032 0.000573 85.7887
160000 0.008001 0.000843 89.4638
640000 0.026 0.003373 87.0269
2560000 0.124572 0.008221 93.4006
5120000 0.214994 0.014459 93.2747
10240000 0.463832 0.03147 93.2152

Questions

For questions on your specific filtering case, kindly open a new topic to get individual support.
