LINQ query running for long hours

Hi @ppr

Below is the xaml file

MatchData.xaml (13.3 KB)

Note - I have used the UiPathTeam.XLExcel.Activites package for reading extra-large Excel files

Please suggest! Also let me know if you need anything else.

I have also started testing that empty loop approach on my system.

Thanks

Perfect, the xaml is minimally modeled.

Now go into the different rounds as mentioned above:

Round 1 - do nothing, only loop:

grafik

Round 2 - loop and set a hardcoded value
grafik

The intention of this is to get a feel for the general performance and to isolate the low-performing action (surely we do feel that the match check will get more attention in the next rounds)

Let us know the result. Feel free to surround the block with a Stopwatch:



myStopWatch.Elapsed.ToString("hh\:mm\:ss\.fff")
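The post uses .NET's System.Diagnostics.Stopwatch inside UiPath; as a hedged illustration only, here is a Python sketch of the same "surround the core block and log the elapsed time" pattern (the core block body is a made-up placeholder):

```python
# Hypothetical sketch of the Stopwatch pattern using time.perf_counter():
# start the timer, run the core block, read the elapsed time, and format it
# like the "hh:mm:ss.fff" pattern shown above.
import time

def run_core_block():
    # placeholder for the block being measured (e.g. the match loop)
    return sum(i * i for i in range(100_000))

start = time.perf_counter()              # Stopwatch.Start()
result = run_core_block()
elapsed = time.perf_counter() - start    # Stopwatch.Elapsed (in seconds)

# format the elapsed seconds as hh:mm:ss.fff
hours, rem = divmod(elapsed, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{int(hours):02d}:{int(minutes):02d}:{seconds:06.3f}")
```

The same measurement surrounds every round below, so timings stay comparable.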

Hi @ppr

I worked on the things you mentioned

Both rounds were fairly quick.

Round 1 - Empty parallel for each loop

image

Result -
image

Round 2 - Hardcoded value

image

Result

image

Please suggest further.

Thanks !!

Perfect, now let's explore the performance break:
(using the original implementation from above e.g. your XAML)
grafik

Round 1 -
dtOldSet.AsEnumerable.Take(1)

Round 2 -
dtOldSet.AsEnumerable.Take(2)

Round 3 -
dtOldSet.AsEnumerable.Take(25)

We would like to get a feel for the parallelization effort and the behaviour of the contains check.

Again let us know the timings from the stopwatch.elapsed … log

Hi @ppr

I ran the other rounds as you suggested; below are the results.

Round 1 - dtOldSet.AsEnumerable.Take(1)

image

Round 2 - dtOldSet.AsEnumerable.Take(2)

image

Round 3 - dtOldSet.AsEnumerable.Take(25)

image

Thanks!!

It is confusing why the fill job (including the contains check and the data column update)
grafik

is faster than the do-nothing / hardcoded-value update.

But now let's increase the segments to take 100, 250, 500, 1000, 5000 …
and explore
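The test series above can be sketched as a loop over increasing slice sizes, timing the contains check for each slice. This Python sketch is an illustration only; `new_keys` and `old_rows` are made-up stand-ins for the concatenated keys of the two DataTables:

```python
# Hypothetical sketch of the test series: time a List.Contains-style
# membership check over the first n rows only, for increasing n
# (the equivalent of dtOldSet.AsEnumerable.Take(n)).
import time

new_keys = [f"row#{i}" for i in range(50_000)]       # stand-in for dtNewSet keys
old_rows = [f"row#{i * 3}" for i in range(10_000)]   # stand-in for dtOldSet keys

for n in (100, 250, 500, 1000, 5000):
    start = time.perf_counter()
    matches = sum(1 for row in old_rows[:n] if row in new_keys)  # linear scan
    elapsed = time.perf_counter() - start
    print(f"Take({n}): {matches} matches in {elapsed:.3f}s")
```

The per-slice timings can then be extrapolated to the full row count, which is exactly what the rounds below do.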

BTW: is the execution bot machine well equipped (e.g. RAM, processor), and could it be the case that other processes are running and consuming power?

Hi @ppr

I increased the segments; below are the results.

Round 1 - For 100

image

Round 2 - For 250, please ignore the value 416125

image

Round 3 - For 500

image

Round 4 - for 1000

image

Round 5 - for 5000

image

While I run the bot in full-fledged mode I close other processes, so the main things running and consuming major memory/RAM are UiPath and the robot only.

System specs - Processor: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
RAM: 16 GB

Thanks!!

Perfect. Extrapolating from the 5000 set to a 400,000 set, we would come to numbers below 2-3 hrs, but that still takes some time as well.

We identified that the contains check is the main target for the tuning.

Do we have duplicates within the dtNewSet? Do we have entries within the dtNewSet which do not have a match in dtOldSet? What is a typical/average string length when all 8 cols are concatenated?

For preparing alternates for the match check, let us also check the following (surrounding it with the Stopwatch and measuring helps a lot and is recommended).

To start exploring alternates, also check how long it takes to create a lookup Dictionary:

dictNewSetLK | Dictionary(Of String, Boolean) =
dtNewSet.AsEnumerable.Select(Function (x) String.Join("#", arrColSet.Select(Function (k) x(k).toString.ToLower.Trim))).Distinct().ToDictionary(Function (x) x, Function (x) True)

The idea here is to get faster checks by using Dictionary.ContainsKey instead of List.Contains.
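The lookup-dictionary build can be mirrored outside UiPath for illustration. This Python sketch uses made-up column names and rows; it follows the same steps as the expression above (concatenate the key columns with "#" after lower/trim, deduplicate, store in a dict so membership checks are hashed instead of scanned):

```python
# Hypothetical sketch mirroring the dictNewSetLK build: join the key columns
# with "#", normalize with lower/trim, deduplicate, and keep the result in a
# dict so ContainsKey-style checks are O(1) on average instead of O(n).
rows = [
    {"col1": " Alice ", "col2": "NY"},
    {"col1": "bob",     "col2": "LA"},
    {"col1": "Alice",   "col2": "ny"},   # duplicate after lower/trim
]
arr_col_set = ["col1", "col2"]           # stand-in for arrColSet

dict_new_set_lk = {
    "#".join(str(r[k]).lower().strip() for k in arr_col_set): True
    for r in rows                        # dict keys deduplicate like Distinct()
}
print(dict_new_set_lk)                   # {'alice#ny': True, 'bob#la': True}

# the match check is then a hash lookup, not a list scan:
print(dict_new_set_lk.get("alice#ny", False))   # True
print(dict_new_set_lk.get("missing", False))    # False
```

The speed difference comes from the data structure: a list `Contains` walks every element per check, while a dictionary hashes the key once.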

Hi @ppr

Please find below response -

  • Yes, we can have duplicates in dtNewSet. FYI, as per the above implementation we used a Distinct function on the list you said to create.

  • Yes, there can be such entries.
  • When all columns are concatenated, the typical string length can vary from 500 to 1500 words for both dtOldSet and dtNewSet.

Also, I have one question on the below:

Does the above query need to be in a parallel loop, or do you just require the stopwatch execution time of this query when used in a single Assign?

Thanks

Hi @ppr

I tried the above code in an assign activity, below is the result

image

Code ran -

Also, the answers to the questions you asked are provided in the post before this one.

Thanks!!!

Hi @ppr

Can you please suggest further?
I have answered your queries in the above 2 posts

Thanks!!!

yes, we are working on it. Just give us a little time. Thanks

Yes Sure.

Thanks a lot for researching on this and helping me.
Really appreciate.

Thanks

Let us introduce the next changes:

we are creating lists with the concatenated strings from dtOldSet and dtNewSet,
and creating a dictionary of only the items common in both lists

dtNewSet.AsEnumerable.Select(Function (x) String.Join("#", arrColSet.Select(Function (k) x(k).toString.ToLower.Trim))).Distinct().ToList
dtOldSet.AsEnumerable.Select(Function (x) String.Join("#", arrColSet.Select(Function (k) x(k).toString.ToLower.Trim))).Distinct().ToList
ListDT1.Intersect(ListDT2).ToDictionary(Function (x) x, Function (x) True)

Then we simplified the processing by keeping the ja string
and using the ContainsKey result for the column value update.
(It now writes "True"/"False" instead of yes/no; we are interested in speed currently.)

Important to know about Intersect: it deduplicates as well. This needs to be respected when duplicates must be kept (e.g. when finding matches).
grafik
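The three expressions above can be sketched in Python for illustration, including the deduplication caveat of Intersect. The rows, column names, and the `match` column here are made-up stand-ins for the DataTables:

```python
# Hypothetical sketch: build the two distinct key lists, intersect them into a
# lookup dict, then update a column per row via a ContainsKey-style check.
# Note: like LINQ's Intersect, a set intersection deduplicates.
def make_keys(rows, cols):
    # distinct concatenated keys, normalized with lower/trim per column
    return list({"#".join(str(r[c]).lower().strip() for c in cols) for r in rows})

cols = ["id", "name"]
dt_new = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 2, "name": "B"}]
dt_old = [{"id": 2, "name": "B"}, {"id": 3, "name": "C"}]

list_dt1 = make_keys(dt_new, cols)                          # keys from dtNewSet
list_dt2 = make_keys(dt_old, cols)                          # keys from dtOldSet
dict_lk = {k: True for k in set(list_dt1) & set(list_dt2)}  # Intersect → dict
print(dict_lk)                                              # {'2#b': True}

# column value update via the lookup (writes True/False for now)
for row in dt_old:
    ja = "#".join(str(row[c]).lower().strip() for c in cols)
    row["match"] = dict_lk.get(ja, False)
```

Because the lookup dict contains only the keys common to both sides, each row needs exactly one hash lookup instead of a scan over the whole checklist.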

Summary of the last optimization actions:

  • reduce the checklist
  • shift to Dictionary.ContainsKey, as we want to check whether this is faster than List.Contains

Result: Reducing the execution time from 24+ hrs to 10-20 secs



General Analysis Pattern for Performance Optimization

When execution time needs to be optimized, the following general actions can be applied:

Cleansing

  • remove any unneeded actions, e.g. Log Message / Write Line…, from the core block whose execution needs to be optimized

Measurement

  • surround the core block with a Stopwatch and trace the execution time
  • run the core block and cancel it when the execution is taking too long

Data Volume Reduction

  • Apply techniques like Take(x) or a reduced test data set to get an initial understanding of the core block's execution timings, and interpolate to the full data volume

Parallelization

  • check and enable parallel execution of the core block if possible. The core block may have to be rewritten or modified

Isolation

  • Isolate the time-consuming parts within the core block

Optimization

  • replace the time-consuming parts with alternates and run the test series again

Finalization

  • Once the optimized core block implementation is found, run and measure it on the entire data volume

Hi @ppr

I ran the code you provided in the xaml.

Below is the execution result

image

Thanks!!

Hi @ppr

Thanks a lot for your efforts.

I have few questions.

  1. We are using Intersect like you mentioned; post-intersect it will only keep the unique elements. I have duplicate data in both dtNewSet and dtOldSet and want the duplicates to stay. Will this processing remove our duplicates?
  2. Currently the logic is writing True/False; how can we change it to Yes/No in the code? Please suggest.
  3. We are running a Parallel For Each, which runs asynchronously and not in a sequential manner. Will the results be the same and correct as those we might obtain from the LINQ query?

Also, if we need to take some more steps forward on the implementation part, please guide.

Thanks!!

Hi @ppr

Can you please help/guide on the above question thread?

Thanks!

Let us first summarize - we progressed and reduced from 24+ hrs to seconds, right?

we are removing duplicates within the checklist
we are not removing rows from dtOldSet
you can also check the rows count dtOldSet.Rows.Count after the run

As we progressed, we can maybe invest some milliseconds for the yes/no:
If(dictLK.ContainsKey(ja),"yes", "no")
grafik
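For illustration only, the same conditional in a Python sketch (`dict_lk` stands in for the dictLK lookup dictionary):

```python
# Hypothetical sketch of If(dictLK.ContainsKey(ja), "yes", "no"):
# map the dictionary lookup result to the desired yes/no label per row.
dict_lk = {"2#b": True}   # made-up lookup dictionary

for ja in ("2#b", "9#z"):
    label = "yes" if ja in dict_lk else "no"
    print(ja, "->", label)   # 2#b -> yes, 9#z -> no
```

The extra conditional costs only a few milliseconds over the raw True/False write.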

For answering the question in detail we would be forced to look at a few technical things behind the scenes as well. We expect that the Parallel For Each will work as intended.

If you find an inconsistency, then share it with the UiPath tech team for a replication and analysis round.

Can we finalize the topic?

Hi @ppr

Thanks for the clarification.

In summary, we were able to make the process efficient, going from hours to minutes.

  • we tried the list approach and the dictionary approach to match the data between tables
  • the dictionary approach worked far better than the list approach

Big thanks for working on the issue and helping on it.
I will let you know if I face any issues with this processing, logic-wise or otherwise, when I verify the final results.

Thanks!!


This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.