Monday, July 20, 2009

TBB Performance

My colleague who is working on optimization of quantitative calculations and plays around with Intel's thread building blocks (TBB) shared with me interesting performance results.

Let's say you trying to do a simple math operation, e.g. sum numbers in arrays A and B into correspondent cells of array S.
The intuitive guess would be that this is exactly type of operation that would benefit form running it in parallel on few threads/processors. Well, it's not and here are results (test was executed on two Intel's core Dell laptop):

As you may see - running it using a straight loop is around four times faster than running it using TBB (selecting manual splitting of 1000 pieces) and around eight times faster (almost order of magnitude!) than running it using TBB and letting it employ automatic splitting heuristic.

Our guess is that processor is perfectly fine-tuned for such types of task (locality of reference, L2 cache, optimistic pre-fetching of commands from pipe etc). If you employ few processors you introduce "coordination overhead" and pay performance price for it.

It seems that TBB would provide performance benefits for tasks of a certain complexity balance - more complicated than described here, but still not "too complicate" so that coordination overhead is not too high...

Here is the code if you want to check it out.


Anonymous said...

This benchmark does not prove anything except that there is an overhead creating tasks/threads. A sum is not the kind of operation you want to parallelize. Its only a few cycles to perform, so it's not worth spawning threads for that.

Alex Pinsker said...

I didn’t had a purpose to prove that TBB is good or bad per se here. I mention that TBB should be useful for ‘tasks’ exceeding some complexity threshold, so our conclusion is the same. Still, it’s counter-intuitive that summing long array using few threads should be less efficient than doing it on a single thread since:
1) Despite sum is just few instructions long – the fact that it’s applied to a big chunk of data should matter as well.
2) Spawning new thread shouldn’t be very heavy – I would expect that some form of thread pool should be used by TBB.

Anonymous said...

You should use intel IPP, there is a function to sum two vectors ;)
optimized with MMX SSE registers

Alex Pinsker said...

My purpose was just make a comparison using some fair example, rather than find solution optimized for this specific task.

Anonymous said...

Same code compiled with g++ version 4.5.2, using tbb version 4.0, and running on athlon dual core gives roughly the same time measurement for all the three ways (with a slight advantage to manual splitting over others). My runtime is around 10ms (much more than yours, btw). We may still consider there is an overhead though, since there is almost no speedup running with 2 threads.