A colleague of mine, who works on optimizing quantitative calculations and has been playing around with Intel's Threading Building Blocks (TBB), shared some interesting performance results with me.
Let's say you are trying to do a simple math operation, e.g. summing the numbers in arrays A and B into the corresponding cells of array S.
The intuitive guess would be that this is exactly the type of operation that would benefit from running in parallel on a few threads/processors. Well, it's not, and here are the results (the test was executed on a dual-core Intel Dell laptop):
As you can see, running it as a straight loop is around four times faster than running it with TBB while manually splitting the range into 1000 pieces, and around eight times faster (almost an order of magnitude!) than running it with TBB and letting it use its automatic splitting heuristic.
Our guess is that the processor is perfectly fine-tuned for this type of task (locality of reference, the L2 cache, optimistic prefetching of instructions into the pipeline, etc.). The moment you employ a few processors, you introduce coordination overhead and pay a performance price for it.
It seems that TBB would provide performance benefits for tasks that strike a certain complexity balance: complicated enough (more so than the one described here) that the useful work per chunk outweighs the coordination overhead, but still simple enough to split cleanly into independent pieces.
Here is the code if you want to check it out.