## Wednesday, April 20, 2016

### Getting started with Intel's Threading Building Blocks: installing and compiling an example that uses a std::vector and the Code::Blocks IDE

I hit a problem where it looks like I'll have to parallelize execution. After reading a bit about it, I opted to use Intel's Threading Building Blocks over OpenMP. The consensus seemed to be that Intel TBB is better suited for slotting into C++ projects. TBB also has a bunch of birds on its webpage, which is a plus in my book.

 The most excellent TBB bird logo.

To download and install, I got the .tgz corresponding to the source code at the download page: https://www.threadingbuildingblocks.org/download

After unarchiving and decompressing, I invoked "make" in the directory (this was on an Ubuntu 14.04 OS). This built the debug and release targets in the "build" directory.

You can also invoke "make" in the "examples" directory; it builds and runs a number of fun examples that use TBB.

To set that up to compile with the Code::Blocks IDE, I did the following:

Go to the Project's Build Options -> Search Directories -> Linker and add the debug and release build directories (e.g. tbb44_20160128oss/build/linux_intel64_gcc_cc4.8_libc2.19_kernel3.19.0_debug).

Go to the Project's Build Options -> Search Directories -> Compiler and add the include directory (e.g. tbb44_20160128oss/include).

Go to the Project's Build Options -> Linker settings, and, under "Other linker options" add "-ltbb" (don't include the quotes).

To test the compilation with a simple use case, I modified one of the examples in the TBB parallel_for documentation. Here's the premise:

#include <iostream>
#include <vector>

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Example adapted from: https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_for_func.htm
class AverageWithVectors {
public:
std::vector<float> *input;
std::vector<float> *output;
void operator()( const tbb::blocked_range<int>& range ) const {
for( int i=range.begin(); i!=range.end(); ++i )
(*output)[i] = ( (*input)[i-1] + (*input)[i] + (*input)[i+1]) * (1/3.f);
}
};

int main() {
std::vector<float> inputVector = {30.0, 70.0, 90.0, 120.0, 133.0, 199.0, 245.0, 266.0, 289.0};
std::vector<float> outputVector = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
AverageWithVectors avg;
avg.input = &inputVector;
avg.output = &outputVector;
tbb::parallel_for( tbb::blocked_range<int>( 1, 7 ), avg );
for (uint32_t i = 0; i < (*avg.output).size(); i++) {
std::cout << "output of element " << i << ": " << (*avg.output)[i];
}

return(0);
}


The output should be something like:
output of element 0: 0
output of element 1: 63.3333
output of element 2: 93.3333
output of element 3: 114.333
output of element 4: 150.667
output of element 5: 192.333
output of element 6: 236.667
output of element 7: 266.667
output of element 8: 0

Note that this doesn't actually prove anything is being executed in parallel, only that it compiles and successfully calls tbb::parallel_for().

Finally, I came across a helpful post on the Intel forums from a user named Jim Dempsey. Most of it is reproduced below.

Generally speaking:
Use for (or for_each) when the amount of computation is small (not worth the overhead to parallelize).
Use parallel_for when you have a large number of objects and each object has equal work of small work.
Use parallel_for with appropriate grain size when you have a large number of objects and each object has un-equal work of small work.
Use parallel_for_each when each objects work is relatively large and number of objects is few and each object has un-equal work.
When number of objects is very small and known, consider using switch(nObjects) and cases using parallel_invoke in each case that you wish to impliment and parallel_for_each for the default case