Hi Enea,
I have installed the development version of PPL and configured it with thread safety enabled. It seems to work just as you said it would, but I am not getting the speedups I expected. To demonstrate the issue, I have included a sample program below. The program creates a user-specified number of threads, and each thread intersects two NNC_Polyhedron objects a user-specified number of times. For timing comparisons, the test program also has a code path that does not call PPL but computes logarithms instead.
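#include <cmath>       // log
#include <cstdlib>     // atoi
#include <functional>  // std::function, std::bind
#include <ppl.hh>
#include "Thread_Pool_defs.hh"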
using namespace Parma_Polyhedra_Library;
namespace Parma_Polyhedra_Library {using IO_Operators::operator<<;}
using namespace std;
void TestIntersections(int RepCount, bool TestPPL) {
double x = 10.0;
double b = 0.0;
for (int i = 0; i != RepCount; i++) {
if (TestPPL == false) {
x += i;
b += log(x);
} else {
Variable x0(0);Variable x1(1);Variable x2(2);Variable x3(3);Variable x4(4);
Variable x5(5);Variable x6(6);Variable x7(7);Variable x8(8);Variable x9(9);
Constraint_System cs1;
cs1.insert(x8-x9==0);cs1.insert(x2-x9>=0);cs1.insert(x3-x9>=0);
cs1.insert(x4-x9>=0);cs1.insert(x5-x9>=0);cs1.insert(x1-x9>=0);
cs1.insert(x6-x9>=0);cs1.insert(x7-x9>=0);cs1.insert(x0-x9>=0);
NNC_Polyhedron ph1(cs1);
Constraint_System cs2;
cs2.insert(x7-x9==0);cs2.insert(x2+x3-x8-x9>=0);cs2.insert(x1+x2-x8-x9>=0);
cs2.insert(x3+x4-x8-x9>=0);cs2.insert(x0-x8>=0);cs2.insert(x5+x6-x8-x9>=0);
cs2.insert(x6-x8>=0);cs2.insert(x0+x1-x8-x9>=0);cs2.insert(x4+x5-x8-x9>=0);
NNC_Polyhedron ph2(cs2);
NNC_Polyhedron ph3(cs1);
ph2.add_constraints(ph2.minimized_constraints());
ph2.minimized_constraints();
ph2.affine_dimension();
};
};
}
int main(int argc, char* argv[]) {
if (argc != 4) return 1; // expects: thread count, repetition count, PPL flag (0 or 1)
int TotalProcessCount = atoi(argv[1]);
int RepCount = atoi(argv[2]);
bool TestPPL = atoi(argv[3]);
typedef std::function<void()> work_type;
Thread_Pool<work_type> thread_pool(TotalProcessCount);
for (int i = 0; i != TotalProcessCount; i++) {
work_type work = std::bind(TestIntersections, RepCount, TestPPL);
thread_pool.submit(make_threadable(work));
};
thread_pool.finalize();
return 0;
}
This is how I compiled:
g++ -std=c++11 -pthread file_name.cpp -l:libtcmalloc_minimal.so.4.2.6 -lppl -lgmpxx -lgmp
I tested this on a new machine with 44 cores and hyperthreading (thread::hardware_concurrency() = 88), run with RepCount = 10,000 and TestPPL = true. Here are the timings:
#thread,real time (from time)
1,0m0.925s
5,0m1.820s
10,0m3.041s
20,0m3.758s
40,0m6.775s
By way of comparison, here are the timings for RepCount = 50,000,000 and TestPPL = false:
#thread,real time (from time)
1,0m1.767s
5,0m1.854s
10,0m2.012s
20,0m2.139s
40,0m2.206s
Assuming sufficient hardware, I would expect 40 threads to take about as long as 1 thread, though I know that is not quite realistic. Am I doing something incorrectly in the PPL code branch that causes it to slow down so much as the number of threads increases? I am not very experienced with parallel C++ programming, so please forgive me if I am doing something foolish. Thanks so much for all of the help.
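One way to narrow this down might be to time each worker inside the process instead of relying on `time`, which also counts start-up and thread-pool teardown. The following is a minimal sketch under stated assumptions: it uses plain std::thread rather than Thread_Pool, and time_workers and log_work are hypothetical names (log_work just mirrors the non-PPL branch above). The PPL branch could be passed in as the work item in the same way.

```cpp
#include <chrono>
#include <cmath>
#include <thread>
#include <vector>

// Hypothetical stand-in for the non-PPL branch of TestIntersections.
double log_work(int rep_count) {
  double x = 10.0;
  double b = 0.0;
  for (int i = 0; i != rep_count; ++i) {
    x += i;
    b += std::log(x);
  }
  return b;
}

// Runs `work` once on each of n_threads std::thread workers and returns
// the elapsed wall-clock seconds measured inside each worker.
template <typename Work>
std::vector<double> time_workers(int n_threads, Work work) {
  std::vector<double> seconds(n_threads, 0.0);
  std::vector<std::thread> workers;
  for (int t = 0; t != n_threads; ++t) {
    // Each worker writes only to its own slot, so no locking is needed.
    workers.emplace_back([&seconds, t, work]() {
      const auto start = std::chrono::steady_clock::now();
      work();
      const auto stop = std::chrono::steady_clock::now();
      seconds[t] = std::chrono::duration<double>(stop - start).count();
    });
  }
  for (auto& w : workers) w.join();
  return seconds;
}
```

For example, time_workers(40, [] { volatile double sink = log_work(50000000); (void)sink; }) returns 40 per-thread durations. If every PPL worker slows down by a similar factor as the thread count grows, that would point toward contention on something shared rather than toward scheduling or start-up cost.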
Best,
Jeff