Thread Safety (continued)

Hi Enea and Roberto,
Thank you very much for all of the work you have done on threading with the PPL. I have a couple of questions about the current status of thread safety. A project I am currently working on needs multiple threads working simultaneously on distinct PPL objects. Is that possible with the new thread-safe version?
I made an implementation of my current project where I fork processes instead of using multiple threads. However, if I fork and have two processes manipulating distinct PPL objects, I do not get the factor-of-two speedup I would expect. Is there any way to get that kind of speedup?
Best, John C. Paulson

Hello John.
On 10/07/2016 06:54 PM, John Paulson wrote:
Hi Enea and Roberto,
Thank you very much for all of the work you have done on threading with the PPL. I have a couple of questions about the current status of thread safety. A project I am currently working on needs multiple threads working simultaneously on distinct PPL objects. Is that possible with the new thread-safe version?
Yes, it is possible.
If the PPL objects are distinct, so that there is no concurrent access to the *same* object, then things should not be difficult (e.g., there is no need at all for synchronization). You can have a look at the ppl_lcdd and ppl_lpsol demos, or at the recently added tests in tests/Polyhedron/threadsafe*.cc
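To give the rough idea, here is a minimal sketch of the intended usage pattern (I am writing it off the top of my head, it is not taken from those tests, and the constraints are made up):

#include <ppl.hh>
#include <thread>
#include <vector>

using namespace Parma_Polyhedra_Library;

// Each thread builds and manipulates its own polyhedron: since no PPL
// object is shared between threads, no explicit locking is needed.
void worker() {
  Variable x(0);
  Variable y(1);
  Constraint_System cs;
  cs.insert(x >= 0);
  cs.insert(y >= 0);
  cs.insert(x + y <= 10);
  NNC_Polyhedron ph(cs);
  ph.minimized_constraints();
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(worker);
  for (auto& t : threads)
    t.join();
  return 0;
}

The demos and the tests mentioned above remain the reference for the details, in particular for any per-thread initialization that the thread-safe configuration may require.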
I made an implementation of my current project where I fork processes instead of using multiple threads. However, if I fork and have two processes manipulating distinct PPL objects, I do not get the factor-of-two speedup I would expect. Is there any way to get that kind of speedup?
If you use many processes, then the PPL should work fine "as is" (i.e., with thread-safety off). As for the missing speedup, I doubt it has anything to do with the library: there must be something else going on, but I do not have enough information to guess what it could be.
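Just to be concrete, a fork-based skeleton could look like the following (again only a sketch I am making up, not code taken from the library):

#include <ppl.hh>
#include <sys/wait.h>
#include <unistd.h>

using namespace Parma_Polyhedra_Library;

// After fork() the two processes have separate address spaces, so each
// one works on its own copy of every PPL object and the ordinary
// (non-thread-safe) build of the library is enough.
int main() {
  pid_t pid = fork();
  Variable x(0);
  Variable y(1);
  Constraint_System cs;
  cs.insert(x >= 0);
  cs.insert(y >= (pid == 0 ? 1 : 2));  // parent and child build different objects
  NNC_Polyhedron ph(cs);
  ph.minimized_constraints();
  if (pid != 0)
    waitpid(pid, nullptr, 0);  // the parent waits for the child
  return 0;
}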
Cheers, Enea.
Best, John C. Paulson
PPL-devel mailing list PPL-devel@cs.unipr.it http://www.cs.unipr.it/mailman/listinfo/ppl-devel

Hi Enea,
I have installed the development version of the PPL and configured it with thread-safety on. It seems to work just as you said it would, but I am having trouble getting the expected speedups. To demonstrate the speedup issue, I have included a sample program below. This program creates a user-specified number of threads, and in each thread it intersects two NNC_Polyhedron objects a user-specified number of times. For timing comparisons, I also added a code path to the test program that does not call the PPL but instead computes logarithms.
#include <ppl.hh>
#include "Thread_Pool_defs.hh"
#include <cmath>
#include <cstdlib>
#include <functional>
using namespace Parma_Polyhedra_Library;
namespace Parma_Polyhedra_Library {
  using IO_Operators::operator<<;
}
using namespace std;
void TestIntersections(int RepCount, bool TestPPL) {
  double x = 10.0;
  double b = 0.0;
  // Repeat the chosen workload RepCount times in this thread.
  for (int i = 0; i != RepCount; ++i) {
    if (TestPPL == false) {
      // Non-PPL branch: pure floating-point work (logarithms).
      x += i;
      b += log(x);
    }
    else {
      // PPL branch: build polyhedra from two constraint systems and
      // perform a few operations on them.
      Variable x0(0); Variable x1(1); Variable x2(2); Variable x3(3); Variable x4(4);
      Variable x5(5); Variable x6(6); Variable x7(7); Variable x8(8); Variable x9(9);
      Constraint_System cs1;
      cs1.insert(x8-x9==0); cs1.insert(x2-x9>=0); cs1.insert(x3-x9>=0);
      cs1.insert(x4-x9>=0); cs1.insert(x5-x9>=0); cs1.insert(x1-x9>=0);
      cs1.insert(x6-x9>=0); cs1.insert(x7-x9>=0); cs1.insert(x0-x9>=0);
      NNC_Polyhedron ph1(cs1);
      Constraint_System cs2;
      cs2.insert(x7-x9==0); cs2.insert(x2+x3-x8-x9>=0); cs2.insert(x1+x2-x8-x9>=0);
      cs2.insert(x3+x4-x8-x9>=0); cs2.insert(x0-x8>=0); cs2.insert(x5+x6-x8-x9>=0);
      cs2.insert(x6-x8>=0); cs2.insert(x0+x1-x8-x9>=0); cs2.insert(x4+x5-x8-x9>=0);
      NNC_Polyhedron ph2(cs2);
      NNC_Polyhedron ph3(cs1);
      ph2.add_constraints(ph2.minimized_constraints());
      ph2.minimized_constraints();
      ph2.affine_dimension();
    }
  }
}
int main(int argc, char* argv[]) {
  int TotalProcessCount = atoi(argv[1]);
  int RepCount = atoi(argv[2]);
  bool TestPPL = atoi(argv[3]);
  typedef std::function<void()> work_type;
  Thread_Pool<work_type> thread_pool(TotalProcessCount);
  // Submit one identical work item per thread.
  for (int i = 0; i != TotalProcessCount; ++i) {
    work_type work = std::bind(TestIntersections, RepCount, TestPPL);
    thread_pool.submit(make_threadable(work));
  }
  thread_pool.finalize();
  return 0;
}
This is how I compiled:
g++ -std=c++11 -pthread file_name.cpp -l:libtcmalloc_minimal.so.4.2.6 -lppl -lgmpxx -lgmp
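(For example, with the command line above the binary is a.out, so a run with 40 threads, 10,000 repetitions, and the PPL branch enabled would be ./a.out 40 10000 1; the argument order here is my reading of main above.)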
I tested this on a new machine with 44 cores and hyperthreading (thread::hardware_concurrency() = 88), run with RepCount = 10,000 and TestPPL = true. Here are the timings:
#threads, real time (from time)
1,  0m0.925s
5,  0m1.820s
10, 0m3.041s
20, 0m3.758s
40, 0m6.775s
By way of comparison, here are the timings for RepCount = 50,000,000 and TestPPL = false:
#threads, real time (from time)
1,  0m1.767s
5,  0m1.854s
10, 0m2.012s
20, 0m2.139s
40, 0m2.206s
Assuming sufficient hardware, I would expect 1 thread and 40 threads to take about the same amount of time, though I know that is not entirely realistic. Am I doing something incorrect in the PPL code path that causes it to slow down so much as the number of threads increases? I am not very experienced with parallel C++ programming, so please forgive me if I am doing something foolish. Thanks so much for all of the help.
Best, Jeff
On Sat, Oct 8, 2016 at 4:15 AM, Enea Zaffanella zaffanella@cs.unipr.it wrote:
Hello John.
On 10/07/2016 06:54 PM, John Paulson wrote:
Hi Enea and Roberto,
Thank you very much for all of the work you have done on threading with the PPL. I have a couple of questions about the current status of thread safety. A project I am currently working on needs multiple threads working simultaneously on distinct PPL objects. Is that possible with the new thread-safe version?
Yes, it is possible.
If the PPL objects are distinct, so that there is no concurrent access to the *same* object, then things should not be difficult (e.g., there is no need at all for synchronization). You can have a look at the ppl_lcdd and ppl_lpsol demos, or at the recently added tests in tests/Polyhedron/threadsafe*.cc
I made an implementation of my current project where I fork processes instead of using multiple threads. However, if I fork and have two processes manipulating distinct PPL objects, I do not get the factor-of-two speedup I would expect. Is there any way to get that kind of speedup?
If you use many processes, then the PPL should work fine "as is" (i.e., with thread-safety off). As for the missing speedup, I doubt it has anything to do with the library: there must be something else going on, but I do not have enough information to guess what it could be.
Cheers, Enea.
Best, John C. Paulson
PPL-devel mailing list PPL-devel@cs.unipr.it http://www.cs.unipr.it/mailman/listinfo/ppl-devel

Hello John.
On 10/26/2016 01:30 AM, John Paulson wrote:
Hi Enea,
I have installed the development version of the PPL and configured it with thread-safety on. It seems to work just as you said it would, but I am having trouble getting the expected speedups. To demonstrate the speedup issue, I have included a sample program below. This program creates a user-specified number of threads, and in each thread it intersects two NNC_Polyhedron objects a user-specified number of times. For timing comparisons, I also added a code path to the test program that does not call the PPL but instead computes logarithms.
[...]
I tested this on a new machine with 44 cores and hyperthreading (thread::hardware_concurrency() = 88), run with RepCount = 10,000 and TestPPL = true. Here are the timings:
#threads, real time (from time)
1,  0m0.925s
5,  0m1.820s
10, 0m3.041s
20, 0m3.758s
40, 0m6.775s
First of all, let me warn you that I am NOT an expert in parallel programming, so everything I say should be taken with a grain of salt.
I have no easy access to a machine with hardware coming close to yours, so I just tried your code on my laptop: this has only 4 cores on a single socket, i.e., 8 as hardware_concurrency. I applied a few trivial corrections in a couple of places to make it compile; I also added -O3 to the compilation command, just to be sure.
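For completeness, the compilation command I used should have been essentially yours plus the optimization flag, i.e., something like the following (I am reconstructing it from memory, so treat the exact form as a guess):

g++ -std=c++11 -O3 -pthread paulson.cc -o paulson -lppl -lgmpxx -lgmp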
Here is my "baseline", when using a single worker thread:
# time LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" ./paulson 1 10000 1
real 0m1.174s
user 0m1.169s
Here is the "best" time obtained when using 4 workers:
# time LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" ./paulson 4 10000 1
real 0m1.487s
user 0m5.841s
In this best case, I am obtaining an "efficiency" of ~ 79% (speedup ~ 3.16).
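To make the arithmetic explicit (each worker performs the full RepCount iterations, so the total work grows linearly with the number of workers):

speedup    = (4 * 1.174s) / 1.487s ~ 3.16
efficiency = speedup / 4 = 1.174s / 1.487s ~ 0.79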
Note however that in order to get that timing I had to run the test some 4 or 5 times ... it is quite easy on my laptop to obtain timings in the 2.0-2.5 secs range (i.e., efficiency < 60%, speedup < 2.35). However, as far as I recall, this is the way these speedups are usually computed: people always report their best results, blaming the operating system when the program executes slower than that.
So, in my best case, there is an overhead, but not as big as the one you are reporting (question: were you reporting your best timings?)
I know that testing with 4 threads is quite limited ... but I can't go beyond that without bias: for instance, if I use 8 threads (i.e., hardware concurrency), on each of my 4 cores I will have at least two threads competing for the shared L1 & L2 caches (the L3 cache is anyway shared among all cores). Of course, you already know this, since you were doing your tests with fewer than 44 threads.
Now, the other part of your question:
By way of comparison, here are the timings for RepCount = 50,000,000 and TestPPL = false:
#threads, real time (from time)
1,  0m1.767s
5,  0m1.854s
10, 0m2.012s
20, 0m2.139s
40, 0m2.206s
In this case, I am obtaining 2.09 secs for 1 worker and 2.26 secs for 4 workers (speedup ~ 3.7). Again, this is the best case; worst cases reach ~ 2.6 secs, speedup 3.2. The variation between best and worst case is smaller than in the PPL case. Looking at the code, the big difference between the two tests is that the PPL test is probably performing many allocations from the heap, whereas the non-PPL test seems to be a purely CPU-intensive task.
Of course, this is just wild guessing. I tried to get some more info using the perf tool. I started with "perf stat", comparing the single-worker case with the 4-worker case.
For the PPL test I was obtaining the following:
=================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat ./paulson 1 10000 1
Performance counter stats for './paulson 1 10000 1':
     1167.033565  task-clock (msec)        #    0.998 CPUs utilized
             107  context-switches         #    0.092 K/sec
               3  cpu-migrations           #    0.003 K/sec
             671  page-faults              #    0.575 K/sec
   3,666,223,020  cycles                   #    3.141 GHz
     983,894,620  stalled-cycles-frontend  #   26.84% frontend cycles idle
 <not supported>  stalled-cycles-backend
   7,803,455,611  instructions             #    2.13  insns per cycle
                                           #    0.13  stalled cycles per insn
   1,553,152,858  branches                 # 1330.855 M/sec
       7,428,588  branch-misses            #    0.48% of all branches
1.169603522 seconds time elapsed
=================================================
# LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat ./paulson 4 10000 1
Performance counter stats for './paulson 4 10000 1':
     5851.363612  task-clock (msec)        #    3.944 CPUs utilized
             922  context-switches         #    0.158 K/sec
              27  cpu-migrations           #    0.005 K/sec
             993  page-faults              #    0.170 K/sec
  16,867,156,201  cycles                   #    2.883 GHz
   6,529,193,696  stalled-cycles-frontend  #   38.71% frontend cycles idle
 <not supported>  stalled-cycles-backend
  31,184,317,457  instructions             #    1.85  insns per cycle
                                           #    0.21  stalled cycles per insn
   6,204,550,206  branches                 # 1060.360 M/sec
      29,955,117  branch-misses            #    0.48% of all branches
1.483624741 seconds time elapsed
=================================================
We can see a significant increase (more than 4x) in the number of context switches and CPU migrations, as well as an increase in the percentage of stalled cycles. Another thing to monitor is the number of cache misses:
================================================
### LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat -e L1-dcache-load-misses ./paulson 1 10000 1
Performance counter stats for './paulson 1 10000 1':
14,692,004 L1-dcache-load-misses
1.179272479 seconds time elapsed
================================================
### LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4" perf stat -e L1-dcache-load-misses ./paulson 4 10000 1
Performance counter stats for './paulson 4 10000 1':
192,535,509 L1-dcache-load-misses
1.502333756 seconds time elapsed
================================================
The factor here is ~ 13, much higher than 4.
In contrast, the test not using the PPL, being CPU-intensive and not memory-intensive, scores as follows:
1 worker:  171,187 L1-dcache-load-misses
4 workers: 395,932 L1-dcache-load-misses
As I said at the beginning, I am not an expert, so I may have missed something trivial, making everything I said above "just junk". Anyway, this investigation was interesting. Maybe you will want to perform a similar analysis on your machine: in that case, your feedback is welcome.
Cheers, Enea.