Re: [PPL-devel] mpz_addmul_ui with ui = 1

Jim White wrote:
I have a loop that spends 99+% of its time executing these two statements:
mpz_addmul_ui(BB, B, a); mpz_swap (B, BB);
To my surprise, it runs 25% faster if I replace it with:
if (a==1) mpz_add (BB, B, BB); else mpz_addmul_ui(BB, B, a); mpz_swap(B, BB);
It's an RCF (regular continued-fraction) algorithm, so a=1 about 41.5% of the time.
No big deal, but perhaps in future revs mpz_addmul_ui and its sister function might auto-detect this case and reroute the call?
In the Parma Polyhedra Library, we found out that, for mpz_mul(), checking the source arguments against 0 gives rise to significant speedups in some algorithms.
We thought (but we are not sure) that this is indeed a feature, and not a limitation: since only the client code can know about the frequency of special cases, let the client code make the decision. The alternative would be for GMP to do a lot of possibly useless checks, perhaps duplicating work that has already been done by the client code.
I am glad you raised the issue here and I hope Torbjorn will correct me if I misinterpreted GMP's policies in this respect. All the best,
Roberto
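(As an aside, here is a minimal sketch of that kind of client-side check in plain C; the helper name and variables are purely illustrative, and mpz_sgn is the standard GMP sign test.)

#include <gmp.h>

/* r = a * b, with a caller-side test against 0 on one operand.
   Whether the test pays off depends entirely on how often b is 0. */
static void mul_checked(mpz_t r, const mpz_t a, const mpz_t b)
{
  if (mpz_sgn(b) == 0)
    mpz_set_ui(r, 0);   /* the result is known without calling mpz_mul */
  else
    mpz_mul(r, a, b);
}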

On 2009-01-28 13:55:06 +0100, Roberto Bagnara wrote:
I am glad you raised the issue here and I hope Torbjorn will correct me if I misinterpreted GMP's policies in this respect.
Is GMP's policy documented somewhere? If not, I think this should be documented, so that the user does not have to do useless checks if they are done by GMP.

Roberto Bagnara bagnara@cs.unipr.it writes:
We thought (but we are not sure) that this is indeed a feature, and not a limitation: since only the client code can know about the frequency of special cases, let the client code make the decision. The alternative would be for GMP to do a lot of possibly useless checks, perhaps duplicating work that has already been done by the client code.
It is indeed very hard for GMP to know, and mpz_addmul_ui could perhaps also check for the arguments 16, 1023, and whatnot. Countless special values could give a speedup when actually used, but testing for such values would slow most GMP usages down, and probably introduce hard-to-test-for bugs.
Adding tests for special cases to application code is also not a perfect solution, since what values to test for varies from GMP version to GMP version and from hardware to hardware.
Vincent Lefevre vincent@vinc17.org writes:
Is GMP's policy documented somewhere? If not, I think this should be documented, so that the user does not have to do useless checks if they are done by GMP.
I have meant to write a "GMP cookbook" with advice like this.

Dear Roberto,
Date: Wed, 28 Jan 2009 13:55:06 +0100 From: Roberto Bagnara bagnara@cs.unipr.it
Jim White wrote:
I have a loop that spends 99+% of its time executing these two statements:
mpz_addmul_ui(BB, B, a); mpz_swap (B, BB);
To my surprise, it runs 25% faster if I replace it with:
if (a==1) mpz_add (BB, B, BB); else mpz_addmul_ui(BB, B, a); mpz_swap(B, BB);
It's an RCF (regular continued-fraction) algorithm, so a=1 about 41.5% of the time.
No big deal, but perhaps in future revs mpz_addmul_ui and its sister function might auto-detect this case and reroute the call?
In the Parma Polyhedra Library, we found out that, for mpz_mul(), checking the source arguments against 0 gives rise to significant speedups in some algorithms.
this surprises me, since 0 has a special representation in mpz and is thus detected early in mpz_add and mpz_mul. Can you please give an example?
For an unsigned int argument (the 3rd argument 'a' above in mpz_addmul_ui) there is no special treatment for 0 (or 1) if I remember correctly, so it is indeed the responsibility of the user to check that case.
Paul Zimmermann

On Thu, 29 Jan 2009, Paul Zimmermann wrote:
In the Parma Polyhedra Library, we found out that, for mpz_mul(), checking the source arguments against 0 gives rise to significant speedups in some algorithms.
this surprises me, since 0 has a special representation in mpz and is thus detected early in mpz_add and mpz_mul. Can you please give an example?
I had the same reaction, but it is not absurd. The difference between a check for 0 inline or in a function is already about a factor of 4. And then you still get an extra factor because mpz_mul does a couple of things before the actual check for 0. However, I have a very hard time imagining an application where you multiply by an mpz_t representing 0 often enough for this to be noticeable.
For an unsigned int argument (the 3rd argument 'a' above in mpz_addmul_ui) there is no special treatment for 0 (or 1) if I remember correctly, so it is indeed the responsibility of the user to check that case.
It does test for 0, unless I misread the code. I don't think it tests for 1, though. It could make sense to add compile-time tests (__builtin_constant_p), especially for C++, but they would not help here, and obviously adding runtime checks makes the code slower in the general case, and then you never know where to stop (detect 1? detect powers of 2?).
(Not directly related) At some point I was wondering whether the special representation of 0 (size==0) should be extended to represent instead small numbers that can fit in an integer at most as long as _mp_d. Hard to tell.
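For what the compile-time idea could look like, here is a sketch only: the macro name is made up, __builtin_constant_p is a GCC extension rather than GMP API, and, as noted above, it would not help when the value is only known at run time.

#include <gmp.h>

/* Hypothetical wrapper: when the multiplier is a compile-time constant 1,
   turn rop += op1 * 1 into a plain addition; otherwise call mpz_addmul_ui.
   The branch is resolved at compile time when k is a literal constant. */
#define ADDMUL_UI(rop, op1, k)                  \
  do {                                          \
    if (__builtin_constant_p(k) && (k) == 1)    \
      mpz_add((rop), (rop), (op1));             \
    else                                        \
      mpz_addmul_ui((rop), (op1), (k));         \
  } while (0)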

Marc,
(Not directly related) At some point I was wondering whether the special representation of 0 (size==0) should be extended to represent instead small numbers that can fit in an integer at most as long as _mp_d. Hard to tell.
an alternate trick is to store in the mpz_t *pointer* a value of the form p=2k+1, where k is a small integer. Since pointers are usually even, you can easily distinguish between small integers and GMP integers, and you obtain the value k = p >> 1. It is also more memory-efficient since you don't allocate the mpz_t structure. A famous computer algebra system is using that trick together with GMP.
It might make sense to have another layer on top of mpz with that trick.
Paul Zimmermann
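A minimal sketch of the encoding described above, assuming that pointers to mpz_t are at least 2-byte aligned (all names here are illustrative and taken from no existing system):

#include <stdint.h>
#include <gmp.h>

/* Odd values encode a small integer k as 2*k+1; even values are ordinary
   pointers to an mpz_t.  Decoding negative values assumes an arithmetic
   right shift on intptr_t. */
typedef uintptr_t tagged_int;

static int        is_small(tagged_int t)    { return (t & 1) != 0; }
static long       small_value(tagged_int t) { return (long) ((intptr_t) t >> 1); }
static tagged_int make_small(long k)        { return ((uintptr_t) k << 1) | 1; }
static mpz_ptr    big_value(tagged_int t)   { return (mpz_ptr) t; }

/* Example dispatch: r = a + b (overflow of the small+small case is ignored
   in this sketch). */
static void tagged_add(mpz_t r, tagged_int a, tagged_int b)
{
  if (is_small(a) && is_small(b)) {
    mpz_set_si(r, small_value(a) + small_value(b));
  } else if (is_small(a)) {
    mpz_set_si(r, small_value(a));
    mpz_add(r, r, big_value(b));
  } else if (is_small(b)) {
    mpz_set_si(r, small_value(b));
    mpz_add(r, r, big_value(a));
  } else {
    mpz_add(r, big_value(a), big_value(b));
  }
}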

On 2009-01-30 18:41:51 +0100, Paul Zimmermann wrote:
an alternate trick is to store in the mpz_t *pointer* a value of the form p=2k+1, where k is a small integer. Since pointers are usually even, you can easily distinguish between small integers and GMP integers, and you obtain the value k = p >> 1. It is also more memory-efficient since you don't allocate the mpz_t structure. A famous computer algebra system is using that trick together with GMP.
Conversion between pointers and integers is implementation-defined. You need to make sure that pointers to integers are really even. On the ARM in 26-bit mode, the 2 least significant bits of the PC have a special meaning, and I'm not sure that a pointer to an integer is necessarily even (it may be difficult to know what happens after optimization).

On Fri, 30 Jan 2009, Paul Zimmermann wrote:
At some point I was wondering whether the special representation of 0 (size==0) should be extended to represent instead small numbers that can fit in an integer at most as long as _mp_d. Hard to tell.
an alternate trick is to store in the mpz_t *pointer* a value of the form p=2k+1, where k is a small integer. Since pointers are usually even, you can easily distinguish between small integers and GMP integers, and you obtain the value k = p >> 1. It is also more memory-efficient since you don't allocate the mpz_t structure. A famous computer algebra system is using that trick together with GMP.
Indeed. Both approaches have the advantage that you don't need to call malloc for small integers (saving time and memory), and that you may be able to inline the small_int case of some operations.
The version I mentioned has the nice property that, except for a small increase in code size, it should not slow down anything except operations involving 0 (precisely the case the original poster wanted to speed up ;-). However, I just noticed that it requires either deallocating memory when you set an existing number to 0 (which is wrong), or extending the meaning of size 1 to include the number 0 (which causes its own trouble), or shifting the values of _mp_size to represent size+1 (probably the best option). (There are always more ways to do things, for instance always allocating even sizes and using an odd _mp_alloc to mean that _mp_size is actually a value, etc.) Ok, I am dropping it.
The approach you mention uses even less memory for small integers (a factor of 2 or 3) and can be implemented on top of mpz_t. The drawbacks are fairly small: it is not completely standard code, small integers have one bit less, you need a small manipulation to get the true value of the integer (>> 1 does not place the sign bit at the right place, but that doesn't make those numbers much harder to handle), there are more indirections, and you need to malloc (or probably use a special allocator for this) the mpz_t. The last two problems should mostly be noticeable if you work a lot with numbers that are just a few limbs long. Besides, this is successfully used by two pieces of software (assuming that the "famous algebra system" is not CLN).
Ok, I am convinced, thank you for mentioning this,

Paul Zimmermann wrote:
From: Roberto Bagnara bagnara@cs.unipr.it In the Parma Polyhedra Library, we found out that, for mpz_mul(), checking the source arguments against 0 gives rise to significant speedups in some algorithms.
this surprises me, since 0 has a special representation in mpz and is thus detected early in mpz_add and mpz_mul. Can you please give an example?
Hi Paul,
sorry for the delay: I was busy and I wanted to check my facts again before replying. The example is in
http://www.cs.unipr.it/cgi-bin/cvsweb.cgi/~checkout~/ppl/src/MIP_Problem.cc?...
There you will find the two following fragments:
// FIXME(0.10.1): the test seems to speed up the GMP computation.
if (tableau_ij != 0) {
  scalar_value = tableau_ij * norm_factor[i];
  add_mul_assign(challenger_den, scalar_value, scalar_value);
}
and
#if 1
// CHECKME: the test seems to speed up the GMP computation.
const Coefficient& y_i = y[i];
if (y_i != 0)
  sub_mul_assign(x_i, y_i, normalized_x_k);
#else
sub_mul_assign(x_i, y[i], normalized_x_k);
#endif // 1
This happens in our implementation of the simplex method with unbounded arithmetic. As an example, if we leave the code as it is and solve the problem boeing1.mps (http://www.cs.unipr.it/cgi-bin/cvsweb.cgi/ppl/demos/ppl_lpsol/examples/boein...), the CPU time is 7.95 seconds; if we remove the tests against zero, the CPU time increases to 11.88 seconds. Timings were taken on a machine equipped with an Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz, running Fedora 10, gmp-4.2.2-8.fc10.x86_64, GCC 4.3.2. The code for add_mul_assign() and sub_mul_assign() is below, where GMP_Integer is an alias for mpz_class:
inline void
add_mul_assign(GMP_Integer& x, const GMP_Integer& y, const GMP_Integer& z) {
  mpz_addmul(x.get_mpz_t(), y.get_mpz_t(), z.get_mpz_t());
}

inline void
sub_mul_assign(GMP_Integer& x, const GMP_Integer& y, const GMP_Integer& z) {
  mpz_submul(x.get_mpz_t(), y.get_mpz_t(), z.get_mpz_t());
}
If it can help, I am willing to perform the same test on different platforms, possibly with different versions of GMP and/or GCC. All the best,
Roberto

On 2009-02-02 22:30:03 +0100, Roberto Bagnara wrote:
#if 1 // CHECKME: the test seems to speed up the GMP computation. const Coefficient& y_i = y[i]; if (y_i != 0) sub_mul_assign(x_i, y_i, normalized_x_k); #else sub_mul_assign(x_i, y[i], normalized_x_k); #endif // 1
Couldn't this be due to C++ overhead? Have you tried in C?

On Mon, Feb 2, 2009 at 6:08 PM, Vincent Lefevre vincent@vinc17.org wrote:
On 2009-02-02 22:30:03 +0100, Roberto Bagnara wrote:
#if 1 // CHECKME: the test seems to speed up the GMP computation. const Coefficient& y_i = y[i]; if (y_i != 0) sub_mul_assign(x_i, y_i, normalized_x_k); #else sub_mul_assign(x_i, y[i], normalized_x_k); #endif // 1
Couldn't this be due to C++ overhead? Have you tried in C?
That would be surprising.
-- Gaby

On 2009-02-02 18:37:08 -0600, Gabriel Dos Reis wrote:
On Mon, Feb 2, 2009 at 6:08 PM, Vincent Lefevre vincent@vinc17.org wrote:
On 2009-02-02 22:30:03 +0100, Roberto Bagnara wrote:
#if 1 // CHECKME: the test seems to speed up the GMP computation. const Coefficient& y_i = y[i]; if (y_i != 0) sub_mul_assign(x_i, y_i, normalized_x_k); #else sub_mul_assign(x_i, y[i], normalized_x_k); #endif // 1
Couldn't this be due to C++ overhead? Have you tried in C?
That would be surprising.
Yes, but can anyone guess any non-surprising reason? :)
BTW, Roberto, are you sure that the speed-up is due to the test itself and not an arbitrary side effect because the code has just changed?

Dear Roberto,
// FIXME(0.10.1): the test seems to speed up the GMP computation. if (tableau_ij != 0) { scalar_value = tableau_ij * norm_factor[i]; add_mul_assign(challenger_den, scalar_value, scalar_value); }
inline void add_mul_assign(GMP_Integer& x, const GMP_Integer& y, const GMP_Integer& z) { mpz_addmul(x.get_mpz_t(), y.get_mpz_t(), z.get_mpz_t()); }
I'm ignorant about C++, but maybe the speedup you observe is due to the scalar_value = tableau_ij * norm_factor[i] instruction which is no longer executed (what types do scalar_value, tableau_ij and norm_factor[i] have?), or maybe to some init/clear/conversion due to that instruction, or even to the xxx.get_mpz_t() calls.
Indeed, the code of mpz_aorsmul (which implements mpz_addmul) starts with:
mpz_aorsmul (mpz_ptr w, mpz_srcptr x, mpz_srcptr y, mp_size_t sub)
{
  ...
  xsize = SIZ(x);
  ysize = SIZ(y);
  if (xsize == 0 || ysize == 0)
    return;
My advice is to convert this part of the code to plain C that calls the GMP routines directly, and/or to move scalar_value = tableau_ij * norm_factor[i] before the test.
Paul Zimmermann
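What the plain-C variant of the guarded call might look like (a sketch; the names simply mirror the PPL fragment quoted above):

#include <gmp.h>

/* x_i -= y_i * normalized_x_k, skipping the library call when y_i == 0,
   i.e. the guarded sub_mul_assign written directly against the C API. */
static void guarded_submul(mpz_t x_i, const mpz_t y_i, const mpz_t normalized_x_k)
{
  if (mpz_sgn(y_i) != 0)
    mpz_submul(x_i, y_i, normalized_x_k);
}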

On 2009-02-03 08:06:44 +0100, Paul Zimmermann wrote:
Dear Roberto,
// FIXME(0.10.1): the test seems to speed up the GMP computation. if (tableau_ij != 0) { scalar_value = tableau_ij * norm_factor[i]; add_mul_assign(challenger_den, scalar_value, scalar_value); }
inline void add_mul_assign(GMP_Integer& x, const GMP_Integer& y, const GMP_Integer& z) { mpz_addmul(x.get_mpz_t(), y.get_mpz_t(), z.get_mpz_t()); }
I'm ignorant about C++, but maybe the speedup you observe is due to the scalar_value = tableau_ij * norm_factor[i] instruction which is no longer executed (what types do scalar_value, tableau_ij and norm_factor[i] have?),
But for the other test, there is no such assignment.

Paul Zimmermann wrote:
Dear Roberto,
// FIXME(0.10.1): the test seems to speed up the GMP computation. if (tableau_ij != 0) { scalar_value = tableau_ij * norm_factor[i]; add_mul_assign(challenger_den, scalar_value, scalar_value); }
inline void add_mul_assign(GMP_Integer& x, const GMP_Integer& y, const GMP_Integer& z) { mpz_addmul(x.get_mpz_t(), y.get_mpz_t(), z.get_mpz_t()); }
I'm ignorant about C++, but maybe the speedup you observe is due to the scalar_value = tableau_ij * norm_factor[i] instruction which is no longer executed (what types do scalar_value, tableau_ij and norm_factor[i] have?), or maybe to some init/clear/conversion due to that instruction, or even to the xxx.get_mpz_t() calls.
Indeed, the code of mpz_aorsmul (which implements mpz_addmul) starts with:
mpz_aorsmul (mpz_ptr w, mpz_srcptr x, mpz_srcptr y, mp_size_t sub)
{
  ...
  xsize = SIZ(x);
  ysize = SIZ(y);
  if (xsize == 0 || ysize == 0)
    return;
My advice is to convert this part of the code to plain C that calls the GMP routines directly, and/or to move scalar_value = tableau_ij * norm_factor[i] before the test.
I've just checked: running
ppl_lpsol -s -p1 -oobtained boeing1.mps
with those two checks in place avoids 110958265 calls to mpz_addmul (94.3%) and 50079219 calls to mpz_submul (88.6%).
With another small test program I've verified how much CPU time is needed to call mpz_addmul 110958265 times and mpz_submul 50079219 times with factors set to 0, and the result is similar to the benefit seen in ppl_lpsol.
This shows me that inlining some trivial tests can give good results when the test is true very often.
I see some alternative policies:
1) the GMP library does not check for any special values unless this is needed for correctness, and the user code is responsible for adding the trivial tests that increase efficiency for a specific data set
2) the GMP library moves the trivial checks into gmp.h to permit their inlining. Note that this approach does not conflict with 1: it would be feasible to have mpz_addmul defined in gmp.h and mpz_addmul_raw defined in the GMP library, and then the user code could call the version more suitable for its needs; the important part is that the library version does not check for special values (a sketch of this option follows below).
3) (current situation) the GMP library does some special value checks and user code may decide to inline the very same checks. This makes the 'else' path heavier than needed.
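A sketch of what option 2 could look like; it is entirely hypothetical: mpz_addmul_raw does not exist in GMP, so a stand-in that simply calls mpz_addmul is used here to keep the sketch self-contained.

#include <gmp.h>

/* Stand-in for the hypothetical unchecked library entry point. */
static void mpz_addmul_raw(mpz_ptr w, mpz_srcptr u, mpz_srcptr v)
{
  mpz_addmul(w, u, v);
}

/* The wrapper that would live in gmp.h under option 2: the trivial test is
   visible to the compiler and can be inlined at every call site. */
static inline void mpz_addmul_inline(mpz_ptr w, mpz_srcptr u, mpz_srcptr v)
{
  if (mpz_sgn(u) == 0 || mpz_sgn(v) == 0)
    return;                 /* nothing to add: w is left unchanged */
  mpz_addmul_raw(w, u, v);
}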

On Wed, 4 Feb 2009, Abramo Bagnara wrote:
with those two checks in place avoids 110958265 calls to mpz_addmul (94.3%) and 50079219 calls to mpz_submul (88.6%).
Indeed those percentages are the interesting data. For such a distribution it is obviously more efficient to inline the frequent trivial case.
This shows me that inlining some trivial tests can give good results when the test is true very often.
Ideally, this would be the compiler's job. It would need:
- link-time optimization, to have access to the GMP code when compiling your program (several compilers can do this, and maybe gcc-4.5);
- profile feedback, to know about this distribution;
- partial inlining or arbitrary de-inlining.
The last point seems like the hardest to find in a compiler, and writing the function as:
mpz_mul (...)
{
  if (test for 0) { trivial case }
  else { call mpz_mul_nonzero }
}

mpz_mul_nonzero (...)
{
  current mpz_mul code, minus the check for 0
}
may help the compiler a lot.
- the GMP library does not check for any special values unless this is needed for correctness, and the user code is responsible for adding the trivial tests that increase efficiency for a specific data set
- the GMP library moves the trivial checks into gmp.h to permit their inlining. Note that this approach does not conflict with 1: it would be feasible to have mpz_addmul defined in gmp.h and mpz_addmul_raw defined in the GMP library, and then the user code could call the version more suitable for its needs; the important part is that the library version does not check for special values.
Indeed, with this approach the compiler does not even need LTO or profiling. However, it makes the code a bit heavier by forcing those inlines for every operation, even for code where the numbers are never 0. I don't know how much that matters.
- (current situation) the GMP library does some special value checks and user code may decide to inline the very same checks. This makes the 'else' path heavier than needed.
I believe that the check for 0 in mpz_mul actually falls into your first category: it is necessary for correctness (however, mpz_mul also checks whether one of the arguments fits in a single limb, which fits your description better). And the 'else' path is supposedly already heavy enough that the penalty should not be too bad. Did you see a significant performance gain using 2) instead of 3)?
(I am not a gmp developer, just another user, so feel free to ignore my questions if you think the answers won't help convince the real developers)

Marc Glisse wrote:
I believe that the check for 0 in mpz_mul actually falls into your first category: it is necessary for correctness (however, mpz_mul also checks whether one of the arguments fits in a single limb, which fits your description better). And the 'else' path is supposedly already heavy enough that the penalty should not be too bad. Did you see a significant performance gain using 2) instead of 3)?
It depends on the data set and on the optimizations done inside the library: suppose we enter mpz_mul 99% of the time with both operands below 1<<(sizeof(mp_limb_t)*4) and this is treated as a special case; then I think that avoiding two extra comparisons can make a difference.
My point is that to maximize performance the unavoidable path should be the least computationally expensive.
participants (7)
- Abramo Bagnara
- Gabriel Dos Reis
- Marc Glisse
- Paul Zimmermann
- Roberto Bagnara
- Torbjorn Granlund
- Vincent Lefevre