Loading...

gcc@gcc.gnu.org
[Prev] Thread [Next]  [Prev] Date [Next]
Re: Inefficient endofloop value computation  missed optimization somewhere? Andrew Pinski Tue Feb 21 13:00:18 2012
On Tue, Feb 21, 2012 at 1:27 AM, Richard Guenther <[EMAIL PROTECTED]> wrote: > On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <[EMAIL PROTECTED]> wrote: >> Hello, >> >> we've noticed that the loop optimizer sometimes inserts weirdly >> inefficient code to compute the value of an induction variable >> at the end of the loop. >> >> As a test case stripped down to the core of the problem, consider: >> >> int test (int start, int end) >> { >> int i; >> >> for (i = start; i < end; i++) >> ; >> >> return i; >> } >> >> That function really ought to be equivalent to >> >> int test2 (int start, int end) >> { >> return start < end ? end : start; >> } >> >> And indeed on i386linuxgnu, we get mostly identical code >> (except for the branch direction prediction). >> >> However, on armlinuxgnuabi (using mthumb and march=armv7a), >> we see for the first function: >> >> test: >> cmp r0, r1 >> bge .L2 >> adds r3, r0, #1 >> mvns r0, r0 >> adds r1, r1, r0 >> adds r0, r3, r1 >> .L2: >> bx lr >> >> instead of simply (as for the second function): >> >> test2: >> cmp r1, r0 >> it ge >> movge r0, r1 >> bx lr >> >> >> >> Where does that weird additionandnegation sequence come from? >> In 100t.dceloop we still have: >> >> <bb 4>: >> # i_9 = PHI <i_5(5), start_2(D)(3)> >> i_5 = i_9 + 1; >> if (end_4(D) > i_5) >> goto <bb 5>; >> else >> goto <bb 6>; >> >> <bb 5>: >> goto <bb 4>; >> >> <bb 6>: >> # i_1 = PHI <i_5(4)> >> >> >> But 102t.sccp has: >> >> <bb 4>: >> # i_9 = PHI <i_5(5), start_2(D)(3)> >> i_5 = i_9 + 1; >> if (end_4(D) > i_5) >> goto <bb 5>; >> else >> goto <bb 6>; >> >> <bb 5>: >> goto <bb 4>; >> >> <bb 6>: >> D.4123_10 = start_2(D) + 1; >> D.4124_11 = ~start_2(D); >> D.4125_12 = (unsigned int) D.4124_11; >> D.4126_13 = (unsigned int) end_4(D); >> D.4127_14 = D.4125_12 + D.4126_13; >> D.4128_15 = (int) D.4127_14; >> i_1 = D.4123_10 + D.4128_15; >> >> This is the sequence that ends up in the final assembler code. >> >> Note that this computes: >> >> i_1 = (start_2(D) + 1) >> + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D)) >> >> which is (since those casts are noops on the assembler level): >> >> i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D) >> >> which is due to two'scomplement arithmetic: >> >> i_1 = start_2(D)  start_2(D) + end_4(D) >> >> that is simply: >> >> i_1 = end_4(D) >> >> >> Unfortunately, no treebased optimization pass after 102t.sccp is able to >> clean up that mess. This is true even on *i386*: the only reason we get >> nice assembler there is due to *RTX* optimization, notably in combine. >> This hits on i386 because of an intermediate stage that adds two registers >> and the constant 1 (which matches the lea pattern). On arm, there is no >> instruction that would handle that intermediate stage, and combine is not >> powerful enough to perform the whole simplification in a single step. >> >> >> Now that sequence of gimple operations is generated in scev_const_prop >> in treescalarevolution.c. First, the phi node corresponding to the >> loop exit value is evaluated: >> >> def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL); >> >> which returns a chrec of the form >> >> { 1, +, (start + 1) } >> >> This is then further simplified by >> >> def = compute_overall_effect_of_inner_loop (ex_loop, def); >> >> which first computes the number of loop iterations: >> >> tree nb_iter = number_of_latch_executions (loop); >> >> which returns a tree representing >> >> (unsigned int) ~start + (unsigned int) end >> >> (which at this point seems to be the validly most efficient way to >> compute end  start  1). >> >> The chrec is then evaluated via chrec_apply which basically computes >> >> (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end) >> >> which ends up being converted to the long gimple sequence seen above. >> >> >> Now I guess my questions are: >> >>  In the computation shown above, should that tree have been folded >> into just "end" to start with? The tree is constructed at its > > The issue is that (start + 1) + 1 * (int) ... is computed using signed > integer arithmetic which may invoke undefined behavior when it wraps. > Thus we cannot reassociate it. I suppose if you'd arrange the code > to do the arithmetic in unsigned and only cast the result back to the > desired type we might fold it ... > >> outermost level in chrec_fold_plus, which simply calls fold_build2. >> While simplifyrtx.c has code to detect sequences of operations >> that cancel out while folding a PLUS or MINUS RTX, there does >> not appear to be corresponding code in foldconst.c to handle >> the equivalent optimization at the tree level. > > There are. foldconst.c has some reassociation logic to catch this > and we have treessareassoc.c. All of those only handle samesign > operands though I think. > >>  If not, should there be another tree pass later on that ought to >> clean this up? I notice there is code to handle plus/minus >> sequences in treessareassoc.c. But this doesn't seem to be >> able to handle this particular case, possibly because the presence >> of the casts (nop_exprs) ... >> >>  Anywhere else I'm overlooking right now? >> >> >> I'd appreciate any tips / suggestions how to fix this ... > > Look at doing the computation in unsigned arithmetic. I have some patches which improve this but it is more of a cleanup and it is not fully done yet. Right now we catch the case in gfortran.dg/reassoc_6.f which has the above pattern but only in VPR and only during the simplifing phase, I am working on improving that to get to the point where we can recognize this while getting the range in VPR. Thanks, Andrew Pinski
 Inefficient endofloop value computation  missed optimization somewhere? Ulrich Weigand 2012/02/20
 Re: Inefficient endofloop value computation  missed optimization somewhere? Richard Guenther 2012/02/21
 Re: Inefficient endofloop value computation  missed optimization somewhere? Andrew Pinski 2012/02/21 <=
 [WIP PATCH] Re: Inefficient endofloop value computation  missed optimization somewhere? Ulrich Weigand 2012/02/28
 Re: [WIP PATCH] Re: Inefficient endofloop value computation  missed optimization somewhere? Richard Guenther 2012/02/28