|
Loading...
|
gcc@gcc.gnu.org
[Prev] Thread [Next] | [Prev] Date [Next]
Re: Inefficient end-of-loop value computation - missed optimization somewhere? Andrew Pinski Tue Feb 21 13:00:18 2012
On Tue, Feb 21, 2012 at 1:27 AM, Richard Guenther
<[EMAIL PROTECTED]> wrote:
> On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> we've noticed that the loop optimizer sometimes inserts weirdly
>> inefficient code to compute the value of an induction variable
>> at the end of the loop.
>>
>> As a test case stripped down to the core of the problem, consider:
>>
>> int test (int start, int end)
>> {
>> int i;
>>
>> for (i = start; i < end; i++)
>> ;
>>
>> return i;
>> }
>>
>> That function really ought to be equivalent to
>>
>> int test2 (int start, int end)
>> {
>> return start < end ? end : start;
>> }
>>
>> And indeed on i386-linux-gnu, we get mostly identical code
>> (except for the branch direction prediction).
>>
>> However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a),
>> we see for the first function:
>>
>> test:
>> cmp r0, r1
>> bge .L2
>> adds r3, r0, #1
>> mvns r0, r0
>> adds r1, r1, r0
>> adds r0, r3, r1
>> .L2:
>> bx lr
>>
>> instead of simply (as for the second function):
>>
>> test2:
>> cmp r1, r0
>> it ge
>> movge r0, r1
>> bx lr
>>
>>
>>
>> Where does that weird addition-and-negation sequence come from?
>> In 100t.dceloop we still have:
>>
>> <bb 4>:
>> # i_9 = PHI <i_5(5), start_2(D)(3)>
>> i_5 = i_9 + 1;
>> if (end_4(D) > i_5)
>> goto <bb 5>;
>> else
>> goto <bb 6>;
>>
>> <bb 5>:
>> goto <bb 4>;
>>
>> <bb 6>:
>> # i_1 = PHI <i_5(4)>
>>
>>
>> But 102t.sccp has:
>>
>> <bb 4>:
>> # i_9 = PHI <i_5(5), start_2(D)(3)>
>> i_5 = i_9 + 1;
>> if (end_4(D) > i_5)
>> goto <bb 5>;
>> else
>> goto <bb 6>;
>>
>> <bb 5>:
>> goto <bb 4>;
>>
>> <bb 6>:
>> D.4123_10 = start_2(D) + 1;
>> D.4124_11 = ~start_2(D);
>> D.4125_12 = (unsigned int) D.4124_11;
>> D.4126_13 = (unsigned int) end_4(D);
>> D.4127_14 = D.4125_12 + D.4126_13;
>> D.4128_15 = (int) D.4127_14;
>> i_1 = D.4123_10 + D.4128_15;
>>
>> This is the sequence that ends up in the final assembler code.
>>
>> Note that this computes:
>>
>> i_1 = (start_2(D) + 1)
>> + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
>>
>> which is (since those casts are no-ops on the assembler level):
>>
>> i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
>>
>> which is due to two's-complement arithmetic:
>>
>> i_1 = start_2(D) - start_2(D) + end_4(D)
>>
>> that is simply:
>>
>> i_1 = end_4(D)
>>
>>
>> Unfortunately, no tree-based optimization pass after 102t.sccp is able to
>> clean up that mess. This is true even on *i386*: the only reason we get
>> nice assembler there is due to *RTX* optimization, notably in combine.
>> This hits on i386 because of an intermediate stage that adds two registers
>> and the constant 1 (which matches the lea pattern). On arm, there is no
>> instruction that would handle that intermediate stage, and combine is not
>> powerful enough to perform the whole simplification in a single step.
>>
>>
>> Now that sequence of gimple operations is generated in scev_const_prop
>> in tree-scalar-evolution.c. First, the phi node corresponding to the
>> loop exit value is evaluated:
>>
>> def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
>>
>> which returns a chrec of the form
>>
>> { 1, +, (start + 1) }
>>
>> This is then further simplified by
>>
>> def = compute_overall_effect_of_inner_loop (ex_loop, def);
>>
>> which first computes the number of loop iterations:
>>
>> tree nb_iter = number_of_latch_executions (loop);
>>
>> which returns a tree representing
>>
>> (unsigned int) ~start + (unsigned int) end
>>
>> (which at this point seems to be the validly most efficient way to
>> compute end - start - 1).
>>
>> The chrec is then evaluated via chrec_apply which basically computes
>>
>> (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
>>
>> which ends up being converted to the long gimple sequence seen above.
>>
>>
>> Now I guess my questions are:
>>
>> - In the computation shown above, should that tree have been folded
>> into just "end" to start with? The tree is constructed at its
>
> The issue is that (start + 1) + 1 * (int) ... is computed using signed
> integer arithmetic which may invoke undefined behavior when it wraps.
> Thus we cannot re-associate it. I suppose if you'd arrange the code
> to do the arithmetic in unsigned and only cast the result back to the
> desired type we might fold it ...
>
>> outermost level in chrec_fold_plus, which simply calls fold_build2.
>> While simplify-rtx.c has code to detect sequences of operations
>> that cancel out while folding a PLUS or MINUS RTX, there does
>> not appear to be corresponding code in fold-const.c to handle
>> the equivalent optimization at the tree level.
>
> There are. fold-const.c has some re-association logic to catch this
> and we have tree-ssa-reassoc.c. All of those only handle same-sign
> operands though I think.
>
>> - If not, should there be another tree pass later on that ought to
>> clean this up? I notice there is code to handle plus/minus
>> sequences in tree-ssa-reassoc.c. But this doesn't seem to be
>> able to handle this particular case, possibly because the presence
>> of the casts (nop_exprs) ...
>>
>> - Anywhere else I'm overlooking right now?
>>
>>
>> I'd appreciate any tips / suggestions how to fix this ...
>
> Look at doing the computation in unsigned arithmetic.
I have some patches which improve this but it is more of a cleanup and
it is not fully done yet.
Right now we catch the case in gfortran.dg/reassoc_6.f which has the
above pattern but only in VPR and only during the simplifing phase, I
am working on improving that to get to the point where we can
recognize this while getting the range in VPR.
Thanks,
Andrew Pinski
- Inefficient end-of-loop value computation - missed optimization somewhere? Ulrich Weigand 2012/02/20
- Re: Inefficient end-of-loop value computation - missed optimization somewhere? Richard Guenther 2012/02/21
- Re: Inefficient end-of-loop value computation - missed optimization somewhere? Andrew Pinski 2012/02/21 <=
- [WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere? Ulrich Weigand 2012/02/28
- Re: [WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere? Richard Guenther 2012/02/28