
GCC asm block optimizations on x86_64
Darryl L. Miles
Tue Aug 28 12:06:45 2007

Please find enclosed a small test sample of code. The comments contained within it explain related points in the generated code.

[1] The issue here is the way %edx is zeroed. I would expect zeroing a register (or memory, or anything else) to be a special optimization case. In this code it is clear that the CPU condition flags hold no useful value, so "xorl %edx,%edx" would make the most sense, rather than finding another register to load with zero and then copying from it. Interestingly enough, -O generates "mov $0,%r8d", while -O2 generates "xor %r8d,%r8d".

[2] No issue here; this is just a note to explain why %r9 was brought into play. The DIV instruction implicitly uses RDX:RAX, and RDX is also the register used to pass the 3rd function argument.

[3] Since %r8 was brought into play by the compiler-generated code, I take it that %r8 is a caller-saved register in the ABI. That means we have a register free for use (even after it may have just been used to zero %edx in issue [1] above), so using %r8 here would be a much better option for what %ebx is allocated for. I take it that %ebx is callee-saved, so we get a push followed by a pop, which is unnecessary memory access when a register is available.
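For readers without the attachment, here is a minimal sketch of the kind of asm block points [1] to [3] are about. It is not the enclosed test sample: the function name sketch_u64_div, its signature, and the operand layout are assumptions made for illustration. The dividend is tied to RAX and a zero to RDX, which is the RDX:RAX constraint from [2]; how the zero gets materialized is left to the compiler, which is the subject of [1]. With a pointer as the third parameter, the System V calling convention delivers it in RDX, so the compiler has to move it elsewhere before the DIV.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the code from the enclosed test sample.
 * Returns dividend / divisor and stores dividend % divisor in *remainder. */
static inline uint64_t sketch_u64_div(uint64_t dividend, uint64_t divisor,
                                      uint64_t *remainder)
{
    assert(divisor != 0);   /* DIV raises a divide error on a zero divisor */
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t quot, rem;
    /* DIV divides RDX:RAX by its operand: quotient in RAX, remainder in RDX.
     * "0" and "1" tie the inputs to the same registers as the outputs,
     * so RAX holds the dividend and RDX holds zero on entry. */
    __asm__ ("divq %[divisor]"
             : "=a" (quot), "=d" (rem)
             : "0" (dividend), "1" ((uint64_t)0), [divisor] "r" (divisor)
             : "cc");
    *remainder = rem;
    return quot;
#else
    /* Portable fallback so the sketch also builds on other targets. */
    *remainder = dividend % divisor;
    return dividend / divisor;
#endif
}
```

Compiling a caller of this at -O and -O2 and comparing the "objdump -d" output is one way to see which registers the compiler picks around the asm block.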

On a side note, when I looked at the code generated for main() ("objdump -d u64_divide") I could see that the u64buf structures appear to be aligned to 16 bytes. I would have expected the compiler to see that value.u32.lw0 and value.u64.ll0 both compute to offset 0, that the largest type width is 8 bytes, and to align accordingly. Maybe there is another reason for this?
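One way to separate what the type requires from where the compiler happens to place a given object is to check the two directly. The sketch below is a guess at the shape of u64buf: only the member names u32.lw0 and u64.ll0 come from this post, while the union layout, the lw1 member, and the helper names are assumptions. It needs a C11 compiler for _Alignof and static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the real u64buf is in the enclosed sample. */
union u64buf {
    struct { uint32_t lw0; uint32_t lw1; } u32;   /* lw1 is an assumption */
    struct { uint64_t ll0; } u64;
};

/* Both views start at offset 0, as the post expects. */
static_assert(offsetof(union u64buf, u32.lw0) == 0, "lw0 at offset 0");
static_assert(offsetof(union u64buf, u64.ll0) == 0, "ll0 at offset 0");

/* Alignment the type itself requires. */
static inline size_t u64buf_required_align(void)
{
    return _Alignof(union u64buf);
}

/* Whether a particular object sits on an n-byte boundary. */
static inline int is_aligned_to(const void *p, size_t n)
{
    return ((uintptr_t)p % n) == 0;
}
```

Calling is_aligned_to(&value, 16) on a local in main() shows whether that object landed on a 16-byte boundary, while u64buf_required_align() reports only the minimum the type demands. The compiler is free to place an object more strictly than that minimum.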

If it is not possible to get GCC to emit assembly code nearer the ideal, does this test case at least provide anything useful to learn from?