One of the disadvantages is that undefined behaviour vastly reduces the ways in which you can express your intent, and that includes ways that compilers tend to compile into faster code (unless they miscompile it because they make assumptions based on undefined behaviour).
As a simple example, consider a programming language implementation written in assembly language for AArch64, or in C. (This example is actually derived from a real programming language implementation.) In this implementation, you want to always report division by zero and division overflow as errors. However, on AArch64 (the 64-bit ARM instruction set) the CPU does not trap on these conditions, but produces some result, so you have to perform the checking yourself (32-bit ARM and 32-bit and 64-bit PowerPC also work like this). E.g., you could write it as follows:
sdiv x2, x0, x1
cbz  x1, 3c
cmp  x0, x4
ccmn x1, #0x1, #0x0, eq  // eq = none
b.eq 3c                  // b.none
Here you start the division, and then perform the checks while the divider works. If you want to express this in (more-defined) C, you can do it as follows:

q = n1/n2;
if (n2==0) abort();
if (n2==-1 && n1==LONG_MIN) abort();
If the compiler compiles it as intended (gcc-6.3.0 and 8.3.0 happen to do so), this is portable between CPUs: on CPUs where the division does not trap on these conditions, the ifs catch them; on CPUs where the division does trap on these conditions, the traps already produce the desired result.
However, this code has a problem: division by zero and signed integer overflow are undefined behaviours, so, because the division comes first, the compiler might assume that n1 and n2 are already such that the ifs can never trigger, even on CPUs where the division does not trap on such inputs. Consequently, it could "optimize" the ifs away, and the programming language we want to implement would not catch these cases. To avoid this problem, we could rewrite the code as

if (n2==0) abort();
if (n2==-1 && n1==LONG_MIN) abort();
q = n1/n2;
The disadvantage of this variant is that, if compiled in a straightforward way, the division is only started after the checks are performed, so there is no overlap between the checking and the long-latency division. And indeed, gcc-6.3.0 and gcc-8.3.0 compile this into code that performs the sdiv after the checking code:
cbz  x1, 50
cmp  x2, x4
ccmn x1, #0x1, #0x0, eq  // eq = none
b.eq 50                  // b.none
sdiv x0, x2, x1
You can find the code I have used for measuring various CPUs in the subdirectories here. You can build a benchmark by changing into its subdirectory, e.g. sub, and then typing

cd sub
make -f ../Makefile

You should take a look at the machine code (with objdump -d divloop*.o ooo*.o) to see if the compiler actually compiled the benchmarks as intended.
The division-before-checking variant takes 7 cycles per iteration on a Cortex-A53, while the checking-before-division variant takes 8 cycles (for divisions with quotient 100; 5 vs. 6 cycles for quotient 1, larger numbers for larger quotients, but, surprisingly to me, not bigger differences). So, if you want to avoid undefined behaviour, you slow down this program.
After a misprediction, the situation changes: In the check-before-division code, the CPU has to fetch and decode the checking instructions before it sees the division, and this can take several cycles. In the division-before-check code, the CPU sees the division instruction right away, and if the data is present (which it usually is after a misprediction), can start dividing several cycles earlier. Programming language interpreters have a relatively high branch misprediction rate on bytecode/virtual machine instruction dispatch, so this actually has practical relevance.
The microbenchmarks for OoO CPUs (which also work for in-order CPUs) consist of several source files, among them ooo.h. The resulting binaries are ooomb and oooub (the columns of the results table below). They use a pseudo-random switch between 4 variants of the division code (so on CPUs that predict the switch, there is a 25% probability of correctly predicting the case and a 75% chance of mispredicting). Right at the start of each switch case, there is the division code. The quotient is used in the computation of the next switch value, to ensure that it plays a role in when the division can start (in real code, long-latency operations like divisions are often on the critical data-flow path of some computation, so that's realistic).
Results (in cycles per iteration):
 ooomb  oooub
  55.6   55.8  Cortex-A9 (Pandaboard)            arm           gcc-4.6.3
  22.2   24.1  Cortex-A53 (Odroid C2, in-order)  aarch64       gcc-6.3.0
  29.7   31.2  Cortex-A72 (Rockpro64)            aarch64       gcc-6.3.0
  23.9   25.4  Cortex-A73 (Odroid N2)            aarch64       gcc-6.3.0
  41.9   41.9  PPC 7447A (iBook G4)              ppc (32-bit)  gcc-4.3.2
 130.0  130.0  PPC 970 (PowerMac G5)             ppc64         gcc-4.4.5

We see the predicted effect on the Cortex-A53, A72, and A73, but not on the other CPUs.