PowerPC has a 32 bit CR register, which is used to store flags for results of
computations. Most instructions have an optional bit that tells the CPU whether
the flags should be updated. This 32 bit register actually contains 8 sets of 4
flags: Summary Overflow (SO), Equals (EQ), Greater Than (GT), Less Than (LT).
These 8 sets are usually called CR0-CR7 and accessed independently. In the most
common operations, the flags are computed from the result of the operation in
the following fashion:
* EQ is set iff result == 0
* LT is set iff result < 0
* GT is set iff result > 0
* (Dolphin does not emulate SO)
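As a rough illustration of the rules above, the 4-bit field for one CRn could be computed from a signed 32 bit result as follows. This is only a sketch: `ComputeCRField` is a hypothetical helper name, not a Dolphin function, and the constants assume the usual PowerPC field layout in which LT is the most significant bit and SO the least significant one.

```cpp
#include <cstdint>

// Bit values inside one 4-bit CR field (assumed layout: LT is the MSB).
constexpr uint32_t CR_SO = 1, CR_EQ = 2, CR_GT = 4, CR_LT = 8;

// Hypothetical helper: derive the CR field from a signed 32 bit result.
// SO is never set here, matching the note that it is not emulated.
constexpr uint32_t ComputeCRField(int32_t result)
{
  if (result == 0)
    return CR_EQ;
  if (result < 0)
    return CR_LT;
  return CR_GT;
}
```

Note the two comparisons: in a JIT these are the branches that the rest of this description tries to avoid.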
While X86 architectures have a similar concept of flags, it is very difficult
to access the FLAGS register directly to translate its value to an equivalent
PowerPC value. With the current Dolphin implementation, updating a PPC CR
register requires CPU branching, which has a few performance issues: it uses
space in the BTB, and in the worst case (!GT, !LT, EQ) requires 2 branches not
taken.
After some brainstorming on IRC about how this could be improved, calc84maniac
figured out a neat trick that makes common CR operations way more efficient to
JIT on 64 bit X86 architectures. It relies on emulating each CRn bitfield with
a 64 bit register internally, whose value is the result of the operation from
which flags are updated, sign extended to 64 bits. Then, checking if a CR bit
is set can be done in the following way:
* EQ is set iff LOWER_32_BITS(cr_64b_val) == 0
* GT is set iff (s64)cr_64b_val > 0
* LT is set iff bit 62 of cr_64b_val is set
To take a few examples, if the result of an operation is:
* -1 (0xFFFFFFFFFFFFFFFF) -> lower 32 bits not 0       => !EQ
                          -> (s64)val (-1) is not > 0  => !GT
                          -> bit 62 is set             => LT
                          !EQ, !GT, LT
* 0  (0x0000000000000000) -> lower 32 bits are 0       => EQ
                          -> (s64)val (0) is not > 0   => !GT
                          -> bit 62 is not set         => !LT
                          EQ, !GT, !LT
* 1  (0x0000000000000001) -> lower 32 bits not 0       => !EQ
                          -> (s64)val (1) is > 0       => GT
                          -> bit 62 is not set         => !LT
                          !EQ, GT, !LT
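The three checks can be sketched in C++ as below. The function names (`CRValFromResult`, `IsEQ`, `IsGT`, `IsLT`) are made up for this example and are not taken from the Dolphin source. The sketch only mirrors the rules listed above, assuming the stored value is the 32 bit result sign extended to 64 bits.

```cpp
#include <cstdint>

// Hypothetical helper: sign extend a 32 bit result into the 64 bit
// internal representation of a CR field.
constexpr uint64_t CRValFromResult(int32_t result)
{
  return static_cast<uint64_t>(static_cast<int64_t>(result));
}

// EQ: the lower 32 bits are all zero.
constexpr bool IsEQ(uint64_t cr_val)
{
  return static_cast<uint32_t>(cr_val) == 0;
}

// GT: the value is strictly positive when read as a signed 64 bit integer.
constexpr bool IsGT(uint64_t cr_val)
{
  return static_cast<int64_t>(cr_val) > 0;
}

// LT: bit 62 is set (for a sign extended negative result it always is).
constexpr bool IsLT(uint64_t cr_val)
{
  return ((cr_val >> 62) & 1) != 0;
}
```

None of the checks needs a branch to produce the stored value: updating a CR field after an operation reduces to a sign extension and a store.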
Sometimes we need to convert PPC CR values to these 64 bit values. The
following convention is used in this case:
* Bit 0 (LSB) is set iff !EQ
* Bit 62 is set iff LT
* Bit 63 is set iff !GT
* Bit 32 is always set to disambiguate between EQ and GT
Some more examples:
* !EQ, GT, LT -> 0x4000000100000001 (!B63, B62, B32, B0)
    -> lower 32 bits not 0            => !EQ
    -> (s64)val is > 0                => GT
    -> bit 62 is set                  => LT
* EQ, GT, !LT -> 0x0000000100000000
    -> lower 32 bits are 0            => EQ
    -> (s64)val is > 0 (note: B32)    => GT
    -> bit 62 is not set              => !LT
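The conversion convention can be written out as a pair of functions. This is a sketch under assumptions rather than the actual Dolphin code: `PPCToInternal` and `InternalToPPC` are hypothetical names, the 4-bit field is assumed to use the usual layout (LT = 8, GT = 4, EQ = 2, SO = 1), and SO is dropped since it is not emulated.

```cpp
#include <cstdint>

// Bit values inside one 4-bit PPC CR field (assumed layout: LT is the MSB).
constexpr uint32_t CR_SO = 1, CR_EQ = 2, CR_GT = 4, CR_LT = 8;

// Hypothetical helper: build the 64 bit internal value from a PPC CR field.
constexpr uint64_t PPCToInternal(uint32_t field)
{
  uint64_t val = uint64_t{1} << 32;  // bit 32 is always set
  if (!(field & CR_EQ))
    val |= uint64_t{1};              // bit 0 is set iff !EQ
  if (field & CR_LT)
    val |= uint64_t{1} << 62;        // bit 62 is set iff LT
  if (!(field & CR_GT))
    val |= uint64_t{1} << 63;        // bit 63 is set iff !GT
  return val;
}

// Hypothetical helper: recover the PPC CR field (without SO) using the
// same three checks that are applied to sign extended results.
constexpr uint32_t InternalToPPC(uint64_t val)
{
  uint32_t field = 0;
  if (static_cast<uint32_t>(val) == 0)
    field |= CR_EQ;
  if (static_cast<int64_t>(val) > 0)
    field |= CR_GT;
  if ((val >> 62) & 1)
    field |= CR_LT;
  return field;
}
```

Bit 32 is what keeps the EQ and GT combination working: with EQ set the lower 32 bits must be zero, so without bit 32 the whole value would be 0 and the signed comparison for GT would fail.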
This isn't technically the correct place to have the downcount variable, but it is similar to what PPSSPP does to gain a bit of extra speed on ARM.
We access this variable quite a bit: it is decremented at every exit of a block.
On ARM, loading and storing the value used to take four instructions; it now takes only two.
This gives an average gain of 1 FPS in most games.
Examples:
Crazy Taxi: 54FPS -> 55FPS
Luigi's Mansion: 20FPS -> 21FPS
Wind Waker(Save Screen): 27FPS -> 28FPS
This seems to average a 6 MHz to 16 MHz improvement in emulated CPU core speed in the few games I've tested.
Interpreter::Helper_UpdateCR1 doesn't use the argument passed to it. It pulls its value from the FPSCR register.
Also, there was a declaration of Interpreter::Helper_UpdateCR1(float) in addition to Helper_UpdateCR1(double), but no
definition of it has ever existed. Remove the function declaration.
The alert apparently triggers on Midway Arcade Treasures 2; given that the
game otherwise works fine, it's not a high priority to accurately emulate
the bit in question.
Fixes issue 7197.
The workaround of using fixed underlying types produces lots of warnings
in GCC because now the bit-fields are too small for the value range used
for conversion semantics.
Our defines never made it clear whether they meant 64-bit or x86_64.
This makes a clear cut between bitness and architecture.
This commit also has the side effect of bringing up aarch64 compilation support.
This method doesn't involve messing around with the quirks of the x87
FPU and should be reasonably fast. As a bonus, it does the correct thing
for out-of-range doubles.
However, it is also a little slower and only benefits programs that rely
on undefined behavior so it is disabled for now.