Optimistically assume used GQRs are 0 in blocks that only use one GQR, and
bail at the start of the block and recompile if that assumption fails.
Many games use almost entirely unquantized stores (e.g. Rebel Strike, Sonic
Colors), so this will likely be a big performance improvement across the board
for games with heavy use of paired singles.
lmw/stmw weren't properly setting input and output registers since they use multiple registers.
dcbz was just missing a flag in the instruction tables.
Detects a situation where the game is writing to the dcache at the address being DMA'd. As we do not have dcache emulation, invalid data is being DMA'd causing audio glitches. The following code detects this and enables the DMA to complete instantly before the invalid data is written.
Added accurate ARAM DMA transfer timing.
Removed the addition of DSP exception checking.
Updated ARAM DMA and FIFO write exception checking to uses these types.
Conflicts:
Source/Core/Core/PowerPC/Interpreter/Interpreter_Tables.cpp
Source/Core/Core/PowerPC/PPCTables.h
Also correct behavior with regards to which bits in XER are treated as zero
based on a hwtest (probably doesn't affect any real games, but might as well
be correct).
This should dramatically reduce code size in the case of blocks with
lots of branches, and certainly doesn't hurt elsewhere either.
This can probably be improved a good bit through smarter tracking of register
usage, e.g. discarding registers that are going to be overwritten, but this
is a good start and should help reduce code size and register pressure.
Unlike that sort of change, this is a "safe" patch; it only flushes registers,
which can't affect correctness, unlike actually discarding data.
As part of this, refactor PPCAnalyst to support distinguishing between
float and integer registers (to properly handle instructions that access
both, like floating-point loads and stores).
Also update every instruction in the interpreter flags table I could find
that didn't have all the correct flags.
Tries as hard as possible to push carry-using operations (like addc and adde)
next to each other. Refactor the instruction reordering to be more flexible
and allow multiple passes.
353 -> 192 x86 instructions on a carry-heavy code block in Pokemon Puzzle.
12% faster overall in Pokemon Puzzle; probably less in typical games (Virtual
Console games seem to be carry-heavy for some reason; maybe a different
compiler?)
This value was "helpful" for debugging when the stack got corrupted.
Helpful that if gpr[1](Which is the stack pointer with PPC ABI) is zero then the interpreter would spam huge amounts of annoy text saying that we
managed to get in to a "corrupted" state.
This is incremented every instruction on the interpreter, or every block run on the JIT64....Only if debugging is enabled(JIT64 it is a const
variable)
The message is only outputted when interpreter is used and debugging is enabled.
This time, check the address carefully beforehand, since apparently some games
do horrible things like running it on non-RAM addresses, or at the very least
virtual addresses.
Doesn't support all the FPSCR flags, just the FPRF ones.
Add PPCAnalyzer support to remove unnecessary FPRF calculations.
POV-ray benchmark with enableFPRF forced on for an extreme comparison:
Before: 1500s
After, fmul/fmadd only: 728s
After, all float: 753s
In real games that use FPRF, like F-Zero GX, FPRF previously cost a few percent
of total runtime.
Since FPRF is so much faster now, if enableFPRF is set, just do it for every
float instruction, not just fmul/fmadd like before. I don't know if this will
fix any games, but there's little good reason not to.
In particular, even in code that only runs on x86-64, you can't use
PRIx64 for size_t because, on OS X, one is unsigned long and the other
is unsigned long long and clang whines about the difference. I guess
you could make a size_t specifier macro, but those are horribly ugly, so
I just used casting.
Anyone want to make a nice (and slow) template-based printf?
Now without bare 'unsigned'.