NDS Code Injection

Goal

The goal of the code injection framework is to make it easy to modify an ARM binary so that it redirects to our own code.
There are a few basic methods of stealing control flow, each based on patterns in the generated assembly.

Function pointers are relevant when a table of function pointers is used, such as where a set of commands each map to their own function.
For this, replacing the function pointer is enough to steal control flow.
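
As a minimal sketch of this idea (not hg-engine's actual script), replacing one entry in a table of 32-bit function pointers could look like the following in Python. The names replace_function_pointer, table_offset, and index are hypothetical, the addresses are made up, and setting bit 0 assumes the new function is compiled as THUMB.

```python
import struct

def replace_function_pointer(binary: bytearray, table_offset: int, index: int,
                             new_function: int, thumb: bool = True) -> int:
    """Overwrite one 4-byte little-endian entry in a function pointer table.

    Returns the old pointer so that it can be logged or restored.
    """
    entry = table_offset + index * 4
    (old_pointer,) = struct.unpack_from("<I", binary, entry)
    # Assumption: pointers to THUMB functions carry bit 0 set.
    pointer = (new_function | 1) if thumb else new_function
    struct.pack_into("<I", binary, entry, pointer)
    return old_pointer

# Hypothetical table with three command handlers; the entry at index 1 is replaced.
table = bytearray(struct.pack("<3I", 0x02001001, 0x02001101, 0x02001201))
old = replace_function_pointer(table, 0, 1, 0x023C8000)
```

Only the four bytes of the chosen entry change, which is why this is the least invasive of the three methods.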

Repoints are relevant when the code reads from a table and all that's desired is different elements in that table.
This is particularly useful when the goal is to expand the table beyond the memory originally allotted to it.
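
A repoint can be sketched in two steps: build a larger copy of the table somewhere else, then rewrite every literal in the binary that holds the old table address. This is only an illustration under assumptions: the helper names, the offsets of the references, and the addresses are all hypothetical, and finding the references in a real binary is a separate problem.

```python
import struct

def expand_table(old_table: bytes, entry_size: int, new_count: int) -> bytes:
    """Copy the old entries and zero-fill the newly added ones."""
    new_size = entry_size * new_count
    if new_size < len(old_table):
        raise ValueError("new table is smaller than the original")
    return old_table + bytes(new_size - len(old_table))

def repoint_table(binary: bytearray, references: list, new_table_address: int) -> None:
    """Overwrite each 4-byte literal that pointed at the old table."""
    for offset in references:
        struct.pack_into("<I", binary, offset, new_table_address)

# Hypothetical binary with two literals that point at a table at 0x02100000.
binary = bytearray(16)
struct.pack_into("<I", binary, 4, 0x02100000)
struct.pack_into("<I", binary, 12, 0x02100000)
repoint_table(binary, [4, 12], 0x023C8000)

bigger = expand_table(b"\x01\x02\x03\x04", entry_size=2, new_count=4)
```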

Hooks are a more invasive method: code in the original binary is manually overwritten with a jump to a different location that holds code we manage.

Hook Methodology

There are two hook constructs used by hg-engine: register-specific long jumps and full function replacements.
The register-specific jump is simple:

ldr rN, =SymbolToJump
bx rN
.pool

Compiled to THUMB, this takes 8 bytes (or 10, depending on the alignment of the address it is placed at).
N is a register from 0-7 inclusive (as limited by THUMB).
This allows us to jump directly to a different address, and the code this compiles to is simple:

00 48 - ldr r0, [pc, #0]
00 47 - bx r0
EF BE AD DE - 0xDEADBEEF little endian

With different registers and at an address that is not word-aligned, this comes out to something like...

01 4B - ldr r3, [pc, #4]
18 47 - bx r3
00 00 - padding so that the literal that follows is word aligned
EF BE AD DE - 0xDEADBEEF little endian

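This generalizes to the following series of bytes, where HOOK is the address the hook is written to and TARGET is the address to jump to.
The optional 00 00 padding is present only when HOOK is not word-aligned:

((HOOK & 0x3) != 0), (48 | register), (register << 3), 47, [00, 00,] (LE)TARGET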
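
The byte pattern above can be written out as a short Python function. This is a sketch rather than hg-engine's own script: the function and parameter names are hypothetical, hook_address is the location being overwritten, and target is the address loaded into the register. The target is written as-is, so whether bit 0 needs to be set for a THUMB destination is left to the caller.

```python
def build_register_hook(hook_address: int, target: int, register: int = 0) -> bytes:
    """Build the THUMB 'ldr rN, =target; bx rN' hook for a given location."""
    if not 0 <= register <= 7:
        raise ValueError("THUMB ldr/bx hooks can only use r0-r7")
    if hook_address & 1:
        raise ValueError("THUMB code must start on a halfword boundary")
    # The literal must be word aligned, so a hook at a 2-mod-4 address needs
    # two bytes of padding and a load offset of one word instead of zero.
    padded = (hook_address & 0x3) != 0
    code = bytes([int(padded), 0x48 | register, register << 3, 0x47])
    if padded:
        code += b"\x00\x00"
    return code + target.to_bytes(4, "little")

aligned = build_register_hook(0x02000000, 0xDEADBEEF, register=0)
unaligned = build_register_hook(0x02000002, 0xDEADBEEF, register=3)
```

The two calls reproduce the 8-byte and 10-byte listings shown earlier.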

This works well when there is a free register that is not being used,
like on return from a previous function call.
This is not always guaranteed, however...

Not infrequently, a function has more than 3 arguments.
Per the ARM calling convention, r0-r3 then all carry arguments and r4-r7 must be preserved for the caller, so no low registers are free at the start of these functions.
This is a nightmare for situations where you want to replace entire functions.
I developed a second snippet to take care of this that's a little more in-depth.

Preservation

This second snippet has a big requirement: it must maintain the values of all of the registers. The obvious solution would be to turn to the stack. The stack allows temporary storage of a value as long as the value is retrieved and the stack pointer is restored before control of the program flow is returned to the original ROM. However, the ARM calling convention throws a wrench in the plans by stipulating that functions with 5 or more arguments pass their later arguments on the stack. This adds another requirement to our second type of hook: complete stack preservation.

The only way to achieve this becomes clear: usage of the branch with link (bl) instruction. This is an instruction that only modifies one register: the link register. The link register is kept in order to track program flow and return to the caller of the function once the function is finished.

In order to use bl and still keep the original link register, there needs to be some location where we can store its value. hg-engine takes advantage of a quirk of the NDS that isn't present in its predecessor, the GBA: the ROM is not mapped to executable space. Code always needs to be copied into the EWRAM in order to run.

What this lets us do is store the link register directly alongside the executable code, which we can modify at runtime, and restore it safely afterwards. This even accounts for recursion as long as all of the recursion happens outside of the original ROM code, so we do not have to worry about that. This snippet looks something like this:

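push {r5-r6}              @ borrow two registers without losing their values
ldr r5, =lrStorageSpace   @ r5 = address of the storage word below
mov r6, lr
str r6, [r5]              @ save the original return address next to the code
pop {r5-r6}               @ registers and stack are back to their state on entry
bl SymbolToJump           @ only lr changes before the replacement function runs
ldr r1, =lrStorageSpace
ldr r1, [r1]              @ r1 = saved return address
mov pc, r1                @ return to the original caller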

.pool

lrStorageSpace:
.word 0

After much trial and error, time that could have been spent just reading the ARMv5 architecture reference to learn the same thing, the bytes that this assembles to can be documented. bl does a lot more behind the scenes than the other instructions. In fact, even in THUMB it is actually 2 instructions, one carrying the high half of the branch offset and one carrying the low half.

I initially had only a vague idea that THUMB limits how far a branch can jump, which could have been a problem, and had to resort to implementation attempts to find out whether it actually was.
So I had to work out how the assembler outputs the instructions so that they could be constructed manually for automated insertion. The base encodings of the two instructions turned out to be F000 and F800, in that order. Together, the two instructions encode the distance to jump. The remaining 11 bits of each instruction (after the 5-bit opcode) store half of a 22-bit total. The actual distance in bytes is double what is specified in the instructions (because there is no use in specifying a byte-aligned address to jump to), allowing for a total range of 2^23 bytes. To allow both forward and backward jumps, the value is a two's complement signed integer, which gives a range of -0x400000 to +0x3FFFFE bytes. Given that the NDS has exactly 4 MB of EWRAM, this covers every bit of code editing for our purposes.
This hook requires 0x1C bytes compared to the other method's 0x8, but maintains the registers and stack just fine.
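
The bl encoding described above can be sketched in Python as follows. The function name and parameters are hypothetical. One detail that the prose leaves implicit, and which the sketch relies on, is that THUMB measures the offset from the address of the bl plus 4.

```python
def encode_thumb_bl(source: int, target: int) -> bytes:
    """Encode the THUMB bl pair placed at `source` that branches to `target`."""
    # In THUMB, pc reads as the address of the current instruction plus 4.
    offset = target - (source + 4)
    if offset % 2:
        raise ValueError("bl can only reach halfword-aligned addresses")
    if not -0x400000 <= offset <= 0x3FFFFE:
        raise ValueError("target is out of range for a THUMB bl")
    # 22-bit two's complement count of halfwords, split into two 11-bit halves.
    halfwords = (offset >> 1) & 0x3FFFFF
    high = 0xF000 | (halfwords >> 11)
    low = 0xF800 | (halfwords & 0x7FF)
    return high.to_bytes(2, "little") + low.to_bytes(2, "little")

forward = encode_thumb_bl(0x02000000, 0x02000004)   # offset of zero
backward = encode_thumb_bl(0x02000000, 0x02000000)  # branch to itself
```

An offset of zero yields the bare F000 F800 pair, and a branch to itself yields F7FF FFFE, which shows the sign extending through the high half.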

Automation

That covers the idea behind all of the hook insertions. What is needed now is a way to specify hooks easily so that a script can automate their placement in the binaries. For this, we need a dump of the symbols in the objects that we create and the memory addresses they have been assembled to. Per research done by fellow community member Mikelan98, the NDS Pokémon games have a quite large region of memory that by and large goes unused in gameplay, from 0x023C8000 through the end of the EWRAM. This gives us a region of memory in which to store all of our generated code.
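
As a rough sketch of the symbol dump step, the default output of nm (address, symbol type, name on each line) can be turned into a name-to-address mapping. The function name is hypothetical and the sample addresses are invented for illustration; they are not hg-engine's real layout.

```python
def parse_nm_output(text: str) -> dict:
    """Map symbol names to addresses from default `nm` output."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # undefined symbols are listed without an address
        address, _kind, name = parts
        symbols[name] = int(address, 16)
    return symbols

# Invented sample in the usual "address type name" layout.
sample = """\
023c8000 T CreateBoxMonData
023c8120 T CheckCanTakeItem
         U memcpy
"""
symbols = parse_nm_output(sample)
```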

The NDS introduced an overlay system in which code is split into individual binaries that are selectively loaded at runtime. This code is position-dependent, unlike REL files from e.g. the N64 or DLL files, but we forgive Nintendo for their early embedded systems explorations. Our code can be added directly as a new overlay and loaded using the built-in system when the game starts, giving us a permanent code expansion. The format decided on for specifying hooks was simple:

overlayNum symbolName addressToJumpFrom [usedRegister]
for example...
arm9 CreateBoxMonData 0206DED0
0012 CheckCanTakeItem 02241334 0

Here, CreateBoxMonData is jumped to from the memory address 0x0206DED0 in the ARM9 using the second hook that maintains registers and stack.
CheckCanTakeItem is jumped to from the memory address 0x02241334 in overlay 12 using r0 to get there, clobbering r0 in the process with the destination address.
These addresses and symbols are easily obtained by assembling with as, linking with ld, and running nm on the resulting linked object file.
A Python script was then made to go through and parse these. Further similar formats were made for table repoints, function pointers, and even raw byte replacements as detailed at the beginning. Keen observers may even notice that the register-specific hooks are duplicated for ARM functions, of which there are not many in HeartGold.
The overlay system can even be abused more to selectively load code and extend the usability of this roughly 96 kB far beyond the initial scope. That is a subject for a different article though.
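
For illustration, one line of the hook format described above could be parsed as in the sketch below. This is not the actual hg-engine script. HookEntry and parse_hook_line are hypothetical names, and reading the overlay number as decimal is an assumption based on "0012" being described as overlay 12.

```python
from typing import NamedTuple, Optional

class HookEntry(NamedTuple):
    overlay: Optional[int]   # None means the ARM9 binary
    symbol: str
    address: int
    register: Optional[int]  # None means the register-preserving hook

def parse_hook_line(line: str) -> HookEntry:
    """Parse 'overlayNum symbolName addressToJumpFrom [usedRegister]'."""
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError("expected 3 or 4 fields: " + line)
    # Assumption: overlay numbers are decimal, addresses are hexadecimal.
    overlay = None if fields[0].lower() == "arm9" else int(fields[0])
    register = int(fields[3]) if len(fields) == 4 else None
    return HookEntry(overlay, fields[1], int(fields[2], 16), register)

full_hook = parse_hook_line("arm9 CreateBoxMonData 0206DED0")
register_hook = parse_hook_line("0012 CheckCanTakeItem 02241334 0")
```

A missing register field selects the larger hook that preserves registers and stack, matching the two examples above.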

Overall, I am satisfied with the result.