Voyage, travel, and change of place impart vigor.
— Seneca the Younger, ca 4 BC – 65 AD
Sidreloc
Sidreloc is a tool for relocating SID tunes any
number of whole pages, using a novel but remarkably simple algorithm. It can
also relocate all zero-page variables used by the tune.
Sidreloc is capable of relocating 91% of HVSC #56 out of the box.
Many of the remaining tunes can be relocated by tweaking the settings to fit
them.
Download
Here's the manual page for sidreloc.
While the tool is SID specific, the fundamental algorithm could easily be
adapted to relocate all sorts of 6510/6502 machine code
that meets a couple of requirements (see "Observations" below). Sidreloc is
primarily a linux program, but it's open source (MIT license) and you are
encouraged to port it to different operating systems. Good patches will be
merged into the official version.
Background
SID tunes are music files for the Commodore 64 home
computer. They are typically 3-4 KB large, and consist of a small
header followed by executable 6510 machine code. The header contains a 16-bit
load address; the code must be loaded at this address, and then executed. This
can be done in an emulator running on a modern computer (e.g. sidplay) or on a real C64.
Of the over 40000 SID tunes found in the excellent HVSC, many have been ripped from games or
demos, and the original source code is long lost. Some tunes were written in
assembly language, and then linked to run at a particular address. Others were
programmed in machine code
monitors, where addresses are hardcoded from the start. Yet others were
made using interactive tools such as trackers, but then exported into
executable code linked to run at some given load address.
The following charts show a breakdown of all SID tunes in the HVSC based on
load address and size.
(Note: BASIC tunes have been omitted here and in the following.)
Unfortunately, it is sometimes difficult to combine an existing SID tune with
other code, such as accompanying a new demo effect with a historical SID tune.
Demo effects often rely on bank switching and other
address-related trickery, and need to occupy certain memory regions. On the
other hand, SID tunes are just programs, and should be able to execute from any
place, if it weren't for the irritating little fact that they're already
linked.
The problem
Once linked, it is generally very difficult to relocate (move) a piece of
machine code. This is especially true for 6510/6502 code. To illustrate, here
are two small routines that implement the same functionality: Copying a
2 KB chunk of memory from the current program into the area $3800-$3fff.
This is something you might do in order to set up a font for a C64 program.
Self-modifying version:
* = $1000 ; This informs the assembler that we'll be loaded at $1000
ldy #8 ; 8 iterations of the outer loop
ldx #0 ; 256 iterations of the inner loop
loop
lda font,x ; Copy one byte...
sta $3800,x ; ...into the right place.
inx
bne loop
dey
beq done
inc loop+2 ; For each iteration of the outer loop, modify the load
inc loop+5 ; and store addresses.
jmp loop
done
rts
font
; 2 kb of data follows...
If you are unfamiliar with 6510 code, this might seem utterly terrifying,
but it's standard practice. The index registers (X and Y) are 8 bits wide,
and there's no obvious way to loop efficiently over a large array of bytes
without resorting to self-modifying code.
This is what we get if we run the code through an assembler and a linker,
and then a disassembler:
1000 a0 08 ldy #$08
1002 a2 00 ldx #$00
1004 bd 1a 10 lda $101a,x
1007 9d 00 38 sta $3800,x
100a e8 inx
100b d0 f7 bne $1004
100d 88 dey
100e f0 09 beq $1019
1010 ee 06 10 inc $1006
1013 ee 09 10 inc $1009
1016 4c 04 10 jmp $1004
1019 60 rts
101a ...
The branches are relative, but the jmp uses an absolute address which
depends on where in memory the program is located. To relocate this code, we must
change the operand of the jmp instruction. When relocating a whole
number of pages, we only need to change the most significant byte (MSB) of the
jmp address, i.e. the $10 byte at $1018. For similar reasons, we need
to modify the bytes at $1006, $1012 and $1015. However, we should not change
the byte at $1009, because the font should still be copied into $3800.
Note, by the way, that all the changed bytes happen to have the value $10. This
is because all of the code fits into one page of memory. If the code were
bigger, the font data would be pushed into the next page, and the byte at
address $1006 would be $11, but we'd still need to change it.
So far, all of the addressing modes were based on literal 16-bit addresses.
But now consider the following code:
Explicit pointer version:
* = $1000
lda #<font ; a0/a1 keep track of the reading position.
sta $a0
lda #>font
sta $a1
lda #$00 ; a2/a3 keep track of the writing position.
sta $a2
lda #$38
sta $a3
ldy #0 ; Y will remain zero throughout the routine.
loop
lda ($a0),y ; Copy a byte.
sta ($a2),y
lda $a0 ; Increment the reading pointer.
clc
adc #$01
sta $a0
lda $a1
adc #$00
sta $a1
lda #$01 ; Increment the writing pointer.
clc
adc $a2
sta $a2
lda #$00
adc $a3
sta $a3
lda $a0 ; Check if we're done.
cmp #<fontend
bne loop
lda $a1
cmp #>fontend
bne loop
rts
font
; 2 kb of data follows...
fontend
This routine is not particularly fast, but at least it's not self-modifying.
However, it turns out to be much more tricky to relocate.
This is what the disassembled code looks like:
1000 a9 3d lda #$3d
1002 85 a0 sta $a0
1004 a9 10 lda #$10
1006 85 a1 sta $a1
1008 a9 00 lda #$00
100a 85 a2 sta $a2
100c a9 38 lda #$38
100e 85 a3 sta $a3
1010 a0 00 ldy #$00
1012 b1 a0 lda ($a0),y
1014 91 a2 sta ($a2),y
1016 a5 a0 lda $a0
1018 18 clc
1019 69 01 adc #$01
101b 85 a0 sta $a0
101d a5 a1 lda $a1
101f 69 00 adc #$00
1021 85 a1 sta $a1
1023 a9 01 lda #$01
1025 18 clc
1026 65 a2 adc $a2
1028 85 a2 sta $a2
102a a9 00 lda #$00
102c 65 a3 adc $a3
102e 85 a3 sta $a3
1030 a5 a0 lda $a0
1032 c9 3d cmp #$3d
1034 d0 dc bne l1012
1036 a5 a1 lda $a1
1038 c9 18 cmp #$18
103a d0 d6 bne l1012
103c 60 rts
103d ...
To relocate this, our algorithm needs to be able to figure out that the
operand of the second lda (the byte at $1005) should be relocated,
because it will be stored into $a1 and later end up as the MSB of an address,
as part of an indirect addressing mode. But indirect addressing could just as
well be performed on e.g. addresses that have been computed as the sum of a
relocatable base address and a non-relocatable offset, as in the subsequent
iterations of the loop. Furthermore, the operand of one of the cmp
instructions (the $18 byte at $1039) must also be relocated, otherwise the loop
won't terminate properly.
This second example also illustrates another peculiarity of 6510 code:
Zero-page addresses are not only fast, but also required for indirect
addressing. Hence SID tunes tend to use at least a couple of them. On the C64,
the four bytes at $fb-$fe are documented as "Free Zero Page space for User
Programs", so they get used a lot, increasing the probability of a collision if
we try to combine an existing SID tune with some other code.
Thus we also need to be able to relocate zero-page accesses. Like
regular pointers, zero-page addresses may be computed and
stored into future instruction operands as part of self-modifying code.
So far, we have seen that pointers may appear inside 16-bit operands, but also
in separate 8-bit operands spread out all over the code. Finally, pointers may
appear in data. SID tunes typically contain tables of pointers to data
structures called patterns. These pointers are either stored as regular words
in little-endian order, or split into one LSB table and one MSB table.
Observations
The approach taken in sidreloc hinges on the following crucial observations:
SID tunes have a well-defined API: We're supposed to load the tune into
memory, call an initialisation routine once, then call a playroutine regularly
at some rate (e.g. the video frame rate). Some SID tunes also install a
secondary timer interrupt which is called very rapidly in order to play
samples ("digis").
All of this can be emulated, and we can make reasonable assumptions about how
many calls we need to make to these routines in order to exercise all of the
code.
In practice, it is enough to be able to relocate the tune a whole number of
pages in either direction. Thus we only need to be concerned with address MSBs.
The MSB of almost every pointer that is computed in a 6510
program is the sum of:
- one or more program bytes (bytes that appear verbatim somewhere in
the linked machine code), of which exactly one is the MSB of an address, and
- possibly some offset that is independent of the program
location (such as a computed index into a table).
The output of a relocation algorithm is a set of relocation flags, one
for each program byte, expressing whether or not the relocation offset (in
pages) should be added to this particular byte.
(The same principle applies to zero-page relocation, but some extra
book-keeping is needed; check the source code for details.)
It is not possible to write a perfect algorithm that works for
every SID tune, because a playroutine could theoretically e.g. read out the current
program counter and store it into a critical SID register like $d404. Being able
to relocate most SID tunes is good enough.
We can easily check whether the algorithm worked for a particular SID tune by
emulating the original code and the relocated code side by side, and comparing
the contents of the SID registers after each call to the playroutine.
The last two observations taken together also imply that we don't have to be
conservative in what we do; if we're unsure about something, we make an
educated guess that is correct in most cases, and later check if it was true
for this particular tune.
The algorithm
The trick is to emulate a 6510 processor and a 64 KB memory, but inside
every memory cell (including the registers) we not only track the current 8-bit
value, but also the list of program bytes which contributed to this value.
Whenever an effective address is computed inside the emulated processor, we
check whether this address is in the relocation range (e.g. the extent of the
program), and if so, we note down the fact that exactly one of these program
bytes must be relocated.
When a byte is copied e.g. from memory into a register, not only is the
value copied, but the list of contributing program bytes is copied along with
it. When a sum is computed (adc instruction), the result is not only
the computed sum, but also the concatenation of the lists of program bytes
contributing to both operands. When other computations are performed, the resulting
value is associated with an empty list of contributors.
Furthermore, every time a comparison is made (using cmp or
eor), we note down that the two lists of program bytes contributing to
the operands must have equal relocation status: Either they're both independent
of the program location, or they're both dependent of the program location. In
the latter case this implies that each list must contain exactly one program
byte which is relocated.
Having visited all parts of the code, we thus end up with a collection of
facts (equations) relating the relocation flags of program bytes to each other.
Then we use a constraint
solver to find a solution to this set of equations. Flags of unknown status
(corresponding to program bytes which do not appear in any equation) are
treated as "don't relocate".
Zero-page
For zero-page relocation, we also need to keep track of which program bytes
contribute to which zero-page addresses. We need a bitfield for this, since a
program byte often points to more than one zero-page variable: The operand
of lda ($a0),y contributes to the effective addresses $a0 and $a1.
We still get equations describing lists of program bytes of which exactly one
should be relocated.
Once we have a solution to the set of equations, we visit each program byte
that was flagged for relocation. For regular pointer MSBs, we add the main
relocation offset. For zero-page addresses, we add the relocation offset for
those zero-page addresses to which the program byte contributed. For this to
work, some zero-page addresses must of course move together.
Example
When sidreloc is run with the verbose option (-v), it prints a map of
all the program bytes. The character R indicates that the relocation
offset (difference in pages between the old and new code locations) has been
added to this byte. The character Z indicates that some zero-page
relocation offset has been added to this byte.
Here's the output you get when relocating Zoids to page $42, with a free
zero-page region starting at $80. If you are familiar with Hubbard's
playroutine, you'll immediately recognise the pattern pointers MSB table (a
large stretch of consecutive R bytes) and the subtune tables (one
"===RRR" chunk per subtune, consisting of three track pointers). The
first third of the file is code, and the rest is mostly data.
Zoids, Rob Hubbard, 1986 Martech, $1000-$1a7f, 3 subtunes
Relocating from $1000-$5aff to $4200-$8cff
Analysing subtune 1
Analysing subtune 2
Analysing subtune 3
Program map:
1000, 4200: ==R==R==R==R========R====R==R==R==R=====R==R===?==========?====?
1040, 4240: ==R==R====R====R==R==R==R===R==R====R=Z==R=Z==R====R...==R==R=Z=
1080, 4280: ?===?====R==R====R==R==R==R...===R=Z==R=Z=?==R==R=?==R=Z==R==R=?
10c0, 42c0: ==R=?===R=?===?=====R====R==R====Z====R==R==R==R==Z==R====R==R==
1100, 4300: R==R=====R==R=====R==R==R==R==R==R======R==R==R==R=====R=====R==
1140, 4340: ===R=====R=====R==R==R==R==R=Z=?======R==R==R==R==R=?====R====R=
1180, 4380: ?====?========R======R==R==R==R==R==R==R====R=?=?===?==R==R=====
11c0, 43c0: R==R==R==R==R===R==R====R==R==R==R==R==R=?=?====R======R==R==R==
1200, 4400: R==R==R==R==R==R=====R=====R=?====R==R==R==R==R=====R==R====R=?=
1240, 4440: =R=.==R==R=?==R==R====R===R===R=?=?==?====R==R===R==R===R=?=?==?
1280, 4480: ====R==R==R===R======R=====R==R==R===?==R==R=?=====R==R==R=====R
12c0, 44c0: =?==R=====R===R==R==R=====R=?==R=====R=?====R====R====R=?==?==R=
1300, 4500: =R====R==R=====R=?====R====?=====R=?====R=?======R======R=?====R
1340, 4540: =.==R==R=====R=?====R=?====R==?==R==R====R==R==R==R=====R=======
1380, 4580: =R=........................??............????..????????..????...
13c0, 45c0: .????..?????????????????????????????????????????????????????????
1400, 4600: ?????????????????????????????????????????????............??....?
1440, 4640: ?..===???????????????????????????????????..?..??????????????????
1480, 4680: ????????????????????????????????????????????????????????????????
14c0, 46c0: ??????????????????????........????????................??????===R
1500, 4700: RR===RRR===RRR===============================RRRRRRRRRRRRRRRRRRR
1540, 4740: RRRRRRRRRRRR================================================?===
1580, 4780: ================================================================
15c0, 47c0: ================?===============================================
1600, 4800: =====================================?==.==.==?==.====.====?????
1640, 4840: ????????????????????????????????????????????????????????????????
1680, 4880: ????????????????????????????????????????????????????????????????
16c0, 48c0: ????????????????????????????????????????????????????????????????
1700, 4900: ????????????????????????????????????????????????????????????????
1740, 4940: ????????????????????????????????????????????????????????????????
1780, 4980: ????????????????????????????????????????????????????????????????
17c0, 49c0: ????????????????????????????????????????????????????????????????
1800, 4a00: ????????????????????????????????????????????????????????????????
1840, 4a40: ????????????????????????????????????????????????????????????????
1880, 4a80: ????????????????????????????????????????????????????????????????
18c0, 4ac0: ????????????????????????????????????????????????????????????????
1900, 4b00: ????????????????????????????????????????????????????????????????
1940, 4b40: ????????????????????????????????????????????????????????????????
1980, 4b80: ????????????????????????????????????????????????????????????????
19c0, 4bc0: ????????????????????????????????????????????????????????????????
1a00, 4c00: ?????????????????????????????????????????????????=====R====R===R
1a40, 4c40: ==R===?===?=============?====?==R==?==R=........................
MSB relocations (R): 233
Zero-page relocations (Z): 9
Static bytes (=): 969
Status undetermined (?): 1349
Unused bytes (.): 128
Old zero-page addresses: fb fc fd fe
New zero-page addresses: 80 81 82 83
Verifying relocated subtune 1
Verifying relocated subtune 2
Verifying relocated subtune 3
Bad pitches: 0, 0%
Bad pulse widths: 0, 0%
Relocation successful.
Largest unused region: $4d00-$9fff
Limitations
There are a couple of situations where the algorithm fails.
Unrelated memory
This concerns code which accesses memory completely unrelated to the program
location. For instance, some SID tunes decrunch data into $e000-$ffff. The
relocated tune works great, but it will also access this memory; hence, it
can't coexist with demo effects or other tunes which happen to use the same
area of memory for data. Sidreloc prints warnings about out-of-bounds memory
accesses.
In most cases, such a situation can be resolved by relocating in multiple
steps. For instance, suppose a SID tune loads into pages $10-$1e, and also
makes use of RAM at pages $90-$93. You want to relocate this tune so that it
only affects memory pages in the $e0-$ff range. This is done in two steps: First, you
specify -r 10-1e -p 81 in order to create an
intermediate SID file where the code is located close to the data area.
Then, you specify -r 81-93 -p e0 to move both code and
data to the desired location.
Compressed code
The algorithm assumes that addresses are computed from raw
program bytes, as they appear in the linked code. If code is read from a
compressed stream of bits, there is no way we can add an offset to a set of
program bytes and somehow end up with a compressed bit stream which happens to
decrunch to relocated code.
Subtraction
Subtraction (sbc instruction) can be used in a couple of different ways:
Subtract an offset from an address, producing an address.
Subtract two addresses, producing a difference.
Subtract two numbers, producing a difference.
At the time when we encounter the sbc instruction, we do not know
which case to pick. The current implementation makes the assumption that
case 2 never occurs (the choice among cases 1 and 3 can be
postponed), since this behaviour seems to be good enough in practice.
Data table overflow
Most SID tunes contain a table that maps note numbers to frequencies,
for somewhere around 96 notes
(8 octaves). Occasionally the playroutine ends up using a note number which
is too high. This might be used for sound effects where the pitch doesn't
really matter, or during periods when the voice is playing at zero volume
anyway. However, if an address is stored after the table, the MSB of this
address might get picked up and written to a SID register. When the code is
relocated, the pitch changes. For this reason, sidreloc will verify a relocated
tune even if some pitches are wrong, but the amount of bad pitches will be
reported.
A similar problem concerns pulse widths. Some playroutines store pulse width
information in an instrument data structure, and if an invalid instrument is
used, a pointer MSB might get picked up and used as a pulse width (or pulse
width delta). In most cases this doesn't affect the perceived quality of the
music, because if the composer desired a specific pulse-width (e.g. a
non-modulating square wave), then it would've been very improbable for him or her to
accidentally achieve this with an invalid instrument number.
Often, pulse width is treated as an integrator, meaning that a delta value
is added in each call to the playroutine. When relocation results in a
different delta value, this will make all future pulse width values wrong.
Hence, the reported amount of bad pulse width values is often very high,
even though the music sounds just like the original.
No playroutine
Some SID tunes take over the entire C64, and never return from the
initialisation routine. These tunes tend to use up all CPU cycles and nearly
all RAM. It should be possible to apply the relocation algorithm here too, by
extending the CPU emulation to also include the CIA timers and the raster
position register ($d012), and sample the SID registers at some reasonable
cycle-interval. However, that would be of limited practical use, because such
extreme SID tunes would be very hard to combine with other code anyway.
Evaluation
A script was used to run sidreloc against all (non-BASIC) tunes in
HVSC #56, with all printouts redirected to a log file. Other scripts
were then used to compile statistics from the log file.
Hints
If you're trying to relocate a SID tune, and the tool fails with the default
settings, here are some things you could try:
If the verification fails with a lot of bad pitches or pulse
widths, try the -f option to force sidreloc to output the
file anyway. In many cases there is no audible difference.
If you get warnings about memory accesses outside the relocation
range, but very close to it, then you should extend the relocation
range using the -r option.
Try the -k option if you can live without zero-page
relocation. This reduces the number of equations, potentially removing
the one which was causing trouble.
If -k works, but you need to relocate some of the
zero-page accesses, then you might get away with adjusting the range of
free zero-page addresses (using the -z option) so that
some of the addresses end up at their original location, while others
are moved. In particular, avoid relocating zero-page addresses which
could also be interpreted as single-byte opcodes (e.g. $a8 and $c8).