Parallelogram is a demo running on the Commodore One extender
board, which contains an Altera Cyclone III FPGA and an SDRAM chip. The logic
design was made from scratch, including a homebrew CPU, FM synth and blitter
with pixel shader support. The demo won the wild compo at Revision 2012.
The system is coded in Verilog and compiled using Altera's free toolset
(Quartus Web edition). PLLs, multipliers and memory blocks are instantiated
from within Quartus using so-called megafunctions, but the rest of the project
consists of plain Verilog files edited with Vim. I used gtkwave to simulate
parts of the system when things didn't work, and sometimes that was very
helpful.
The overall architecture is illustrated in the presentation video around the
1 minute mark: The CPU is in control of execution, and accesses the
external memory through a 16 KB cache. Since I have no control over the
initial contents of the SDRAM chip, the demo must be stored somewhere on the
FPGA. I opted for a solution where the cache is preloaded with the demo binary
at boot, marked as dirty. As other memory gets accessed, the demo gets written
"back" into the SDRAM. This limits the demo to 16 KB.
Memory
The SDRAM has a 16-bit bus width, and this property permeates the entire
design. Pixels are stored as a0rrrr0gggg0bbbb, where the a
bit is a generic alpha bit that can be used freely by software. It conveniently
coincides with the sign bit. The point of having zeroes between the fields is
that it simplifies saturated addition of colours.
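To illustrate why the zero bits help, here is a minimal Python sketch of saturated colour addition on this pixel format. The bit positions are read off the a0rrrr0gggg0bbbb description; the function names and the exact clamping trick are assumptions made for illustration, not a description of the actual hardware.

```python
# Assumed bit layout, read off the a0rrrr0gggg0bbbb description:
# bit 15 = alpha, bits 13..10 = red, bits 8..5 = green, bits 3..0 = blue,
# with zero "guard" bits at positions 14, 9 and 4.
COLOUR_MASK = 0x3DEF  # the three 4-bit colour fields
GUARD_MASK = 0x4210   # the zero bit directly above each field


def pack(r, g, b, a=0):
    """Pack 4-bit colour components and the alpha bit into a 16-bit pixel."""
    return (a << 15) | (r << 10) | (g << 5) | b


def saturated_add(p, q):
    """Add two pixels field by field, clamping each field at 15."""
    # A single ordinary addition: any carry out of a field lands in its
    # guard bit instead of spilling into the neighbouring colour.
    total = (p & COLOUR_MASK) + (q & COLOUR_MASK)
    overflow = total & GUARD_MASK
    # Turn each set guard bit into four ones covering the field below it.
    fill = overflow - (overflow >> 4)
    return (total | fill) & COLOUR_MASK
```

For example, `saturated_add(pack(15, 8, 1), pack(3, 8, 1))` clamps red and green at 15 and gives blue 2. This sketch drops the alpha bit from the result.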
There's an embarrassing error in the text at the beginning of the demo,
where it says that only 128 KB of external memory is used. In fact, the
system uses 2 MB (1 megaword) of the SDRAM, which requires 20 address
bits, but the CPU only has direct access to the first 128 KB because
addresses are stored in 16-bit registers. Memory is treated as a rectangular
grid of words, 2048 rows by 512 columns. The blitter uses row/column addressing,
and has access to the entire 2 MB. Frame buffers are 320 by 240 pixels,
and are stored as sub-rectangles occupying columns 0 through 319.
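The address arithmetic can be sketched in a few lines of Python. The grid dimensions come from the text; the mapping of row and column onto a linear word address (column in the low 9 bits) is an assumption, and the function names are made up for this example.

```python
ROWS, COLUMNS = 2048, 512  # grid of 16-bit words, from the text


def word_address(row, column):
    """Linear word address, assuming the column occupies the low 9 bits."""
    if not (0 <= row < ROWS and 0 <= column < COLUMNS):
        raise ValueError("outside the 2 MB grid")
    return row * COLUMNS + column


def cpu_can_reach(row, column):
    """True if the word address fits in a 16-bit CPU register."""
    return word_address(row, column) < (1 << 16)
```

Under this assumption, the last word of the grid has address 2^20 - 1, and the CPU-visible 128 KB corresponds to rows 0 through 127.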
Memory map
(Feel free to skip ahead if you're not interested in this much detail...)
The cache is direct-mapped, which means that memory addresses where the low
bits are identical will compete for the same cache entry. By placing data (e.g.
textures) in columns 320 through 511, it will remain in the cache even when the
frame buffer is accessed.
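A small Python sketch shows why the two column ranges never compete. It assumes that the word address is row * 512 + column, that the cache index is simply the low bits of that address, and that each cache entry holds one word; none of these details are stated above, so treat them as illustrative.

```python
CACHE_WORDS = 16 * 1024 // 2  # a 16 KB cache holding 16-bit words
COLUMNS = 512


def cache_index(row, column):
    """Cache entry for a word, assuming address = row * 512 + column."""
    return (row * COLUMNS + column) % CACHE_WORDS


# Entries touched by a 320x240 frame buffer and by data kept to its right.
frame_entries = {cache_index(r, c) for r in range(240) for c in range(320)}
texture_entries = {cache_index(r, c)
                   for r in range(240) for c in range(320, 512)}
```

Because the cache size is a multiple of the row length, the column survives in the low bits of the index, so `frame_entries` and `texture_entries` are disjoint. Words in the same column that are 16 rows apart do share an entry.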
VGA
The VGA generator consists of a frontend and a backend. The frontend reads
pixels directly from SDRAM and writes them to a FIFO. Since each rasterline is
stored in a single SDRAM row, the entire rasterline can be read in one
burst. Between the lines, the frontend backs off so other parts of the system
can access the memory.
The backend runs in a separate clock domain. At vertical blanking, it sends
an asynchronous signal back to the frontend to trigger a new frame, and then it
reads 320*240 pixels from the FIFO. Each row is stored in a buffer and emitted
twice, since the VGA signal has 480 rows.
The address of the frame buffer is CPU-controlled, and Parallelogram uses triple buffering.
CPU
The CPU was written from scratch. I considered using an existing design, but
it was more fun to do it myself, and I was able to take advantage of the added
flexibility. For instance, at one point the demo was slightly larger than
16 KB, but I could fix this by adding some new instructions and a new
addressing mode in order to make the code compress better.
The CPU is not particularly fast, and it doesn't need to be, because most of
the work is done by the pixel shaders. Hence, it is implemented without
pipelining. There are eight
general purpose 16-bit registers. Other registers include a program counter, a
stack pointer, a 32-bit product register (accessed as a high and a low half)
and status bits (zero and carry). These are accessed using special
instructions.
Starting at address 0, there are three vector instructions, which are
typically relative jumps: Boot, UART and timer. The boot
instruction is executed at boot. The UART instruction is executed (after
pushing the program counter) whenever a byte appears on the debug UART; this
was used to load new code into the running system during development. The timer
instruction gets executed (after pushing the program counter) every 10 ms,
and controls music playback.
This is what the instruction set looks like:
Instructions
Move immediate high (d <- c * 32)
00 ccc ddd ccccc ccc movih d, c
Arithmetic/Logic
01 000 ddd 00000 sss add d, s
01 000 ddd 1cccc-ccc addi d, c
01 001 ddd 00000 sss adc d, s
01 001 ddd 1cccc-ccc adci d, c
01 010 ddd 00000 sss sub d, s
01 010 ddd 1cccc-ccc subi d, c
01 011 ddd 00000 sss and d, s
01 011 ddd 1cccc-ccc andi d, c
01 100 ddd 00000 sss or d, s
01 100 ddd 1cccc-ccc ori d, c
01 101 ddd 00000 sss xor d, s
01 101 ddd 1cccc-ccc xori d, c
01 110 ddd 00000 sss cmp d, s
01 110 ddd 1cccc-ccc cmpi d, c
01 111 ddd 00000 sss mov d, s
01 111 ddd 1cccc-ccc movi d, c
Branch (o = signed offset relative to pc)
10 0 0001 oooooo-ooo bgt label
10 0 0011 oooooo-ooo bne label
10 0 0101 oooooo-ooo bcc,bge label
10 0 1010 oooooo-ooo bcs,blt label
10 0 1100 oooooo-ooo beq label
10 0 1110 oooooo-ooo ble label
10 0 1111 oooooo-ooo bal label
Subroutine call
10 1 0001 oooooo-ooo cgt label
10 1 0011 oooooo-ooo cne label
10 1 0101 oooooo-ooo ccc,cge label
10 1 1010 oooooo-ooo ccs,clt label
10 1 1100 oooooo-ooo ceq label
10 1 1110 oooooo-ooo cle label
10 1 1111 oooooo-ooo cal label
Memory
11 000 ddd ooooo sss ld d, s+o
11 001 ddd ooooo sss st s+o, d
I/O
11 010 ddd 00ppp 000 in d, p
11 011 ddd 00ppp 000 out p, d
Vector jump/call (e = entry in global vector table)
11 100 000 0 eeeeeee jv e
11 101 000 0 eeeeeee cv e
Load effective address (o = unsigned offset relative to pc)
11 101 ddd 1 ooooooo lea d, label
Miscellaneous
11 111 ddd 00000 000 push d
11 111 ddd 00001 000 pop d
11 111 000 00010 000 nop
11 111 ddd 00011 sss mul d, s Store result in special product register
11 111 ddd 00100 000 stsp d Store d into stack pointer
11 111 ddd 00101 000 prod d, s Store s:d in product register
11 111 ddd 00110 000 jr d Jump to address in register
11 111 ddd 00111 000 cr d Call address in register
11 111 000 01000 000 ret
11 111 ddd 01001 000 wait d Wait for status bit (blitter done, vblank...)
11 111 ddd 01010 000 send d Transmit on debug UART
11 111 ddd 01011 000 ldsf d Load d from status flags
11 111 ddd 01100 000 stsf d Store d into status flags
11 111 ddd 01101 000 initv d Set global vector table address
Output ports (p)
000 blitter row
001 blitter column
010 blitter width
011 blitter height + start
100 blitter program
101 active video page [1..3]
110 synth register select
111 synth register data
And here is some example code, which implements signed multiplication,
since the CPU only provides unsigned multiplication.
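muls
; r2 * r3 -> r1:r0
; clobbers product register
mul r2, r3 ; unsigned product into the product register
in r1, 1 ; r1 = high half of the product
mov r0, r2
add r0, r0 ; shift the sign bit of r2 into carry
bcc .muls_1 ; r2 is non-negative: no correction
sub r1, r3 ; r2 is negative: subtract r3 from the high half
.muls_1
mov r0, r3
add r0, r0 ; shift the sign bit of r3 into carry
bcc .muls_2 ; r3 is non-negative: no correction
sub r1, r2 ; r3 is negative: subtract r2 from the high half
.muls_2
in r0, 0 ; r0 = low half of the product
ret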
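The same correction can be written out in Python to check the arithmetic. This is a sketch of the technique, not a model of the CPU; the function name mirrors the assembly routine and the operands are 16-bit two's complement values passed as integers in the range 0..0xFFFF.

```python
def muls(a, b):
    """Signed 16x16 -> 32-bit multiply built on an unsigned multiply.

    Returns the (high, low) halves of the signed product, like r1:r0.
    """
    product = (a * b) & 0xFFFFFFFF  # what an unsigned multiplier produces
    high, low = product >> 16, product & 0xFFFF
    # Reading a negative operand as unsigned adds 2^16 to it, which adds
    # 2^16 times the other operand to the product. Undo that in the high half.
    if a & 0x8000:
        high = (high - b) & 0xFFFF
    if b & 0x8000:
        high = (high - a) & 0xFFFF
    return high, low
```

For instance, `muls(0xFFFF, 2)` (that is, -1 times 2) returns `(0xFFFF, 0xFFFE)`, the two halves of -2.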
The demo is written in assembly language, so I obviously had to write my own
assembler. It's quite limited (for instance, values must be either
numeric constants or labels), but it was sufficient for my purposes.
Shader code, which will be described presently, is inlined with the rest of the
code and handled by the same assembler.
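As an illustration of what such an assembler has to do, here is a minimal Python sketch that encodes the arithmetic/logic instructions from the table above. The bit layout is taken from the table; the function names are hypothetical, and the immediate is stored as a raw 7-bit field because the table does not say how the CPU extends it.

```python
# Opcode numbers for the "01 ooo ..." group, in table order.
ALU_OPS = {"add": 0, "adc": 1, "sub": 2, "and": 3,
           "or": 4, "xor": 5, "cmp": 6, "mov": 7}


def encode_alu(mnemonic, d, s):
    """Encode 'op d, s' as 01 ooo ddd 00000 sss."""
    if not (0 <= d < 8 and 0 <= s < 8):
        raise ValueError("register number out of range")
    return (0b01 << 14) | (ALU_OPS[mnemonic] << 11) | (d << 8) | s


def encode_alu_imm(mnemonic, d, c):
    """Encode 'opi d, c' as 01 ooo ddd 1 ccccccc (raw 7-bit constant)."""
    if not 0 <= d < 8:
        raise ValueError("register number out of range")
    if not 0 <= c < 128:
        raise ValueError("constant does not fit in 7 bits")
    return (0b01 << 14) | (ALU_OPS[mnemonic] << 11) | (d << 8) | 0x80 | c
```

With these definitions, `encode_alu("add", 1, 2)` yields `0x4102`, which is the bit pattern 01 000 001 00000 010.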
First shader running.
Blitter
The blitter is a coprocessor that executes a small shader program for each
pixel in a sub-rectangle of memory. The work is distributed across ten
identical shader cores, thus exploiting the parallel nature of the FPGA.
First, the CPU writes the address of some shader code into output
register 4. This instructs the blitter to start copying the shader from
main memory into local RAM blocks within each of the ten shader cores. The
first word contains the size of the shader, and is followed by that many
longwords (in little endian order) of shader instructions and data. Then, for
any number of rectangles, the CPU loads the row, column, width and height into output
registers 0 through 3, where the final write to register 3
starts the blitter operation. Before each operation, the CPU must ensure that
the blitter has completed the previous job, by waiting on a status bit.
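The sequence can be summarised in a short Python sketch. It only builds the memory image and the list of steps; it does not model the hardware. The port numbers follow the list of output registers given earlier, "little endian" is taken to mean that the low 16-bit word of each longword comes first, and all function names are made up for this example.

```python
def shader_image(longwords):
    """Lay out a shader in 16-bit words: a size word, then low/high halves."""
    words = [len(longwords)]  # size, counted in longwords
    for value in longwords:
        words.append(value & 0xFFFF)          # low word first (assumed)
        words.append((value >> 16) & 0xFFFF)  # then the high word
    return words


# Output port numbers, from the port list.
ROW, COLUMN, WIDTH, HEIGHT_START, PROGRAM = 0, 1, 2, 3, 4


def blit_sequence(shader_address, rectangles):
    """List the steps the CPU takes to run one shader over some rectangles."""
    steps = [("out", PROGRAM, shader_address)]  # start loading the shader
    for row, column, width, height in rectangles:
        steps.append(("wait", "blitter done"))  # previous job must be finished
        steps += [("out", ROW, row), ("out", COLUMN, column),
                  ("out", WIDTH, width),
                  ("out", HEIGHT_START, height)]  # this write starts the blit
    return steps
```

Calling `blit_sequence(0x100, [(0, 0, 320, 240)])` produces one program write, one wait and four register writes, ending with the write to port 3.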
The shader cores deal with 32-bit words (longwords). Each core has a
256-word memory, where execution starts at address 0. The instruction set
has a DSP-like flavour, because each instruction consists of several
sub-instructions that are executed simultaneously. There are eight 32-bit
registers, which are treated as 16.16 fixed-point numbers. Unlike the
CPU registers, these are not general purpose. Registers r0 through r3 receive the
results of simple ALU operations (add, xor etc), r4 and r5 can
be used to hold values (and are primed with the current x and y
coordinates within the blitting rectangle), r6 contains the result of the
latest multiplication and r7 contains the result of the latest shader RAM
access. Of these, registers r0 through r5 keep their value unless it's
explicitly modified by an instruction, whereas r6 and r7 are volatile and get
clobbered unless you use them immediately after assigning them. Expressed in a
different way, registers r6 and r7 get written at every clock cycle, regardless
of whether there's an instruction in the shader assembly code describing what
to put into them.
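For readers unfamiliar with 16.16 fixed point, here is a minimal Python sketch of the number format and of a rescaled multiplication. The helper names are hypothetical, and the rounding behaviour (an arithmetic shift that rounds towards minus infinity) is an assumption; the hardware may treat the low bits differently.

```python
ONE = 1 << 16  # the value 1.0 in 16.16 fixed point


def to_fixed(x):
    """Convert a Python float to a 16.16 fixed-point integer."""
    return int(round(x * ONE))


def wrap32(value):
    """Reduce to a signed 32-bit value, as a 32-bit register would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def fixed_mul(a, b):
    """Multiply two signed 16.16 numbers and rescale the result to 16.16."""
    return wrap32((a * b) >> 16)
```

For example, `fixed_mul(to_fixed(1.5), to_fixed(2.0))` equals `to_fixed(3.0)`.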
Here's the shader instruction set:
Instructions come in two varieties:
: aop rd, ra, rb : mv rd, rs : mul ra, rb : ld ..., ...
1aaaaaaa aaaapppp ppccccrr rrrrrrrr
a = alu op,
000 dd aaa bbb register d becomes a & b
001 dd aaa bbb register d becomes a + b
010 dd aaa bbb register d becomes a - b
011 dd aaa bbb register d becomes a | b
100 dd aaa bbb register d becomes a ^ b
101 dd aaa bbb register d becomes a min b
110 dd aaa bbb register d becomes a max b
111 dd aaa bbb register d is read from global ram at col, row according to registers a, b
p = product op,
aaa bbb register 6 becomes signed fixed-point adjusted product of registers a, b
c = copy op,
0 sss register 4 is read from register s
1 sss register 5 is read from register s
r = ram op,
0 aaaaaasss register 7 is read from shader ram at aaaaaa00 + floor(register s)
10 aaaaaaaa register 7 is read from shader ram at a
11 dddaaaaa register 7 is trashed; register d is written to shader ram at 110aaaaa
: aop rd, ra, rb : endp rr : jsr xyz
0aaaaaaa aaaa---- ---sssss ssssssss
a = alu op, same as before
s = special op,
00000 -------- no operation
00001 -------- terminate with no pixel
00010 -----rrr terminate with pixel according to register r
00100 -------- store sign bits of all registers into rSign
00101 --sssttt r7 <- (rx[t] & 0xffff) ^ (rSign[sss]? 0 : 0xffff)
00110 iiiijjjj add signed integer i to r4 and j to r5
10aaa aaaaarrr jump to a if r >= 0
11aaa aaaaarrr jump to a if r < 0
Execution uses alternating fetch/execute cycles, where the execute part may be stalled when global ram is accessed.
00000000 00000000 00000000 00000000 is a nop instruction.
Here's an example shader for visualising the Julia set:
sh_julia
shader .end
:ld r7, .xmid
:sub r0, r4, r7 :ld r7, .ymid
:sub r1, r5, r7 :ld r7, .scale
:mul r6, r0, r7 :ld r7, .scale
:mov r0, r6 :mul r6, r1, r7 :st $d8, r4
:mov r1, r6 :ld r7, .initcount
:mov r3, r7 :mul r6, r0, r0 :st $d9, r5
:mov r4, r6 :mul r6, r1, r1
:mov r5, r6
.loop
; square z
:mul r6, r0, r1
:add r1, r6, r6
:sub r0, r4, r5 :ld r7, .c_re
; add c
:add r0, r0, r7 :ld r7, .c_im
:add r1, r1, r7 :mul r6, r0, r0
; determine length
:mov r4, r6 :mul r6, r1, r1
:mov r5, r6 :add r2, r4, r6 :ld r7, .limit
:sub r2, r2, r7 :ld r7, .step
:sub r3, r3, r7 :jpos r2, .break
:jpos r3, .loop
.break
:ld r7, .topcount
:sub r1, r3, r7 :ldd r7, .palette, r3
:mov r1, r7 :jpos r1, .bg
:emit r1
.bg
:skip
.xmid long $00a00000
.ymid long $00780000
.c_re long $fffff000
.c_im long $ffff8000
.scale long $00000300
.initcount long $00100000
shalign ; aligns to 4-longword address, for ldd instruction
.topcount long $000f0000
.step long $00010000
.limit long $00040000
long #000 ; the '#' encodes a colour into a longword
.palette
long #000
long #100
long #211
long #322
long #433
long #544
long #655
long #766
long #877
long #988
long #a99
long #baa
long #988
long #766
long #544
.end
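To make the control flow of the shader easier to follow, here is a Python sketch of the per-pixel computation. It uses floating point instead of 16.16 fixed point, so it is an approximation rather than a bit-exact model. The constants are decoded from the data words in the listing; the function name is made up.

```python
X_MID, Y_MID = 160.0, 120.0  # .xmid, .ymid
SCALE = 0x300 / 65536        # .scale
C_RE = -0x1000 / 65536       # .c_re  = $fffff000 as a signed 16.16 value
C_IM = -0x8000 / 65536       # .c_im  = $ffff8000 as a signed 16.16 value
LIMIT = 4.0                  # .limit, compared against |z|^2
INIT_COUNT = 16              # .initcount, decremented by .step = 1


def julia_counter(px, py):
    """Return the loop counter (r3) that the shader ends up with."""
    re = (px - X_MID) * SCALE
    im = (py - Y_MID) * SCALE
    count = INIT_COUNT
    while True:
        # z <- z^2 + c
        re, im = re * re - im * im + C_RE, 2 * re * im + C_IM
        count -= 1
        if re * re + im * im >= LIMIT:  # jpos r2, .break
            break
        if count < 0:                   # jpos r3, .loop not taken
            break
    return count
```

In the shader, the final counter selects a palette entry. A counter of 15 or more (escape on the first iteration) takes the .bg branch and emits no pixel, while a counter of -1 (no escape) presumably picks up the extra colour stored just before .palette.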
A shader produces a single word of output, which gets stored at the
predetermined memory position for which the shader was executed. Alternatively,
the shader may choose to terminate itself without writing to memory. Writing is
done to the external SDRAM directly, bypassing the cache, because in most situations
the blitter will be constructing a frame buffer that will be consumed by the
VGA generator (which also accesses the SDRAM directly), so there's no need to
pollute the cache. However, when reading main memory, the blitter uses
the cache, because many pixel computations typically depend on the same data,
such as textures and the sine table. Sometimes (as in the shadebob effect), a
shader depends on data written by earlier blits. In these situations, the CPU
must invalidate the cache in between the blitter operations, in order to make
the output from earlier blits visible.
Synthesiser
The final part of the logic design is a 16-channel, 4-op FM synthesiser
with resonant low-pass filters on each channel, and a global echo facility.
Each channel is independently controlled using 32 hardware registers,
arranged as follows:
00 osc 0 frequency, low word
01 osc 0 frequency, high word
02 osc 0 gain
03 filter cutoff
04 osc 1 frequency, low word
05 osc 1 frequency, high word
06 osc 1 gain
07 filter resonance
08 osc 2 frequency, low word
09 osc 2 frequency, high word
0a osc 2 gain
0b left fader
0c osc 3 frequency, low word
0d osc 3 frequency, high word
0e osc 3 gain
0f right fader
10 osc 0 amount of modulation from osc 0
11 osc 0 amount of modulation from osc 1
12 osc 0 amount of modulation from osc 2
13 osc 0 amount of modulation from osc 3
14 osc 1 amount of modulation from osc 0
15 osc 1 amount of modulation from osc 1
16 osc 1 amount of modulation from osc 2
17 osc 1 amount of modulation from osc 3
18 osc 2 amount of modulation from osc 0
19 osc 2 amount of modulation from osc 1
1a osc 2 amount of modulation from osc 2
1b osc 2 amount of modulation from osc 3
1c osc 3 amount of modulation from osc 0
1d osc 3 amount of modulation from osc 1
1e osc 3 amount of modulation from osc 2
1f osc 3 amount of modulation from osc 3
Each operator is based on a sine oscillator which is phase modulated by a
weighted sum of the (previous) output of each of the four operators. When an
operator modulates itself, the result is noise. The filter then receives a
weighted sum of the operators as input, and produces a mono output signal,
which is panned and attenuated by two faders (left and right) to produce a
stereo mix.
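The operator structure can be sketched in Python as follows. This is a floating-point illustration under several assumptions: phases are expressed as fractions of a cycle, the modulation amounts are in radians per unit of output, and the filter input weights are taken to be the oscillator gains. The real register scaling is not described here, and the filter and faders are left out.

```python
import math


def fm_sample(phases, previous, modulation, gains):
    """Compute one sample for the four operators of one channel.

    modulation[i][j] is the amount by which operator j modulates operator i,
    matching the layout of registers 10 through 1f.
    Returns (operator outputs, input signal for the filter).
    """
    outputs = []
    for i in range(4):
        # Phase modulation by a weighted sum of the previous outputs.
        pm = sum(modulation[i][j] * previous[j] for j in range(4))
        outputs.append(math.sin(2 * math.pi * phases[i] + pm))
    filter_input = sum(g * o for g, o in zip(gains, outputs))
    return outputs, filter_input


def advance(phases, frequencies, sample_rate=44100):
    """Step each oscillator phase forward by one sample."""
    return [(p + f / sample_rate) % 1.0 for p, f in zip(phases, frequencies)]
```

A caller would alternate `fm_sample` and `advance`, feeding each call's outputs back in as `previous` for the next sample.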
Channels 5 through 15 are connected to the echo buffer. This, as
well as the interrupt rate and hence the tempo of the song, is hardcoded in
the logic design, because there was no need to make it CPU-controllable for the
Parallelogram soundtrack. The echo facility has a small input FIFO and a small
output FIFO, but the bulk of the echo buffer is stored in main memory, which is
accessed by stalling the CPU just before it's about to fetch an instruction.
The left and right parts of the echo output are flipped and mixed into the
final sound signal, as well as fed back into the echo buffer.
The synthesiser, as described above, is only concerned with what goes on at
sample rate (44.1 kHz). The CPU then modifies these parameters
at control rate (100 Hz), in order to implement e.g. envelopes
for the operator modulation parameters. This playroutine also updates
some global variables reflecting the song position, the current bass drum level
and so on, which are then accessed by the visual effects.
C-One hooked up to a UART via an opto isolator.
Toolchain
Apart from the assembler mentioned above, I wrote a tracker which could
emulate the FM synthesiser. This allowed me to compose the music interactively
on my regular computer. Another tool converts the music data into binary data
that can be accessed by the demo, specifically by the playroutine executing
in the timer interrupt.
The assembled demo is compressed by a custom packer, and prepended with
decompression code. This becomes the demo binary, and is used as initial RAM
contents when compiling the FPGA core. However, during development, I didn't
want to recompile the logic design for every little change in the demo
software. After all, recompiling all the Verilog code and mapping it to the
FPGA takes approximately 40 minutes (with ten shader cores and the highest
optimisation settings). Hence, I placed a little bootloader in the UART
interrupt, and wrote a communication tool to send a demo binary over a serial
cable into the chip. The C-One (somewhat surprisingly) does not have a serial
port, so I just attached some wires to the mdb bus which is accessible
from the extender board.
Finally, to get a nice video capture, I designed a communication protocol
for transmitting compressed video frames from within the FPGA over the UART to
the computer, where they get uncompressed and stored as pnm files.
First I ran the demo in realtime, transmitting the current system time
whenever a frame was generated. This gave me a log of which frames were
actually present: it wouldn't be honest to present a video capture with a higher
frame rate than the actual hardware, and besides some of the effects are
stateful and depend on the timing of earlier frames. The demo was then restarted
in a non-realtime mode, where the host requests frames (using the log) and the
demo computes all effects according to the communicated timestamps rather than
the system clock.
Demo code
The demo itself is organised in a pretty straightforward manner. As
mentioned, the first thing that happens is that the code is decompressed. Then,
the synthesiser is initialised and the screen displays a solid blue framebuffer
for a couple of seconds, to allow the monitor to synchronise. Then, the timer
interrupt is enabled, starting music playback. A mainloop reads out the current
song position and advances along a script, where the different parts of the
demo are described using code pointers (there's a song position, a setup
routine, and a per-frame routine).
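The script mechanism can be sketched in Python like this. The real demo uses code pointers in assembly; here the script is a list of tuples holding plain functions, and the name `run_script` and the way song positions are fed in are assumptions made for the example.

```python
def run_script(script, song_positions):
    """Drive the demo parts from a script.

    script: list of (start_position, setup, per_frame), sorted by start.
    song_positions: the song position observed at each rendered frame.
    """
    current = -1  # index of the active part; -1 means nothing started yet
    for position in song_positions:
        # Advance along the script while the next part is due.
        while current + 1 < len(script) and script[current + 1][0] <= position:
            current += 1
            script[current][1]()          # setup routine, once per part
        if current >= 0:
            script[current][2](position)  # per-frame routine
```

Each part's setup routine runs exactly once, when the song position first reaches its start, and its per-frame routine then runs every frame until the next part takes over.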
Most effects calculate some per-frame parameters in the CPU, store the
resulting values right into a shader, load the shader into the blitter, then
blit. There are utility routines for common functionality, such as invalidating
the cache or computing A*sin(B*t+C) where t
is the global time.
Standalone extender board
Since the demo runs entirely on the extender board, the C-One mainboard
isn't necessary. To make the demo platform a bit more portable, I made my own
mainboard replacement. It contains a microcontroller for reading the core image
off an SD card and transmitting it to the FPGA at power-on, and it has a bunch
of discrete components doing digital-to-analogue conversion of the audio and
video signals.
However, the demo is fully C-One compatible, meaning that if you own a C-One you
can simply drop the core file into your machine and run it.
Final words
This project was quite a ride, as it basically involved learning Verilog,
FPGAs and hardware design. I did have some contact with FPGAs during my
engineering education, but in those courses we would just modify existing VHDL
code, and all the tricky parts had already been taken care of. Hardware bugs
are quite different from software bugs, and it was very frustrating and
rewarding to learn about all the gotchas the hard way. Looking back it has been
very enjoyable. Hopefully this will also inspire other people to learn new
skills and to build cool things!
Posted Wednesday 11-apr-2012 22:03
Discuss this page
Anonymous Wed 11-apr-2012 23:31
niiccee. do you plan to release some more information regarding the bitbuf? I want to build one myself.
Anonymous Thu 12-apr-2012 03:27
This is a little off topic, but could you write about the Symbolics keyboard you have?
Anonymous Fri 13-apr-2012 02:15
Awesome stuff!
Anonymous Fri 13-apr-2012 16:41
Super stuff!
/trc_wm
Anonymous Fri 13-apr-2012 16:48
Is your cache write-through or write-back?
Anonymous Fri 13-apr-2012 16:58
Loved it man, congratz!1
Anonymous Sat 14-apr-2012 01:20
Very nice. Also good to see some new c-one content ;-) Now have to dig up my C-one board.
Anonymous Sat 14-apr-2012 13:57
Hello Linus. Can you please make PCB board all in one (compatible with that board using C-One daughter board). And make simple FM music computer. And release FM Tracker as freeware or shareware. Whole Adlibtracker2 (OPL3) community will appreciate it. Tinctu@Gmail.Com
Anonymous Sat 14-apr-2012 14:14
very nice stuff saw it live on Rev but came here to get the soundtrack as always
//bittin
utzig Fabio Utzig Sun 15-apr-2012 03:08
Everytime I watch this it just blows my mind! I left a lot of questions on the comments section of the youtube video which you pretty much answered all here.
One thing which is not very clear is why you chose Verilog and not VHDL. You said you used VHDL (somewhat) at university so it seems to be a reasonable choice for me. I personally quite like VHDL and never used Verilog.
Also a very general question: how have you approached learning Verilog? Did you use books, sites, irc, whatever? Which ones?
I'm from Brazil and differently from Europe, especially Germany/Sweden, there's no demoscene (groups) here. I'll say it's even hard to find people who ever heard about it at all. I personally had an Amiga in the early 90s and that's the reason that at least I know about it. One thing which always impressed me is how cool this effects are and I have no idea how you learn to program them. I remember there used to be a site hornet.org or something like that has lots of tutorials. Can you please give some points to specific learning resources?
Could you have used --update_mif in Quartus to update your RAM contents instead of recompiling the whole project?
Anonymous Wed 18-apr-2012 17:02
Awesome stuff bud. /Alfatech
Anonymous Fri 11-may-2012 14:54
QUOTE: "However, during development, I didn't want to recompile the logic design for every little change in the demo software. After all, recompiling all the Verilog code and mapping it to the FPGA takes approximately 40 minutes (with ten shader cores and the highest optimisation settings). Hence, I placed a little bootloader in the UART interrupt, and wrote a communication tool to send a demo binary over a serial cable into the chip."
Xilinx have "data2mem" for exactly this reason, but Altera is (was?) lacking in this regard.
Have you tried the following: quartus_cdb --update_mif
More at: http://dbaspot.com/arch/385565-modify-pof-new-esb-rom-content-print.html
Anonymous Fri 11-may-2012 15:00
Oh, and the Symbolics keyboard - be sure to join http://deskthority.net if you haven't already.
Anonymous Mon 21-may-2012 00:29
This is awesome! One question: For how long were you working on this project?
Anonymous Tue 19-jun-2012 06:17
Very impressed. Thanks for posting this.
Anonymous Sun 4-nov-2012 10:47
There's only one word to describe you. Genius.
Anonymous Fri 9-nov-2012 17:34
We want the tracker and the OST in original format! Please... *w*
MP3 render is awful and full of random clicks and noises.
gbraad Gerard Braad Mon 24-dec-2012 05:29
would love to have the schematics for the mini-board and source files. I think they can be helpful for the C-one and Turbo Chameleon community.
Anonymous Mon 24-dec-2012 08:47
utzig wrote: "One thing which is not very clear is why you chose Verilog and not VHDL."
Verilog is easier and quicker to get results...