Parallelogram

Parallelogram is a demo running on the Commodore One extender board, which contains an Altera Cyclone III FPGA and an SDRAM chip. The logic design was made from scratch, including a homebrew CPU, FM synth and blitter with pixel shader support. The demo won the wild compo at Revision 2012.

Download

The demo also has a pouët page, of course.

Custom logic

The system is coded in Verilog and compiled using Altera's free toolset (Quartus Web Edition). PLLs, multipliers and memory blocks are instantiated from within Quartus using so-called megafunctions, but the rest of the project consists of plain Verilog files edited with Vim. I used gtkwave to simulate parts of the system when things didn't work, and sometimes that was very helpful.

The overall architecture is illustrated in the presentation video around the 1 minute mark: The CPU is in control of execution, and accesses the external memory through a 16 KB cache. Since I have no control over the initial contents of the SDRAM chip, the demo must be stored somewhere on the FPGA. I opted for a solution where the cache is preloaded with the demo binary at boot, marked as dirty. As other memory gets accessed, the demo gets written "back" into the SDRAM. This limits the demo to 16 KB.
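
The boot trick can be modelled in a few lines of Python. This is only a sketch under assumptions that are not stated above: the cache is treated as direct-mapped with one-word lines, and the class name PreloadedCache is made up for illustration. It shows how lines that start out valid and dirty end up in SDRAM the first time they are evicted.

```python
CACHE_WORDS = 8192  # 16 KB of 16-bit words (one-word cache lines are an assumption)


class PreloadedCache:
    """Toy direct-mapped write-back cache whose lines start out valid and dirty."""

    def __init__(self, image, sdram):
        assert len(image) <= CACHE_WORDS
        self.sdram = sdram
        self.data = list(image) + [0] * (CACHE_WORDS - len(image))
        self.tag = [0] * CACHE_WORDS        # every line claims to belong to tag 0
        self.dirty = [True] * CACHE_WORDS   # and claims to have been modified

    def read(self, addr):
        index, tag = addr % CACHE_WORDS, addr // CACHE_WORDS
        if self.tag[index] != tag:
            if self.dirty[index]:
                # Evicting a preloaded line writes the demo "back" into SDRAM.
                old_addr = self.tag[index] * CACHE_WORDS + index
                self.sdram[old_addr] = self.data[index]
            self.data[index] = self.sdram[addr]
            self.tag[index], self.dirty[index] = tag, False
        return self.data[index]


sdram = [0xDEAD] * (4 * CACHE_WORDS)   # stands in for unknown power-on contents
cache = PreloadedCache([0x1111, 0x2222, 0x3333], sdram)
first = cache.read(1)                  # hit: served from the preloaded image
cache.read(CACHE_WORDS + 1)            # conflict: forces a write-back of word 1
```

After the conflicting access, word 1 of the image lives in the simulated SDRAM, while word 0 is still held only in the cache.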

Memory

The SDRAM has a 16-bit bus width, and this property permeates the entire design. Pixels are stored as a0rrrr0gggg0bbbb, where the a bit is a generic alpha bit that can be used freely by software. It conveniently coincides with the sign bit. The point of having zeroes between the fields is that it simplifies saturated addition of colours.
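
To see why the zero bits help, here is one possible way to do the saturated addition, written as a Python sketch. It is not taken from the logic design; the function names and the exact bit trick are assumptions. Each zero bit acts as a guard that catches the carry out of the 4-bit field below it, so all three fields can be added with a single addition and clamped afterwards.

```python
COLOUR_MASK = 0b0011110111101111   # the rrrr, gggg and bbbb fields
GUARD_MASK = 0b0100001000010000    # the zero bit directly above each field


def pack(r, g, b, a=0):
    """Build an a0rrrr0gggg0bbbb pixel from 4-bit components."""
    return (a << 15) | (r << 10) | (g << 5) | b


def saturated_add(p, q):
    """Add two pixels field by field, clamping each field at 15."""
    total = (p & COLOUR_MASK) + (q & COLOUR_MASK)
    overflow = total & GUARD_MASK        # guard bit set means that field overflowed
    fill = overflow - (overflow >> 4)    # four ones below every set guard bit
    return (total | fill) & COLOUR_MASK  # alpha is dropped in this sketch
```

For example, saturated_add(pack(15, 0, 8), pack(1, 3, 8)) clamps red and blue at 15 and leaves green at 3.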

There's an embarrassing error in the text at the beginning of the demo, where it says that only 128 KB of external memory is used. In fact, the system uses 2 MB (1 megaword) of the SDRAM, which requires 20 address bits, but the CPU only has direct access to the first 128 KB because addresses are stored in 16-bit registers. Memory is treated as a rectangular grid of words, 2048 rows by 512 columns. The blitter uses row/column addressing, and has access to the entire 2 MB. Frame buffers are 320 by 240 pixels, and are stored as sub-rectangles occupying columns 0 through 319.
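
The relation between row/column coordinates and CPU addresses can be sketched as follows. The linear mapping row * 512 + column is inferred from the CPU Address column of the memory map further down (row $010 corresponds to $2000); the function names are illustrative only.

```python
COLUMNS = 512
ROWS = 2048


def linear_address(row, column):
    """Word address of a grid cell, assuming rows are stored back to back."""
    assert 0 <= row < ROWS and 0 <= column < COLUMNS
    return row * COLUMNS + column


def cpu_can_reach(row, column):
    """True if the word lies within the 64 kilowords a 16-bit register can address."""
    return linear_address(row, column) < (1 << 16)
```

Under this mapping the CPU sees rows $000 through $07f, and the last word of the grid has address 2^20 - 1, which matches the 20 address bits mentioned above.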

Memory map

(Feel free to skip ahead if you're not interested in this much detail...)

char in map = 8x16 pixels (words)

C = cpu memory with preloaded contents
c = unpacked executable
f = upper half is 64-character 8x8 font
s =
        $70 sine table
        $71 freq table
        $72 channel data
        $73 synth register copy
        $74 constant random table
        $75 raster bar table
        $7e stack
        $7f stack
1 = video frame buffer 1
2 = video frame buffer 2
3 = video frame buffer 3
w = workspace frame buffer (for post fx)
d, u, v = free memory for effect data
        e.g. smoke (density, x-vel, y-vel), front and back, 256x242
m = 32x32 texture map
e = echo buffer
. = kept zero at all times

 0                                       320     384          511  Row  CPU Address
 ----------------------------------------------------------------
|CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC| 000  0000
|                                                                | 010  2000
|cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 020  4000
|cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 030  6000
|cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 040  8000
|                                                                | 050  a000
|ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff| 060  c000
|ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss| 070  e000
|                                                                | 080
|                                                                | 090
|                                                            mmmm| 0a0
|                                                            mmmm| 0b0
|                                                                | 0c0
|                                                                | 0d0
|                                                                | 0e0
|.........................................                      .| 0f0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 100
|1111111111111111111111111111111111111111.       eeeeeeee       .| 110
|1111111111111111111111111111111111111111.       eeeeeeee       .| 120
|1111111111111111111111111111111111111111.       eeeeeeee       .| 130
|1111111111111111111111111111111111111111.       eeeeeeee       .| 140
|1111111111111111111111111111111111111111.       eeeeeeee       .| 150
|1111111111111111111111111111111111111111.       eeeeeeee       .| 160
|1111111111111111111111111111111111111111.       eeeeeeee       .| 170
|1111111111111111111111111111111111111111.       eeeeeeee       .| 180
|1111111111111111111111111111111111111111.       eeeeeeee       .| 190
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1a0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1b0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1c0
|1111111111111111111111111111111111111111.       eeeeeeee       .| 1d0
|.........................................       eeeeeeee       .| 1e0
|.........................................       eeeeeeee       .| 1f0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 200
|2222222222222222222222222222222222222222.       eeeeeeee       .| 210
|2222222222222222222222222222222222222222.       eeeeeeee       .| 220
|2222222222222222222222222222222222222222.       eeeeeeee       .| 230
|2222222222222222222222222222222222222222.       eeeeeeee       .| 240
|2222222222222222222222222222222222222222.       eeeeeeee       .| 250
|2222222222222222222222222222222222222222.       eeeeeeee       .| 260
|2222222222222222222222222222222222222222.       eeeeeeee       .| 270
|2222222222222222222222222222222222222222.       eeeeeeee       .| 280
|2222222222222222222222222222222222222222.       eeeeeeee       .| 290
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2a0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2b0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2c0
|2222222222222222222222222222222222222222.       eeeeeeee       .| 2d0
|.........................................       eeeeeeee       .| 2e0
|.........................................       eeeeeeee       .| 2f0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 300
|3333333333333333333333333333333333333333.       eeeeeeee       .| 310
|3333333333333333333333333333333333333333.       eeeeeeee       .| 320
|3333333333333333333333333333333333333333.       eeeeeeee       .| 330
|3333333333333333333333333333333333333333.       eeeeeeee       .| 340
|3333333333333333333333333333333333333333.       eeeeeeee       .| 350
|3333333333333333333333333333333333333333.       eeeeeeee       .| 360
|3333333333333333333333333333333333333333.       eeeeeeee       .| 370
|3333333333333333333333333333333333333333.       eeeeeeee       .| 380
|3333333333333333333333333333333333333333.       eeeeeeee       .| 390
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3a0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3b0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3c0
|3333333333333333333333333333333333333333.       eeeeeeee       .| 3d0
|.........................................       eeeeeeee       .| 3e0
|.........................................       eeeeeeee       .| 3f0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 400
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 410
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 420
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 430
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 440
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 450
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 460
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 470
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 480
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 490
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4a0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4b0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4c0
|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4d0
|.........................................       eeeeeeee       .| 4e0
|                                                eeeeeeee        | 4f0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 500
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 510
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 520
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 530
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 540
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 550
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 560
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 570
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 580
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 590
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5a0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5b0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5c0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5d0
|ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5e0
|                                                                | 5f0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 600
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 610
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 620
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 630
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 640
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 650
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 660
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 670
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 680
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 690
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6a0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6b0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6c0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6d0
|uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6e0
|                                                                | 6f0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 700
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 710
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 720
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 730
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 740
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 750
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 760
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 770
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 780
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 790
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7a0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7b0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7c0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7d0
|vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7e0
|                                                                | 7f0
 ----------------------------------------------------------------

The cache is direct-mapped, which means that memory addresses where the low bits are identical will compete for the same cache entry. Data (e.g. textures) placed in columns 320 through 511 will therefore remain in the cache even when the frame buffer is accessed.
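
A quick way to convince oneself of this is to compute the cache slots that each region maps to. The sketch below assumes that the cache index is simply the low 13 bits of the word address (16 KB = 8192 words) and that the address is row * 512 + column; neither detail is spelled out above, so treat the code as an illustration.

```python
CACHE_WORDS = 16 * 1024 // 2   # 16 KB of 16-bit words
COLUMNS = 512


def cache_index(row, column):
    """Cache slot for a grid cell, assuming the index is the low address bits."""
    return (row * COLUMNS + column) % CACHE_WORDS


# Frame buffer 1 (rows $100-$1ef, columns 0-319) and the texture map
# (rows $0a0-$0bf, columns 480-511), as laid out in the memory map.
framebuffer = {cache_index(r, c) for r in range(0x100, 0x1F0) for c in range(320)}
texture = {cache_index(r, c) for r in range(0xA0, 0xC0) for c in range(480, 512)}
```

Because the column number sits in the low nine bits of the address, the two sets of slots cannot overlap, whereas rows that are 16 apart in the same column do share a slot.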

VGA

The VGA generator consists of a frontend and a backend. The frontend reads pixels directly from SDRAM and writes them to a FIFO. Since each rasterline is stored in a single SDRAM row, the entire rasterline can be read in one burst. Between the lines, the frontend backs off so other parts of the system can access the memory.

The backend runs in a separate clock domain. At vertical blanking, it sends an asynchronous signal back to the frontend to trigger a new frame, and then it reads 320*240 pixels from the FIFO. Each row is stored in a buffer and emitted twice, since the VGA signal has 480 rows.

The address of the frame buffer is CPU-controlled, and Parallelogram uses triple buffering.

CPU

The CPU was written from scratch. I considered using an existing design, but it was more fun to do it myself, and I was able to take advantage of the added flexibility. For instance, at one point the demo was slightly larger than 16 KB, but I could fix this by adding some new instructions and a new addressing mode in order to make the code compress better.

The CPU is not particularly fast, because most of the work is done by the pixel shaders. Hence, it is implemented without pipelining. There are eight general purpose 16-bit registers. Other registers include a program counter, a stack pointer, a 32-bit product register (accessed as a high and a low half) and status bits (zero and carry). These are accessed using special instructions.

Starting at address 0, there are three vector instructions, which are typically relative jumps: Boot, UART and timer. The boot instruction is executed at boot. The UART instruction is executed (after pushing the program counter) whenever a byte appears on the debug UART; this was used to load new code into the running system during development. The timer instruction gets executed (after pushing the program counter) every 10 ms, and controls music playback.

This is what the instruction set looks like:

Instructions

Move immediate high (d <- c * 32)
00 ccc ddd ccccc ccc    movih   d, c

Arithmetic/Logic
01 000 ddd 00000 sss    add     d, s
01 000 ddd 1cccc-ccc    addi    d, c
01 001 ddd 00000 sss    adc     d, s
01 001 ddd 1cccc-ccc    adci    d, c
01 010 ddd 00000 sss    sub     d, s
01 010 ddd 1cccc-ccc    subi    d, c
01 011 ddd 00000 sss    and     d, s
01 011 ddd 1cccc-ccc    andi    d, c
01 100 ddd 00000 sss    or      d, s
01 100 ddd 1cccc-ccc    ori     d, c
01 101 ddd 00000 sss    xor     d, s
01 101 ddd 1cccc-ccc    xori    d, c
01 110 ddd 00000 sss    cmp     d, s
01 110 ddd 1cccc-ccc    cmpi    d, c
01 111 ddd 00000 sss    mov     d, s
01 111 ddd 1cccc-ccc    movi    d, c

Branch (o = signed offset relative pc)
10 0 0001 oooooo-ooo    bgt     label
10 0 0011 oooooo-ooo    bne     label
10 0 0101 oooooo-ooo    bcc,bge label
10 0 1010 oooooo-ooo    bcs,blt label
10 0 1100 oooooo-ooo    beq     label
10 0 1110 oooooo-ooo    ble     label
10 0 1111 oooooo-ooo    bal     label

Subroutine call
10 1 0001 oooooo-ooo    cgt     label
10 1 0011 oooooo-ooo    cne     label
10 1 0101 oooooo-ooo    ccc,cge label
10 1 1010 oooooo-ooo    ccs,clt label
10 1 1100 oooooo-ooo    ceq     label
10 1 1110 oooooo-ooo    cle     label
10 1 1111 oooooo-ooo    cal     label

Memory
11 000 ddd ooooo sss    ld      d, s+o
11 001 ddd ooooo sss    st      s+o, d

I/O
11 010 ddd 00ppp 000    in      d, p
11 011 ddd 00ppp 000    out     p, d

Vector jump/call (e = entry in global vector table)
11 100 000 0 eeeeeee    jv      e
11 101 000 0 eeeeeee    cv      e

Load effective address (o = unsigned offset relative pc)
11 101 ddd 1 ooooooo    lea     d, label

Miscellaneous
11 111 ddd 00000 000    push    d
11 111 ddd 00001 000    pop     d
11 111 000 00010 000    nop
11 111 ddd 00011 sss    mul     d, s    Store result in special product register
11 111 ddd 00100 000    stsp    d       Store d into stack pointer
11 111 ddd 00101 000    prod    d, s    Store s:d in product register
11 111 ddd 00110 000    jr      d       Jump to address in register
11 111 ddd 00111 000    cr      d       Call address in register
11 111 000 01000 000    ret
11 111 ddd 01001 000    wait    d       Wait for status bit (blitter done, vblank...)
11 111 ddd 01010 000    send    d       Transmit on debug UART
11 111 ddd 01011 000    ldsf    d       Load d from status flags
11 111 ddd 01100 000    stsf    d       Store d into status flags
11 111 ddd 01101 000    initv   d       Set global vector table address

Input ports

000 product, low half
001 product, high half
010 status flags (blitter done, vblank...)
011 uart receive buffer
100 frame counter (global time)
101 benchmark timer

Output ports

000 blitter row
001 blitter column
010 blitter width
011 blitter height + start
100 blitter program
101 active video page [1..3]
110 synth register select
111 synth register data
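
To make the bit layouts concrete, here is a small Python encoder for three of the instruction formats listed above. It is a sketch, not the actual assembler: the function names are invented, and the 7-bit immediate is treated as a raw field because the table does not say whether it is signed.

```python
ALU_OPS = ["add", "adc", "sub", "and", "or", "xor", "cmp", "mov"]


def encode_alu_reg(op, d, s):
    """01 ooo ddd 00000 sss"""
    assert 0 <= d < 8 and 0 <= s < 8
    return (0b01 << 14) | (ALU_OPS.index(op) << 11) | (d << 8) | s


def encode_alu_imm(op, d, c):
    """01 ooo ddd 1 ccccccc (the addi, movi, ... forms)"""
    assert 0 <= d < 8 and 0 <= c < 128
    return (0b01 << 14) | (ALU_OPS.index(op) << 11) | (d << 8) | 0x80 | c


def encode_ld(d, s, offset):
    """11 000 ddd ooooo sss"""
    assert 0 <= d < 8 and 0 <= s < 8 and 0 <= offset < 32
    return (0b11 << 14) | (d << 8) | (offset << 3) | s
```

With these definitions, mov r0, r2 encodes as 0x7802.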

And here is some example code, which implements signed multiplication — the CPU only provides unsigned multiplication.

muls
                ; r2 * r3 -> r1:r0
                ; clobbers product register

                mul     r2, r3
                in      r1, 1

                mov     r0, r2
                add     r0, r0
                bcc     .muls_1
                sub     r1, r3
.muls_1
                mov     r0, r3
                add     r0, r0
                bcc     .muls_2
                sub     r1, r2
.muls_2
                in      r0, 0
                ret
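
The correction steps in muls can be checked with a short Python model. The register names follow the listing; the helper as_signed is an addition for the sake of the check. Treating a negative 16-bit value as unsigned adds 2^16 to it, which inflates the high word of the product by the other operand, so subtracting that operand from the high word undoes the error.

```python
MASK16 = 0xFFFF


def muls(r2, r3):
    """Mirror of the muls routine; returns (r1, r0) = (high word, low word)."""
    product = (r2 * r3) & 0xFFFFFFFF   # mul: unsigned 16 x 16 -> 32 bits
    r1 = product >> 16                 # in r1, 1
    if r2 & 0x8000:                    # add r0, r0 sets carry when r2 is negative
        r1 = (r1 - r3) & MASK16
    if r3 & 0x8000:                    # same check for r3
        r1 = (r1 - r2) & MASK16
    r0 = product & MASK16              # in r0, 0
    return r1, r0


def as_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value
```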

The demo is written in assembly language, so I obviously had to write my own assembler. It's quite limited — for instance, values must be either numeric constants or labels — but it was sufficient for my purposes. Shader code, which will be described presently, is inlined with the rest of the code and handled by the same assembler.

First shader running.

Blitter

The blitter is a coprocessor that executes a small shader program for each pixel in a sub-rectangle of memory. The work is distributed across ten identical shader cores, thus exploiting the parallel nature of the FPGA.

First, the CPU writes the address of some shader code into output register 4. This instructs the blitter to start copying the shader from main memory into local RAM blocks within each of the ten shader cores. The first word contains the size of the shader, and is followed by that many longwords (in little endian order) of shader instructions and data. Then, for any number of rectangles, the CPU loads the row, column, width and height into output registers 0 through 3, where the final write to register 3 starts the blitter operation. Before each operation, the CPU must ensure that the blitter has completed the previous job, by waiting on a status bit.
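
The in-memory shader format described above might be produced like this. The sketch assumes that "little endian order" means the low 16-bit half of each longword comes first, and the function name pack_shader is hypothetical.

```python
def pack_shader(longwords):
    """Return the 16-bit words of a shader image: a size word, then the longwords."""
    assert len(longwords) <= 256              # each core has a 256-longword memory
    words = [len(longwords)]
    for value in longwords:
        words.append(value & 0xFFFF)          # low half first (assumed ordering)
        words.append((value >> 16) & 0xFFFF)  # then the high half
    return words
```

A shader of n longwords thus occupies 2n + 1 words of main memory.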

The shader cores deal with 32-bit words (longwords). Each core has a 256-word memory, where execution starts at address 0. The instruction set has a DSP-like flavour, because each instruction consists of several sub-instructions that are executed simultaneously. There are eight 32-bit registers, which are treated as 16.16 fixed-point numbers. Unlike the CPU registers, these are not general purpose. Registers r0 through r3 receive the results of simple ALU operations (add, xor etc.), r4 and r5 can be used to hold values (and are primed with the current x and y coordinates within the blitting rectangle), r6 contains the result of the latest multiplication and r7 contains the result of the latest shader RAM access. Of these, registers r0 through r5 keep their values unless explicitly modified by an instruction, whereas r6 and r7 are volatile and get clobbered unless you use them immediately after assigning them. Put differently, registers r6 and r7 get written at every clock cycle, regardless of whether there's an instruction in the shader assembly code describing what to put into them.
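
For readers unfamiliar with 16.16 arithmetic, the following Python sketch shows the conversions and a signed multiply with the result shifted back into 16.16 format. The helper names are made up, and truncating the low bits (rather than rounding) is an assumption about what the hardware does.

```python
def to_fix(x):
    """Convert a float to a 32-bit 16.16 bit pattern."""
    return int(round(x * 65536)) & 0xFFFFFFFF


def signed32(v):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return v - (1 << 32) if v & 0x80000000 else v


def to_float(v):
    """Convert a 16.16 bit pattern back to a float."""
    return signed32(v) / 65536


def fixmul(a, b):
    """Signed 16.16 multiply: full product, then drop the 16 extra fraction bits."""
    return ((signed32(a) * signed32(b)) >> 16) & 0xFFFFFFFF
```

In this representation $00a00000 is 160.0 and $ffff8000 is -0.5.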

Here's the shader instruction set:

Instructions come in two varieties:

: aop rd, ra, rb : mv rd, rs : mul ra, rb : ld ..., ...

1aaaaaaa aaaapppp ppccccrr rrrrrrrr

a = alu op,
        000 dd aaa bbb          register d becomes a & b
        001 dd aaa bbb          register d becomes a + b
        010 dd aaa bbb          register d becomes a - b
        011 dd aaa bbb          register d becomes a | b
        100 dd aaa bbb          register d becomes a ^ b
        101 dd aaa bbb          register d becomes a min b
        110 dd aaa bbb          register d becomes a max b
        111 dd aaa bbb          register d is read from global ram at
                                  col, row according to registers a, b

p = product op,
        aaa bbb                 register 6 becomes signed fixed-point
                                  adjusted product of registers a, b

c = copy op,
        0 sss                   register 4 is read from register s
        1 sss                   register 5 is read from register s

r = ram op,
        0 aaaaaasss             register 7 is read from shader ram at
                                  aaaaaa00 + floor(register s)
        10 aaaaaaaa             register 7 is read from shader ram at a
        11 dddaaaaa             register 7 is trashed; register d is
                                  written to shader ram at 110aaaaa

: aop rd, ra, rb : endp rr : jsr xyz

0aaaaaaa aaaa---- ---sssss ssssssss

a = alu op, same as before

s = special op,
        00000 --------          no operation
        00001 --------          terminate with no pixel
        00010 -----rrr          terminate with pixel according to register r
        00100 --------          store sign bits of all registers into rSign
        00101 --sssttt          r7 <- (rx[t] & 0xffff) ^ (rSign[sss]? 0 : 0xffff)
        00110 iiiijjjj          add signed integer i to r4 and j to r5
        10aaa aaaaarrr          jump to a if r >= 0
        11aaa aaaaarrr          jump to a if r < 0

Execution uses alternating fetch/execute cycles, where the
execute part may be stalled when global ram is accessed.

00000000 00000000 00000000 00000000 is a nop instruction.

Here's an example shader for visualising the Julia set:

sh_julia
                shader  .end

                :ld     r7, .xmid
                :sub    r0, r4, r7      :ld     r7, .ymid
                :sub    r1, r5, r7      :ld     r7, .scale
                :mul    r6, r0, r7      :ld     r7, .scale
                :mov    r0, r6          :mul    r6, r1, r7      :st     $d8, r4
                :mov    r1, r6          :ld     r7, .initcount
                :mov    r3, r7          :mul    r6, r0, r0      :st     $d9, r5
                :mov    r4, r6          :mul    r6, r1, r1
                :mov    r5, r6
.loop
                ; square z

                :mul    r6, r0, r1
                :add    r1, r6, r6
                :sub    r0, r4, r5      :ld     r7, .c_re

                ; add c

                :add    r0, r0, r7      :ld     r7, .c_im
                :add    r1, r1, r7      :mul    r6, r0, r0

                ; determine length

                :mov    r4, r6          :mul    r6, r1, r1
                :mov    r5, r6          :add    r2, r4, r6      :ld     r7, .limit
                :sub    r2, r2, r7      :ld     r7, .step
                :sub    r3, r3, r7      :jpos   r2, .break

                :jpos   r3, .loop
.break
                :ld     r7, .topcount
                :sub    r1, r3, r7      :ldd    r7, .palette, r3
                :mov    r1, r7          :jpos   r1, .bg
                :emit   r1
.bg
                :skip

.xmid           long    $00a00000
.ymid           long    $00780000
.c_re           long    $fffff000
.c_im           long    $ffff8000
.scale          long    $00000300
.initcount      long    $00100000
                shalign                 ; aligns to 4-longword address, for ldd instruction
.topcount       long    $000f0000
.step           long    $00010000
.limit          long    $00040000
                long    #000            ; the '#' encodes a colour into a longword
.palette
                long    #000
                long    #100
                long    #211
                long    #322
                long    #433
                long    #544
                long    #655
                long    #766
                long    #877
                long    #988
                long    #a99
                long    #baa
                long    #988
                long    #766
                long    #544
.end
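
For comparison, here is a floating-point Python model of what the shader appears to compute for one pixel. It is an approximation based on reading the listing, not a bit-exact emulation: the constants are the 16.16 values from the data section converted to floats, and the function name is invented.

```python
XMID, YMID = 160.0, 120.0   # .xmid $00a00000, .ymid $00780000
SCALE = 0x300 / 65536       # .scale $00000300
C_RE = -0x1000 / 65536      # .c_re $fffff000
C_IM = -0x8000 / 65536      # .c_im $ffff8000
LIMIT = 4.0                 # .limit $00040000
INITCOUNT = 16              # .initcount $00100000, with .step equal to 1.0


def julia_count(px, py):
    """Return the value left in the counter when the loop exits (-1 to 15)."""
    re = (px - XMID) * SCALE
    im = (py - YMID) * SCALE
    count = INITCOUNT
    while True:
        re, im = re * re - im * im + C_RE, 2 * re * im + C_IM   # z <- z^2 + c
        count -= 1
        if re * re + im * im - LIMIT >= 0:   # jpos r2, .break
            break
        if count < 0:                        # loop only while jpos r3, .loop holds
            break
    return count
```

In this reading, the counter is what indexes the palette. A value of -1 would select the extra black entry placed just before .palette, which may be why that entry is there.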

A shader produces a single word of output, which gets stored at the predetermined memory position for which the shader was executed. Alternatively, the shader may choose to terminate itself without writing to memory. Writing is done to the external SDRAM directly, bypassing the cache, because in most situations the blitter will be constructing a frame buffer that will be consumed by the VGA generator (which also accesses the SDRAM directly), so there's no need to pollute the cache. However, when reading main memory, the blitter uses the cache, because many pixel computations typically depend on the same data, such as textures and the sine table. Sometimes (as in the shadebob effect), a shader depends on data written by earlier blits. In these situations, the CPU must invalidate the cache in between the blitter operations, in order to make the output from earlier blits visible.
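
The need for the invalidation can be shown with a toy model in Python. Everything here is illustrative: the class is hypothetical, the cache is a plain dictionary, and dirty lines are ignored.

```python
class CachedMemory:
    """Toy memory where blitter writes bypass the cache but reads go through it."""

    def __init__(self, size):
        self.sdram = [0] * size
        self.cache = {}                  # address -> cached word

    def blitter_write(self, addr, value):
        self.sdram[addr] = value         # straight to SDRAM; the cache is not updated

    def blitter_read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.sdram[addr]
        return self.cache[addr]

    def invalidate(self):
        self.cache.clear()


mem = CachedMemory(16)
mem.blitter_read(3)            # first blit reads the word, which is now cached
mem.blitter_write(3, 7)        # a later blit writes a new value to SDRAM
stale = mem.blitter_read(3)    # still returns the old cached value
mem.invalidate()
fresh = mem.blitter_read(3)    # now returns the value written by the blit
```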

Synthesiser

The final part of the logic design is a 16-channel, 4-op FM synthesiser with resonant low-pass filters on each channel, and a global echo facility. Each channel is independently controlled using 32 hardware registers, arranged as follows:

00      osc 0 frequency, low word
01      osc 0 frequency, high word
02      osc 0 gain
03      filter cutoff
04      osc 1 frequency, low word
05      osc 1 frequency, high word
06      osc 1 gain
07      filter resonance
08      osc 2 frequency, low word
09      osc 2 frequency, high word
0a      osc 2 gain
0b      left fader
0c      osc 3 frequency, low word
0d      osc 3 frequency, high word
0e      osc 3 gain
0f      right fader
10      osc 0 amount of modulation from osc 0
11      osc 0 amount of modulation from osc 1
12      osc 0 amount of modulation from osc 2
13      osc 0 amount of modulation from osc 3
14      osc 1 amount of modulation from osc 0
15      osc 1 amount of modulation from osc 1
16      osc 1 amount of modulation from osc 2
17      osc 1 amount of modulation from osc 3
18      osc 2 amount of modulation from osc 0
19      osc 2 amount of modulation from osc 1
1a      osc 2 amount of modulation from osc 2
1b      osc 2 amount of modulation from osc 3
1c      osc 3 amount of modulation from osc 0
1d      osc 3 amount of modulation from osc 1
1e      osc 3 amount of modulation from osc 2
1f      osc 3 amount of modulation from osc 3

Each operator is based on a sine oscillator which is phase modulated by a weighted sum of the (previous) output of each of the four operators. When an operator modulates itself, the result is noise. The filter then receives a weighted sum of the operators as input, and produces a mono output signal, which is panned and attenuated by two faders (left and right) to produce a stereo mix.
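
The operator structure can be sketched in floating-point Python as follows. This is a simplified model, not the hardware algorithm: the filter and faders are left out, the modulation amounts are taken to be in radians per unit of operator output, and the function and parameter names are assumptions.

```python
import math


def render_channel(freqs, gains, mod, samples, rate=44100):
    """Render one channel: four sine operators, each phase modulated by a
    weighted sum of the previous outputs of all four (mod[i][j] = j into i)."""
    phase = [0.0] * 4
    prev = [0.0] * 4
    out = []
    for _ in range(samples):
        cur = []
        for i in range(4):
            modulation = sum(mod[i][j] * prev[j] for j in range(4))
            cur.append(math.sin(2 * math.pi * phase[i] + modulation))
            phase[i] = (phase[i] + freqs[i] / rate) % 1.0
        prev = cur
        # Weighted sum of the operators, i.e. what the filter would receive.
        out.append(sum(g * v for g, v in zip(gains, cur)))
    return out
```

With all modulation amounts at zero and only the first gain set, the output is a plain sine wave. Non-zero entries on the diagonal of mod make an operator modulate itself.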

Channels 5 through 15 are connected to the echo buffer. This, as well as the interrupt rate and hence the tempo of the song, is hardcoded in the logic design, because there was no need to make it CPU-controllable for the Parallelogram soundtrack. The echo facility has a small input FIFO and a small output FIFO, but the bulk of the echo buffer is stored in main memory, which is accessed by stalling the CPU just before it's about to fetch an instruction. The left and right parts of the echo output are flipped and mixed into the final sound signal, as well as fed back into the echo buffer.

The synthesiser, as described above, is only concerned with what goes on at sample rate (44.1 kHz). The CPU then modifies these parameters at control rate (100 Hz), in order to implement e.g. envelopes for the operator modulation parameters. This playroutine also updates some global variables reflecting the song position, the current bass drum level and so on, which are then accessed by the visual effects.

C-One hooked up to a UART via an opto isolator.

Toolchain

Apart from the assembler mentioned above, I wrote a tracker which could emulate the FM synthesiser. This allowed me to compose the music interactively on my regular computer. Another tool converts the music data into binary data that can be accessed by the demo, specifically by the playroutine executing in the timer interrupt.

The assembled demo is compressed by a custom packer, and prepended with decompression code. This becomes the demo binary, and is used as initial RAM contents when compiling the FPGA core. However, during development, I didn't want to recompile the logic design for every little change in the demo software. After all, recompiling all the Verilog code and mapping it to the FPGA takes approximately 40 minutes (with ten shader cores and the highest optimisation settings). Hence, I placed a little bootloader in the UART interrupt, and wrote a communication tool to send a demo binary over a serial cable into the chip. The C-One (somewhat surprisingly) does not have a serial port, so I just attached some wires to the mdb bus which is accessible from the extender board.

Finally, to get a nice video capture, I designed a communication protocol for transmitting compressed video frames from within the FPGA over the UART to the computer, where they get uncompressed and stored as pnm files. First I ran the demo in realtime, transmitting the current system time whenever a frame was generated. This gave me a log of which frames were actually present: it wouldn't be honest to present a video capture with a higher frame rate than the actual hardware, and besides some of the effects are stateful and depend on the timing of earlier frames. The demo was then restarted in a non-realtime mode, where the host requests frames (using the log) and the demo computes all effects according to the communicated timestamps rather than the system clock.

Demo code

The demo itself is organised in a pretty straightforward manner. As mentioned, the first thing that happens is that the code is decompressed. Then, the synthesiser is initialised and the screen displays a solid blue framebuffer for a couple of seconds, to allow the monitor to synchronise. Then, the timer interrupt is enabled, starting music playback. A main loop reads out the current song position and advances along a script, where the different parts of the demo are described using code pointers (there's a song position, a setup routine, and a per-frame routine).
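
A script-driven main loop of this kind could look roughly like the Python sketch below. The real demo is written in assembly, so this only illustrates the structure; the function name and the choice to run every setup routine that has been passed are assumptions.

```python
def run_script(script, song_positions):
    """script: list of (start_position, setup, per_frame), sorted by start_position.
    song_positions: the song position read out at each rendered frame."""
    current = -1
    for position in song_positions:
        # Advance along the script while the next part is due.
        while current + 1 < len(script) and script[current + 1][0] <= position:
            current += 1
            script[current][1]()              # setup routine, run once per part
        if current >= 0:
            script[current][2](position)      # per-frame routine
```

Each part's setup routine runs once when the song reaches its position, after which its per-frame routine is called for every frame until the next part takes over.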

Most effects calculate some per-frame parameters in the CPU, store the resulting values right into a shader, load the shader into the blitter, then blit. There are utility routines for common functionality, such as invalidating the cache or computing A*sin(B*t+C) where t is the global time.

Standalone extender board

Since the demo runs entirely on the extender board, the C-One mainboard isn't necessary. To make the demo platform a bit more portable, I made my own mainboard replacement. It contains a microcontroller for reading the core image off an SD card and transmitting it to the FPGA at power-on, and it has a bunch of discrete components doing digital-to-analogue conversion of the audio and video signals.

However, the demo is fully C-One compatible, meaning that if you own a C-One you can simply drop the core file into your machine and run it.

Final words

This project was quite a ride, as it basically involved learning Verilog, FPGAs and hardware design. I did have some contact with FPGAs during my engineering education, but in those courses we would just modify existing VHDL code, and all the tricky parts had already been taken care of. Hardware bugs are quite different from software bugs, and it was very frustrating and rewarding to learn about all the gotchas the hard way. Looking back it has been very enjoyable. Hopefully this will also inspire other people to learn new skills and to build cool things!

Posted Wednesday 11-Apr-2012 22:03

Discuss this page

Disclaimer: I am not responsible for what people (other than myself) write in the forums. Please report any abuse, such as insults, slander, spam and illegal material, and I will take appropriate actions. Don't feed the trolls.

Anonymous
Wed 11-Apr-2012 23:31
niiccee.
do you plan to release some more information regarding the bitbuf?
I want to build one myself.

Anonymous
Tue 16-Oct-2018 12:28
It saddens me you still haven't released the music in its original format...

Anonymous
Sun 23-May-2021 18:38
"It saddens me you still haven't released the music in its original format..."
Yeah, even something as dumb as MIDI would be nice ;-;

Anonymous
Fri 1-Oct-2021 18:43
Projects like this give me hope about humanity. Genuine works of love and passion for the sake of fun and beauty.

Warmest regards from Hungary :^)

Anonymous
Tue 15-Feb-2022 07:35
I guess creating a port of the FM Synth/tracker would be out of the spirit of the project.

Personally once it leaves the FPGA platform, it becomes just another FM soft synth, and is not as interesting.