PARAGON (Parallel Architecture for Rendering and Graphics OperatioNs) is a graphics processor I designed for a class. Unlike some other class projects, this one was a lot of fun to build and so I've decided to write about it.
The final design consisted of 12 compute cores drawing to a 320x240 framebuffer. The framebuffer stores 8 bit pixels (RGB 332) and each compute core accesses a single pixel at a time. Internally, the compute cores are a (rather strange) 24 bit Harvard RISC architecture. The architecture was designed to be as minimal as possible while still being very usable. (More on this later.) Each core runs in parallel and, for the most part, does not need to know that any other core exists. The video memory is organized in a 3 level hierarchy, with two levels of caching and the framebuffer at the highest level. (Note, however, that the final design omitted the level 2 cache, for reasons explained below.)
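To make the pixel format concrete: RGB 332 packs red, green, and blue into a single byte. Here is a quick Python sketch; the row-major framebuffer layout is my assumption, since the addressing isn't spelled out above.

```python
FB_WIDTH, FB_HEIGHT = 320, 240

def pack_rgb332(r, g, b):
    """r and g range over 0..7 (3 bits each); b over 0..3 (2 bits)."""
    return (r << 5) | (g << 2) | b

def pixel_offset(x, y):
    return y * FB_WIDTH + x  # assumed row-major, 1 byte per pixel

assert pack_rgb332(7, 7, 3) == 0xFF  # white
```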
The compute cores are designed around a minimal RISC architecture with 16 bit fixed width instructions and 24 bit registers. The register file contains 16 registers indexed 0 through 15 where register 0 is the zero register (sinking all writes and sourcing 0) and register 15 is the core ID register (sinking writes and sourcing the ID of the current core). The core ID register is what truly allows for parallelism, as otherwise each core would run the exact same instructions in the exact same order. (This can cause glitches depending on the kernel; more on this later.)
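For example, a kernel can use the core ID to carve up the screen: every core runs the same code but touches a disjoint set of rows. A sketch of the idea in Python (interleaved rows are one possible split, not necessarily what the demos use):

```python
NUM_CORES, FB_HEIGHT = 12, 240

def rows_for_core(core_id):
    """Rows handled by one core: its ID, then every 12th row after."""
    return range(core_id, FB_HEIGHT, NUM_CORES)

# Every row is covered exactly once across the 12 cores:
assert sum(len(rows_for_core(c)) for c in range(NUM_CORES)) == FB_HEIGHT
```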
To the untrained eye, a register which is always zero seems wasteful. If you need a zero, you can just load it in, right? But the zero register is a hallmark of RISC architectures. In fact, its inclusion makes the architecture smaller and simpler, although it also means the programmer has to be more aware of what they are doing. (In many cases, this can be handled by the assembler; e.g. mov rd, rs is translated to add rd, r0, rs.)
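A sketch of that assembler lowering in Python (the tuple encoding and the neg example are just for illustration, not real assembler internals):

```python
def expand(op, *args):
    """Lower pseudo-instructions onto the zero register."""
    if op == "mov":        # mov rd, rs  ->  add rd, r0, rs
        rd, rs = args
        return ("add", rd, 0, rs)
    if op == "neg":        # neg rd, rs  ->  sub rd, r0, rs (hypothetical)
        rd, rs = args
        return ("sub", rd, 0, rs)
    return (op, *args)     # real instructions pass through untouched

assert expand("mov", 3, 7) == ("add", 3, 0, 7)
```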
Instructions have 4 bit opcodes, meaning there are only 16 possible instructions. Fortunately, this limit was not reached in the final design. Most instructions have 3 operands, which is typical of RISC architectures, although some instructions use fewer.
The instruction set contains the following ALU operations: ADD, SUB, MUL, AND, OR, XOR, NOT, SH, ADDI. The first few are self-explanatory. SH is the shift instruction, which performs left, right, or arithmetic shifts on a register by an immediate (i.e. a constant which is not a register). ADDI is perhaps the most unusual of the ALU operations as it is not strictly necessary. Simply put, it adds an immediate to a register. Note, however, that the immediate is a signed 4 bit integer (ew). This is the only instruction to use a 4 bit signed immediate. Even though the instruction is absolutely not required, it makes writing programs much easier.
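As a back-of-the-envelope check, a 4 bit opcode plus three 4 bit fields fills the 16 bit instruction exactly, with ADDI's signed immediate living in the last field. The field order and opcode value below are my guesses, not the real encoding:

```python
def encode(opcode, a, b, c):
    """Pack four 4 bit fields into a 16 bit instruction word."""
    return (opcode << 12) | (a << 8) | (b << 4) | c

def sign_extend4(x):
    """Interpret a 4 bit field as two's complement (-8..7)."""
    return x - 16 if x & 0x8 else x

ADDI = 0x8                      # hypothetical opcode value
word = encode(ADDI, 3, 3, 0xF)  # addi r3, r3, -1
assert sign_extend4(word & 0xF) == -1
```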
The processor's memory operations are limited, loading and storing only 1 byte at a time. The LD and ST instructions are the only two memory operations and they are used to access video memory (pixelwise). Because there are no other memory operations, the astute may wonder how the program memory is accessed. The answer to this is that program memory cannot be accessed. It is read only and only ever touched by the control system (i.e. a programmer cannot read or write to it). The LDI instruction is in a similar vein, although it does not read from memory. Instead it loads an unsigned 8 bit immediate into a register.
Finally, the architecture includes two control flow instructions. B is the branch instruction. It supports both conditional and unconditional branches and can either branch to a PC relative immediate or directly to an address stored in a register. Once again, some may recognize that something is missing. Two things, actually. First, there is no compare instruction, which seems to imply that all branching is unconditional. However, all ALU operations modify a set of flags: N, Z, and C. These flags store whether the output of the ALU is (N)egative (i.e. the MSB is set), (Z)ero, or whether the operation generated a (C)arry-out. Using these flags, comparisons are typically performed by subtracting two registers and discarding the result. The other missing piece is some sort of linked branch (i.e. a function call). This is solved by the final control instruction, AIPC, or add immediate to program counter. The instruction adds a signed 8 bit immediate to the program counter (PC) and stores the result in a register. This is especially helpful for function calls: a caller can compute its return address with AIPC before branching to the function, and the function can later return with a register branch. The instruction is technically redundant since the PC is always known to the assembler, but it makes writing functions easier.
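To make the compare idiom concrete, here is a minimal Python model of the flag behavior. The exact carry semantics for SUB (carry set when no borrow occurs) are my assumption based on typical RISC designs:

```python
MASK = (1 << 24) - 1  # 24 bit registers

def sub_flags(a, b):
    """Model of a SUB used purely for comparison: keep flags, drop result."""
    result = (a - b) & MASK
    n = (result >> 23) & 1 == 1   # MSB set -> (N)egative
    z = result == 0               # all bits clear -> (Z)ero
    c = (a & MASK) >= (b & MASK)  # no borrow -> (C)arry-out (assumed)
    return n, z, c

# "Branch if r1 < r2 (unsigned)": subtract into the zero register,
# then branch on C clear.
n, z, c = sub_flags(5, 9)
assert not c  # 5 < 9, so the branch would be taken
```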
As mentioned above, the program memory is used solely by the processor's control unit. Even though it is not as extensible as the video memory, the program memory is obviously a necessary part of the system. Separating the program memory (ROM) from the video memory (VRAM) also helped keep the architecture smaller and simpler (and is what makes it a Harvard architecture).
One of the most important parts of a graphics processor is being able to run different kernels. (It wouldn't be very interesting if it could only ever do one thing.) Because of this, the program ROM is reprogrammable. By enabling the programming mode on the ROM, instructions can be streamed to the graphics processor and stored in order. Once programming is complete, the processor begins execution, starting at the first instruction.
But if the compute cores are independent of each other, how can they work with only one program ROM? Well, one solution to this is to just go core by core, handling each read request one at a time. Of course, this is horribly slow. The solution used by PARAGON is to copy the instruction ROM into multiple (individual) ROMs. Since the ROM is (as per the name) read only, there is no risk of processors modifying the instructions in the kernel. This means that every core can access the program ROM at the exact same time and not have to worry about waiting for other cores. On the other hand, this means that the program ROM is way bigger than it has to be. This is not a huge deal for PARAGON since the ROM is quite small (2048 instructions deep). I also believe that speed is a much more important factor in this case than memory utilization. (If read requests were handled one at a time, I'm inclined to believe that ROM would be 12x slower, meaning there wouldn't really be much benefit to having more than 1 core.)
The video memory (VRAM) is seemingly simple. Most of it (~60%) is allocated to the framebuffer (what you actually see on the screen). There is a bit of memory (~40%) beneath the framebuffer for general purpose use, although it is a bit limited since memory is accessed byte by byte. (Maybe in the future loads and stores can optionally work on multiple bytes.) Again, this doesn't seem too complex. But what happens when multiple cores try to use VRAM at the same time? The program ROM solution could be applied here too: just copy the VRAM for each core. That would keep VRAM fast, like it did for the program ROM. But while the program ROM is very small (4kB), copying VRAM for 12 cores would make it huge and absolutely impossible to fit. The real solution (and the one employed by PARAGON) is a caching scheme with memory arbitration.
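For reference, here is the memory map as described, sketched in Python. The 128KB total is an assumption on my part; it is simply the power of two that makes the ~60/40 split work out.

```python
VRAM_BYTES = 128 * 1024          # assumed total VRAM size
FB_WIDTH, FB_HEIGHT = 320, 240
FB_BYTES = FB_WIDTH * FB_HEIGHT  # 76800 bytes, 1 byte per pixel

FB_BASE = VRAM_BYTES - FB_BYTES  # framebuffer at the top of VRAM
GENERAL_BASE = 0                 # general purpose region below it

print(f"framebuffer: {FB_BYTES / VRAM_BYTES:.0%}")             # ~59%
print(f"general: {(VRAM_BYTES - FB_BYTES) / VRAM_BYTES:.0%}")  # ~41%
```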
The caching scheme is fairly simple (especially because of the way VRAM was actually implemented). Each core has a private write-through level 1 cache that stores 64 cache lines (each of which is 8 bytes wide; i.e. a total of 512 bytes of cache). The cache is direct mapped, meaning each address maps to exactly one cache line, so many different addresses compete for the same slot. (This can lead to thrashing due to conflicting cache accesses, but I am treating it as a non issue for the sake of PARAGON.) When a core wants to access VRAM, the cache first checks whether the memory has been cached already. If it has, great! On reads, just return the data from the cache line. On writes, modify it and forward the write to the next level. When the memory has not yet been cached, the L1 cache requests the line from the next level and waits for it to come back. But what happens if a core writes to memory that is cached by another core? There are a lot of different coherency schemes that can handle this, but PARAGON employs the simplest: such cache lines are invalidated. This means that if one core tries to read cached memory that has been modified by a different core, it will have to fetch the modified line from the next level.
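Since the cache is direct mapped, the address math is just bit slicing. A sketch (the field widths follow directly from the 8 byte lines and 64 entries):

```python
LINE_BYTES = 8   # 3 offset bits
NUM_LINES = 64   # 6 index bits

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)           # byte within the line
    index = (addr // LINE_BYTES) % NUM_LINES   # which cache line
    tag = addr // (LINE_BYTES * NUM_LINES)     # disambiguates conflicts
    return tag, index, offset

# Addresses 512 bytes apart land on the same line and evict each other:
assert split_address(0x000)[1] == split_address(0x200)[1]
```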
In the final design of PARAGON, the memory hierarchy does not include a level two cache. Instead, the memory is directly arbitrated between the private L1 caches and the full VRAM. The arbitration is a simple round-robin scheme: the memory arbiter keeps track of the last core it granted a request to, then searches for pending requests starting at the next core before wrapping back around. This means that no single core can hold the memory arbiter forever.
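In software terms, the arbiter does something like the following (a sketch of the scheme, not the actual RTL):

```python
def arbitrate(requests, last_granted, num_cores=12):
    """Grant the first pending request after the last granted core."""
    for i in range(1, num_cores + 1):
        core = (last_granted + i) % num_cores
        if core in requests:
            return core
    return None  # no core is requesting memory

assert arbitrate({0, 5}, last_granted=0) == 5  # core 0 can't hog the bus
assert arbitrate({0, 5}, last_granted=5) == 0
```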
Yes! PARAGON was implemented on a Xilinx Spartan 7 FPGA on the RealDigital Urbana Board. The Spartan 7 is a decent low end FPGA with a fair bit of resources. Originally, I thought that my limiting factor would be LUTs. However, once I actually built the thing, I quickly found out that I was mostly limited by the RAM. The Spartan 7 (XC7S50) contains 75 RAMB36 blocks. These are synchronous, dual port, 36 kilobit SRAM blocks. There are four ways in which RAM is used by the processor: program memory, cache memory, video memory, and CPU memory. I haven't mentioned anything about a CPU so far, but it is a necessary part of the graphics processor. Of course, you wouldn't have a computer with a GPU but no CPU. The CPU used in the design is the Xilinx MicroBlaze processor. I stripped out as much as I could from it to make it small (and slow; CPU performance is not an integral part of the graphics processor). The MicroBlaze seems to require about 16KB of memory including all the graphics demos, although I allocated 64KB just to be safe. All in all, the biggest memory consumer is the VRAM, because it is implemented entirely in on-chip memory. The dev board does come with 128Mbits of DDR3 memory, but I had a lot of difficulty trying to get it to work and I eventually scrapped that idea. The upside to using the on-chip memory is that it is very fast: the memory has a 1 clock cycle read latency, much faster than however many clock cycles it would take for DDR3 to be ready. This is what led to the omission of an L2 cache. If the large, slow DDR3 had been used, an L2 cache would likely have been necessary.
So how fast is it? Well, each compute core runs at 50MHz. Originally the plan was to run them at 100MHz, but that proved impossible for two reasons. Firstly, the processor cores are not particularly efficient; that is, they are not pipelined and expect all data to be ready in the same clock cycle. This does not scale to higher frequencies, and it is what initially caused me to lower the clock to 50MHz. (This might seem like an oversight on my part, but remember that this was a class project and I was on a time limit. Pipelined processor design is not the easiest and would greatly increase the development time.) The other limiting factor is physical. The processor uses logic cells from all over the chip, and routing signals from one corner of the chip to another is slow. I'm sure I could bump up the frequency a little, but 50MHz is a nice round number.
I designed a bunch of demos to showcase PARAGON's processing power. Since the architecture contains only integer arithmetic and no special graphics operations, kernels are limited to fixed point arithmetic and manually writing out necessary functions (like matrix multiplication or dot product). (Floating point is technically possible using a software floating point library, but I haven't bothered to try implementing one as it is likely very difficult.) Surprisingly, even though kernels were limited to fixed point, I was still able to get some good looking demos.
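For the unfamiliar: fixed point just means picking an implicit binary point inside an integer. Here is the basic idea in Python; the Q12.12 split is only an example, since the architecture just gives you 24 bit integers and a shift instruction (and unlike the real registers, Python integers don't wrap at 24 bits):

```python
FRAC = 12  # example split: 12 integer bits, 12 fractional bits

def to_fix(x):
    return int(round(x * (1 << FRAC)))

def to_float(x):
    return x / (1 << FRAC)

def fix_mul(a, b):
    # MUL produces the raw product; a shift (SH) rescales it back.
    return (a * b) >> FRAC

a, b = to_fix(1.5), to_fix(-0.25)
print(to_float(fix_mul(a, b)))  # -0.375
```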
The following video shows the processor running the Rule 30 elementary cellular automaton. The simulation is artificially slowed and runs on only a single core. Yes, this is intentional. The simulation was otherwise way too fast and did not look good.
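The update rule itself is tiny: each new cell is its left neighbor XORed with the OR of the old cell and its right neighbor. In Python (whether the kernel wraps at the screen edges is my assumption):

```python
def rule30_step(row):
    """One generation of Rule 30 over a row of 0/1 cells."""
    n = len(row)
    return [row[(i - 1) % n] ^ (row[i] | row[(i + 1) % n]) for i in range(n)]

row = [0] * 320
row[160] = 1              # single seed cell
for _ in range(240):      # one generation per framebuffer row
    row = rule30_step(row)
```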
You might notice the strange purple bar on the left side of the screen. This is caused by the (soft) HDMI encoder that is used to generate the video output. Can this be fixed? Probably. Is it a big deal? I don't think so. And so it stays.
The demo looks cool (in my opinion) but it doesn't showcase the real power of the processor. For that, I wrote a Mandelbrot set rendering kernel. The Mandelbrot set is an "image" that is found by computing the rate of divergence of the iteration z_next = z_prev^2 + c. I don't want to talk too much about the math, but the kernel just computes the formula a bunch of times and sees when (or if) the variable eventually blows up (i.e. gets really big in magnitude). The demo shows the computation twice: once on a single core and again using all 12 cores.
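For reference, here is the escape-time loop each pixel runs, written in floating point Python for clarity (the actual kernel does the same thing in fixed point, and the iteration cap here is arbitrary):

```python
def escape_count(cx, cy, max_iter=32):
    """How many iterations until z = z^2 + c blows up (|z| > 2)."""
    zx = zy = 0.0
    for i in range(max_iter):
        zx, zy = zx * zx - zy * zy + cx, 2 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return i            # the count picks the pixel's color
    return max_iter             # never diverged: assume it's in the set

print(escape_count(-0.5, 0.5))  # a point near the set's boundary
```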
Clearly, the single core rendering is very slow while the multicore rendering is quite fast. Even though cores still have to wait their turn to write to memory, the heavy computation involved means that the memory bus sits mostly idle and so there are few conflicts between cores.
Another interesting demo is Conway's Game of Life. The GoL is another cellular automaton but produces more interesting patterns than Rule 30. The demo uses Rule 30 as a pseudorandom initial state and then simulates the Game of Life.
The demo runs the Game of Life on multiple cores. This is not because the simulation is computationally complex, but because it requires heavy memory usage. For each pixel, the kernel fetches the 8 surrounding pixels to compute the next state, which is stored in the right half of the framebuffer. Once all cells have been computed, the next state buffer is copied back to the left half of the framebuffer so that the next computation can begin. I have also run this demo on a single core and it is abysmally slow. So, even though it is a memory heavy kernel, the cores are still able to share the memory bus effectively, thereby increasing the simulation speed.
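For clarity, here is the per-generation update in Python; the kernel does the same thing pixel by pixel, with the two halves of the framebuffer acting as the two buffers (edge wrapping is my assumption):

```python
def gol_step(grid):
    """One Game of Life generation; grid is a list of rows of 0/1."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Count the 8 surrounding cells.
            n = sum(grid[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            # Born with exactly 3 neighbors; survives with 2 or 3.
            nxt[y][x] = 1 if n == 3 or (grid[y][x] == 1 and n == 2) else 0
    return nxt  # copied back to the front buffer before the next step
```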
The final interesting demo is only half baked (visually and computationally, but still interesting). It rotates the vertices of a cube by applying a rotation matrix to them. The demo is also artificially slowed and runs on a single core, since it did not look good at true computational speed.
Again, due to time constraints, I did not make the cube visually appealing (edges or face coloring). However, the video clearly shows a rotating cube. Eventually, the cube begins to fall apart and this is due to inaccuracy in the rotation matrix and vertex coordinates. Because they are fixed point, only so much accuracy can be encoded. To preserve some accuracy, the matrix values were precomputed and somewhat optimized (I checked a lot of different values by hand and the one in this demo is one of the best ones I found). The kernel does not re-normalize the vertices, even though it should, once again due to time constraints.
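The drift is easy to reproduce: rotate a point many times with a quantized matrix and its magnitude wanders, because the quantized cosine/sine pair is not exactly orthonormal. A Python sketch (the Q12 format and angle here are illustrative, not the demo's actual values):

```python
import math

FRAC = 12
theta = 2 * math.pi / 256
c = int(round(math.cos(theta) * (1 << FRAC)))  # quantized cosine
s = int(round(math.sin(theta) * (1 << FRAC)))  # quantized sine

x, y = 1000, 0
for _ in range(10000):  # apply many small rotations
    x, y = (x * c - y * s) >> FRAC, (x * s + y * c) >> FRAC

# The radius should still be 1000, but the error accumulates:
print(x, y, math.hypot(x, y))
```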
The rotating cube demo is also single core for another reason (which is hinted at above). The cube's vertices are stored in VRAM right below the framebuffer (so they are not visible in the video output). Because of this, if multiple cores read the vertices to draw them to the screen, then modify the vertices and store them back, the coordinates end up corrupted and too many vertices are drawn to the screen. The real fix is to have each core work on an independent set of vertices and synchronize the cores before clearing the screen. I elected not to do this since it adds additional complexity and the result would still look bad due to being much too fast.
Below are some more fun renderings.
| Blurred Mandelbrot Set | Blurred Rule 30 Simulation |
| Julia Set 1 | Julia Set 2 |
| Julia Set 3 | Burning Ship Fractal |
All these demos were painstakingly written by hand in PARAGON assembly.
Even though it clearly works, PARAGON is lacking some features. The inclusion of large DDR3 memory would be very useful as it would allow for an increased framebuffer resolution and also add a ton more memory for general purpose use (e.g. double buffering). Since VRAM is byte addressable (at least as far as the compute cores are concerned), it is difficult (or at least annoying) to store large 24 bit values from the registers into memory: each register takes three byte stores, plus the shifts to slice it up. One fix might be to add a private scratchpad where cores could keep general purpose data, function call stacks, etc. Currently, recursion and functions calling functions are very difficult to implement due to the byte addressing limitation.
The processors could also use some form of hardware synchronization. A simple solution is adding a SYNC instruction. Of course, this instruction could share an opcode with other instructions to form a set of processor control instructions. Hardware synchronization would make it possible to write kernels where cores must interact with each other, without resorting to clunky software synchronization when their execution diverges.
Fixed point arithmetic is horrible to work with. Poor precision, rounding errors, etc. The addition of a floating point unit would greatly increase the ability to write high quality, high precision kernels. Of course, FPUs are huge, so there would likely only be one or two shared between all the cores. Yes, this would be slow, but it's worth having high precision arithmetic.
All in all, I think I'm happy with how this turned out.