Follow @caervs

Setting Up an Open Source FPGA

Hey folks, after a 3 year hiatus (in which time I found and married the love of my life!), we are back to it with The Lambda Scheme. Since the time I wrote my last post AI has gone from a hot space in the software world to a mainstream conversation topic around the family dinner table. What a world. So you would think that AI will surely be top of mind for us here on The Lambda Scheme. We'll get to that.

When I left things off here I was trying to make a neural net classifier for musical instruments based on their harmonic signatures. I never quite got it to work so I thought upon my return it would be good to build up to something that complex from some AI fundamentals. That's what we'll be doing in the next few posts.

In my next posts we will see how you can get a neural net running on different kinds of hardware to give it a speed boost. Doing so will give us a solid appreciation for how LLMs really work. As a part of my research for the next post I had to get an FPGA running which was a harrowing adventure. So in this post I'll share my findings for anyone who wants to do anything similar with FPGAs. FPGAs are fun and versatile so setting one up is an investment and we will for sure see them again in this blog.

For any of you who have never seen one before, an FPGA is basically a chip that can efficiently emulate any other digital hardware. When chip designers want to test out their designs, they do so on an FPGA before having the chips fabricated en masse. In college they were central in our advanced CPU design courses. Today we won't be doing anything quite so advanced. We are just going to set up a basic hello world. Since this is still rather difficult I made a post out of it. As is our fashion on the Lambda Scheme we'll also get a little philosophical and take the opportunity to really understand the broader mental models that go into getting an FPGA set up.

Commercial vs Open Source

But first a rant...

When I first started this project I figured I should go with the time-tested big names in hardware, so I ordered myself an Arty A7 board with a Xilinx chip on it. But seriously, just try getting this thing's toolchain running on Linux. After providing lots of personal details to make an account, I had to download a 54GB (54 GIGABYTE!) toolchain from a CDN that was limited to single digit Mbps... only to discover hours later that among the myriad options I'd selected, I'd opted for the enterprise version of the toolchain which requires a paid license. FML. After getting the right version, setting up the toolchain proved to be a configuration nightmare.

So after that first experience I sent that thing straight back and opted for a fully open source board, the Radiona ULX3S, with a fully open source toolchain. Lean, easy to install, easy to use. God bless.

Since I didn't find a clear guide online for setting up the code scaffolding for the chip, think of this post as the missing tutorial with some general hardware knowledge sprinkled in for those who want to learn.

Basics of an FPGA

Before diving into our tutorial, let's talk a little bit about how FPGAs work, just enough to know how to use them correctly. If you ever took a course on computer architecture or read my previous post on Quantum computing this diagram will look familiar.

It represents a simple digital circuit using abstract building blocks called logic gates. In this example (taken from Boole's own original writings) we have a circuit to evaluate the Biblical dietary restriction "Clean beasts are those which both divide the hoof and chew the cud."

When digital circuits like our kosher box above have no state we call them "combinational" and we can fully capture their behavior with a truth table like this one

So now if we want to simulate this circuit, we don't need to simulate the logic gates that make it up and how they interact. Another approach is to store all of these possible outputs in memory and then just "look up" the one that corresponds to our input.

This approach, while usually less efficient in hardware, is more versatile because we can take the same "look up circuit" and reconfigure it to mimic the behavior of any combinational circuit. Arrange these "lookup tables" (LUTs) so that they feed into each other based on your configuration and you have yourself an FPGA!
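To make the lookup idea concrete, here is a minimal Python sketch. The helper names `make_lut` and `lookup` are made up for illustration and are not part of any FPGA tool. It precomputes a truth table and then "evaluates" the circuit purely by indexing into it.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Precompute func for every combination of n_inputs bits."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]

def lookup(table, *bits):
    """Evaluate the circuit by indexing the stored table; no gates involved."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit  # first input is the most significant bit
    return table[index]

# The "clean beast" rule: divides the hoof AND chews the cud.
clean_beast = make_lut(lambda divides_hoof, chews_cud: divides_hoof & chews_cud, 2)

# Reconfiguring the same lookup circuit as a different gate only changes the table.
xor_gate = make_lut(lambda a, b: a ^ b, 2)

print(clean_beast)                # [0, 0, 0, 1]
print(lookup(clean_beast, 1, 1))  # 1
print(lookup(xor_gate, 1, 0))     # 1
```

The point of the sketch is that `lookup` never changes. Only the stored table does, which is roughly what "configuring" a LUT means.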

Bigger grids of these circuits can mimic more complex designs. In theory this can be any combination of digital circuits, including a neural net. To prove this theory, I want to take a relatively small LLM like TinyStories and emulate part of it on an FPGA.

So using this knowledge of our chip's internals, we can do some back of the envelope math to check what it will take. Our "tiny" model has a few variants, ranging from 1M parameters to 33M. As a rough estimate we can imagine mapping each parameter to a LUT, corresponding to one pairing of a layer weight with a layer input. Now our little FPGA only has about 12k LUTs (explored later) so this sadly won't be enough for a full model. We'd need something beefier (and more expensive) like this commercial large-emulation chip and/or we'd need to take advantage of built-in add/multiply blocks on some FPGAs so we don't just burn general purpose LUTs for our neural net.
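The envelope math above can be written out as a tiny Python sketch. It leans on the same crude assumption as the text (one LUT per parameter) and on a round figure of 12,000 LUTs for the chip, so treat the results as orders of magnitude only.

```python
def lut_shortfall(parameters, luts_available):
    """How many times more LUTs the model needs than the chip offers,
    assuming (very roughly) one LUT per parameter."""
    return parameters / luts_available

ULX3S_LUTS = 12_000  # assumed round figure for the 12k chip

for name, params in (("1M", 1_000_000), ("33M", 33_000_000)):
    factor = lut_shortfall(params, ULX3S_LUTS)
    print(f"{name} parameters: ~{factor:.0f}x the LUTs we have")
```

Even the smallest variant comes out at about 83 times the available LUTs under this estimate, which is why splitting the model into parts is the more realistic plan.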

For our goals, what we can more realistically do is break up our LLM into parts and see if we can get some of those parts running on an FPGA. This will be a great opportunity for us to really learn the internals of an LLM. My suspicion is that a pipeline of FPGAs each doing some stage of the transformation could outperform a GPU, but we'll find out if that's right! Before that, we'll do some basic mucking around with LLMs in our next few posts and before that we'll continue with our basic set-up of our FPGA.

The Open Source Toolchain

To get an FPGA to do what you want you start with a digital circuit design that you want it to emulate. If you've ever used something like logisim before, the idea is similar, but instead of GUIs the pros use a hardware description language to specify the circuit. From here your steps are basically:

Convert the design into a bitstream that you can send to the ULX3S over its USB port
Send the bitstream onto the FPGA at which point it will start emulating your circuit

Yosys

Yosys is the synthesis tool: it reads your circuit design and converts it into a netlist, which is the first step on the way to a bitstream.

Downloads are available from the official GitHub page. Once downloaded, you can follow the instructions to add oss-cad-suite to your PATH and you should be able to see

$ yosys --version
Yosys 0.54+37 (git sha1 99f7d79ab, clang++ 18.1.8 -fPIC -O3)

OpenFPGALoader

Next, openFPGALoader is how you will interface with the hardware itself and get your bitstream onto the FPGA. This should also come with the oss-cad-suite above. Confirm by running

$ openFPGALoader --Version
openFPGALoader v0.13.1

Some useful basic commands here for testing the operation of this tool

Command	Function
`--list-boards`	List the boards which `openFPGALoader` is able to support
`--list-cables`	List the USB-to-JTAG adapters that `openFPGALoader` supports
`--scan-usb`	Scan for boards to connect to
`--detect`	Connect to a board and get info

We'll use it here to do a sanity check and ensure we can connect to our board

$ sudo openFPGALoader --scan-usb
empty
Bus device vid:pid       probe_type manufacturer      serial product
001 087    0x0403:0x6015 ft231X     FER-RADIONA-EMARD D01205 ULX3S FPGA 12K v3.0.8

$ sudo openFPGALoader -b ulx3s --detect
empty
Jtag probe limited to 3MHz
Jtag frequency : requested 6000000Hz -> real 3000000Hz
ret 0
index 0:
        idcode 0x21111043
        manufacturer lattice
        family ECP5
        model  LFE5U-12
        irlength 8

Note you will need to run these with sudo because we haven't set up our udev rules yet which we'll do next!

Udev Rules

A udev rule is basically a Linux configuration that tells the operating system what to do when a device is connected. In this case we'll set up a udev rule that gives you permission to access the FPGA when it's connected so that you don't need to keep running commands as root.

So start by getting your device's vendor and product info. In this case we'll list our USB devices and look for the FTDI USB Bridge and get the vendor and product ID from the output.

$ lsusb | grep Future
Bus 001 Device 087: ID 0403:6015 Future Technology Devices International, Ltd Bridge(I2C/SPI/UART/FIFO)

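Plug these values into a new udev rules file that assigns the device to the "plugdev" group and makes it readable and writable without root. Note that MODE="0666" opens the device to every user on the machine; use "0660" if you want to restrict access to the group.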

sudo tee /etc/udev/rules.d/99-ulx3s-ftdi.rules >/dev/null <<'RULES'
# ULX3S / FTDI access for non-root
SUBSYSTEM=="usb", ATTR{idVendor}=="0403", ATTR{idProduct}=="6015", GROUP="plugdev", MODE="0666", TAG+="uaccess"
RULES

Tell udev to load the new rule

sudo udevadm control --reload-rules
sudo udevadm trigger

and ensure you are in the right groups

sudo usermod -aG plugdev,dialout $USER

and now you should be able to run all of the openFPGALoader commands above without sudo (you may need to log out and back in first for the new group membership to take effect).
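If your board reports different IDs, the "read the IDs from lsusb, plug them into a rule" step can be scripted. This is a small Python sketch rather than anything the toolchain ships. The function name `udev_rule_from_lsusb` is made up, and the group and mode defaults simply mirror the rule shown earlier.

```python
import re

def udev_rule_from_lsusb(line, group="plugdev", mode="0666"):
    """Build a udev rule line from one line of `lsusb` output."""
    match = re.search(r"ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})", line)
    if match is None:
        raise ValueError("no vendor:product ID found in lsusb line")
    vendor, product = match.groups()
    return (
        f'SUBSYSTEM=="usb", ATTR{{idVendor}}=="{vendor}", '
        f'ATTR{{idProduct}}=="{product}", GROUP="{group}", '
        f'MODE="{mode}", TAG+="uaccess"'
    )

lsusb_line = (
    "Bus 001 Device 087: ID 0403:6015 Future Technology Devices "
    "International, Ltd Bridge(I2C/SPI/UART/FIFO)"
)
print(udev_rule_from_lsusb(lsusb_line))
```

For the lsusb line from this post, the output is the same rule line that went into the rules file.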

Nextpnr and ECPack

Two more tools sit between yosys, which takes your source code and starts to turn it into something you can put on an FPGA, and openFPGALoader, which sends the final bitstream to the FPGA.

nextpnr-ecp5 and ecppack each take an intermediate abstract representation of your design and add some specifics for your board to get the final right configuration.

nextpnr-ecp5 is what's called a "place and route" tool which decides which specific LUTs will be used to emulate the different components of your design and which routes between them will model the connections between the components.

ecppack takes that placement and encodes it as a stream of bits which can be sent one-by-one to the board in a way it understands to configure itself.

The full pipeline looks like this
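In text form, the pipeline can be sketched as a list of four commands, one per stage. The Python below only builds and prints the command lines; it does not run anything. The flags are the same ones used in the Makefile in the next section, and the function name `build_commands` is made up for illustration.

```python
def build_commands(lpf="ulx3s_12f.lpf", package="CABGA381"):
    """Return the four toolchain stages, in order, as argument lists."""
    return [
        # 1. Synthesis: Verilog source -> netlist JSON
        ["yosys", "-p",
         "read_verilog top.v uart.v; synth_ecp5 -top top -json top.json"],
        # 2. Place and route: netlist + pin constraints -> text config
        ["nextpnr-ecp5", "--json", "top.json", "--lpf", lpf,
         "--textcfg", "top.config", "--12k", "--package", package],
        # 3. Packing: text config -> bitstream
        ["ecppack", "top.config", "top.bit"],
        # 4. Loading: bitstream -> board
        ["openFPGALoader", "--board=ulx3s", "top.bit"],
    ]

for stage, command in enumerate(build_commands(), start=1):
    print(stage, " ".join(command))
```

Each stage consumes the file produced by the previous one: top.v becomes top.json, then top.config, then top.bit.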

Code Scaffolding

Makefile

Ok let's tie all these tools together into a Makefile

Store this at Makefile

PACKAGE = CABGA381

# Note: every recipe line must start with a tab character, not spaces.

top.json: top.v uart.v
	yosys -p 'read_verilog top.v uart.v; synth_ecp5 -top top -json top.json'

top.config: top.json ulx3s_12f.lpf
	nextpnr-ecp5 --json top.json --lpf ulx3s_12f.lpf --textcfg top.config --12k --package $(PACKAGE)

top.bit: top.config
	ecppack top.config top.bit

flash: top.bit
	openFPGALoader --board=ulx3s top.bit

# .PHONY does not expand patterns like flash-%, so list the plain targets here.
.PHONY: flash clean

flash-%:
	@echo "🔁 Linking $*.v to top.v and flashing..."
	ln -sf $*.v top.v
	$(MAKE) flash

clean:
	rm -f top.bit top.config top.json

I've configured this makefile to support loading different designs onto the board depending on which one you select. If I for instance have a file called echo.v then I can run make flash-echo which will load this onto the board.

There are a handful of files which are generic configuration files for my board which will always be the same so really what I'm doing is swapping in one particular file (in this case echo.v) by symlinking it to top.v and then running the toolchain with top.v and my other files.

Let's go over these one by one.

Netlist JSON

top.json: top.v uart.v
    yosys -p 'read_verilog top.v uart.v; synth_ecp5 -top top -json top.json'

As mentioned, the first stage in our pipeline is to create a generic netlist JSON. This file describes your design in abstract terms that are agnostic to your particular board but more dumbed down and easier to parse than raw Verilog. It takes as input the top.v file with my main design as well as a uart.v module which is a general UART I'll be covering in the next section. Once you add these files you'll be able to run make top.json to get the netlist.

We can probe into this file to get a little more detail in its internals. We'll see it has two top-level structures for information

$ jq keys < top.json
[
  "creator",
  "modules"
]

See what it specifies as the creator

$ jq .creator < top.json
"Yosys 0.54+37 (git sha1 99f7d79ab, clang++ 18.1.8 -fPIC -O3)"

count the distinct modules it produces

$ jq '.modules | length' < top.json
91

probe into one particular module

$ jq '.modules.top | keys' < top.json
[
  "attributes",
  "cells",
  "netnames",
  "ports"
]

and even see how particular LUTs are configured

$ jq '.modules.top.cells.led_reg_LUT4_C_1' < top.json
{
  "hide_name": 0,
  "type": "LUT4",
  "parameters": {
    "INIT": "1111000011001100"
  },

Pretty cool!
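To see what that INIT string actually means, here is a small Python sketch that treats it as a 16-entry truth table. It rests on two assumptions that are not spelled out in the jq output: that the string is written most significant bit first, and that input A is the least significant bit of the lookup index (with D the most significant).

```python
INIT = "1111000011001100"  # copied from the jq output above

def lut4(init, a, b, c, d):
    """Look up one output bit of a 4-input LUT.

    Assumes A is the least significant index bit and D the most significant.
    """
    index = (d << 3) | (c << 2) | (b << 1) | a
    # The string is MSB first, so bit `index` sits `index` characters
    # from the right-hand end.
    return int(init[len(init) - 1 - index])

for d in (0, 1):
    for c in (0, 1):
        for b in (0, 1):
            for a in (0, 1):
                print(f"D={d} C={c} B={b} A={a} -> {lut4(INIT, a, b, c, d)}")
```

Under those assumptions this particular table works out to a 2-to-1 multiplexer: the output follows B when D is 0 and follows C when D is 1, and A is ignored.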

Config

The next step is what we call "place and route" - this is where we take the abstract set of components and connections we have in our JSON file and map them to specific LUTs and wirings in our FPGA. To do this, our place and route tool nextpnr-ecp5 needs to know the specifics of our board like what LUTs are available and what they're connected to.

The model name of the FPGA that's on the ULX3S board will look like LFE5U-XXF-6BG381C where XX is roughly the number, in thousands, of LUTs in the chip (12, 25, 45, or 85). The 6BG381C at the end tells us, among other things, about the packaging of our chip. Specifically it says that underneath the chip there is a ball grid of 381 contact points and these are how we interface with the chip. Since these touch points are in a grid we can reference them by names like "H3" or "L1", meaning "column H row 3" etc. On the board, these contact points touch actual devices like buttons, lights, and USB interfaces which we want to be able to reference with more abstract names like "led3" or "wifi_gpio0" so we need something that tells the tool how to map from names to specific pins. This information all lives in a constraint file called an .lpf file.

I've included my full lpf file for the ULX3S here. In it you'll find lines like

LOCATE COMP "led[0]" SITE "B2";

These basically say that when our HDL connects something to led[0], it's connected to touch point B2 on the chip (which on our board is then connected to LED 0). Putting the command all together we get the make rule

PACKAGE = CABGA381

top.config: top.json ulx3s_12f.lpf
    nextpnr-ecp5 --json top.json --lpf ulx3s_12f.lpf --textcfg top.config --12k --package $(PACKAGE)

where --12k says we're configuring for a twelve-thousand LUT chip, the CABGA381 package is used for a 381 touch point ball grid array packaging, and we provide our constraint file as ulx3s_12f.lpf.
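The LOCATE lines in the constraint file amount to a name-to-pin table. As a rough illustration, this Python sketch pulls that table out with a regular expression. It only handles the simple LOCATE COMP ... SITE ... form quoted above and is not a full .lpf parser; the function name `parse_locations` is made up.

```python
import re

# Matches lines of the form: LOCATE COMP "name" SITE "pin";
LOCATE = re.compile(r'LOCATE\s+COMP\s+"([^"]+)"\s+SITE\s+"([^"]+)"\s*;')

def parse_locations(lpf_text):
    """Map each signal name to the package pin it is located on."""
    return {name: site for name, site in LOCATE.findall(lpf_text)}

sample = 'LOCATE COMP "led[0]" SITE "B2";'
print(parse_locations(sample))  # {'led[0]': 'B2'}
```

Running it over the whole constraint file would be a quick way to check which pin a signal like led[0] ends up on before flashing.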

Some example lines you might see in the resulting top.config file include

.tile R18C4:PLC2
arc: A5 V02S0101
word: SLICEA.K0.INIT 0011001100111100
enum: SLICEA.MODE CCU2

This config snippet targets a "tile" in the FPGA, which is a group of four "slices": SLICEA, SLICEB, SLICEC, and SLICED. We specifically target tile R18C4:PLC2 (row 18, column 4 of the chip's grid). The arc directive enables a routing connection, here feeding the tile's A5 input from the routing wire V02S0101. The word directive lets us configure an individual LUT with particular lookup values and the enum directive lets us set the mode for our slice (there are a few things they can be besides basic LUTs, don't worry about it too much).

Tying it All Together and Flashing

Alright, that's enough horsing around with internals, let's get to getting something on the board. With all of these bits in place you can use the ecppack tool to turn a config file into a bit sequence that can be streamed to the FPGA over USB to configure it.

top.bit: top.config
    ecppack top.config top.bit

and then openFPGALoader to actually do the streaming

flash: top.bit
    openFPGALoader --board=ulx3s top.bit

As a convenience, I wrote this top-level phony make target that wraps everything all together.

flash-%:
    @echo "🔁 Linking $*.v to top.v and flashing..."
    ln -sf $*.v top.v
    $(MAKE) flash

It basically lets you have a bunch of different designs in the same directory to play around with and then at build time will hot-swap in the one you want (via symbolic linking) and flash that onto the board. So I have for instance the echo.v program we'll go over in the next section and can run make flash-echo to flash it onto the board.

Hello, World!

And now the time has come. With all that investment it's time to get a little payoff (more to come in future posts!) and set our FPGA up with a simple hello, world! program. This program is going to read characters one at a time from the input USB stream and echo them back. It will also set the LEDs of the FPGA as it receives characters so we can see a little action on the board. While not the most exciting, this program will be an important step to building up more complex behavior down the line where we will need to stream tokens to/from our board if we want to use it to run LLMs.

So let's put in our echo program.

module top (
  input        clk_25mhz,
  output [7:0] led,
  output       wifi_gpio0,
  input        ftdi_txd,
  output       ftdi_rxd
);

  assign wifi_gpio0 = 1;

  wire clk = clk_25mhz;
  wire reset = 0;

  wire       uart_txd_ready, uart_rxd_strobe;
  reg        uart_txd_strobe = 0;
  reg  [7:0] uart_txd;
  wire [7:0] uart_rxd;
  reg  [7:0] led_reg;

  uart #(.DIVISOR(2604)) uart_inst (
    .clk(clk),
    .reset(reset),
    .serial_txd(ftdi_rxd),
    .serial_rxd(ftdi_txd),
    .txd(uart_txd),
    .txd_ready(uart_txd_ready),
    .txd_strobe(uart_txd_strobe),
    .rxd(uart_rxd),
    .rxd_strobe(uart_rxd_strobe)
  );

  assign led = led_reg;

  always @(posedge clk) begin
    uart_txd_strobe <= 0;

    if (uart_rxd_strobe) begin
      led_reg <= uart_rxd;
      uart_txd_strobe <= 1;
      uart_txd <= uart_rxd;
    end
  end
endmodule

Breaking down a couple of critical bits here we have

module top (
  input        clk_25mhz,
  output [7:0] led,
  output       wifi_gpio0,
  input        ftdi_txd,
  output       ftdi_rxd
);

This top module is what yosys reads as the top-level description of our hardware (via the -top top CLI flag), sort of like a main function but for hardware. This top-level module can then pull in whatever other modules make up the rest of the design and tie them together as needed. In this case we include another component, a UART, which we'll cover in more detail below.

  uart #(.DIVISOR(2604)) uart_inst (
    .clk(clk),
    .reset(reset),
    .serial_txd(ftdi_rxd),
    .serial_rxd(ftdi_txd),
    .txd(uart_txd),
    .txd_ready(uart_txd_ready),
    .txd_strobe(uart_txd_strobe),
    .rxd(uart_rxd),
    .rxd_strobe(uart_rxd_strobe)
  );

but for now think of it as our bridge to the USB interface. We'll write and read characters via USB and this component lets us stream those between our board and our computer.

   always @(posedge clk) begin
        uart_txd_strobe <= 0;

        if (uart_rxd_strobe) begin
             led_reg <= uart_rxd;
             uart_txd_strobe <= 1;
             uart_txd <= uart_rxd;
        end
   end

This says that we want a hardware block that is triggered on every clock edge to see if the UART has a character for us to read. If so, read the character in, display its bits with our LEDs so we can see it on the board, and write the character back. Now to get this to work we need a UART which can stream these bits. Let's talk a little about what this component is.
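If the clocked block is hard to picture, here is a software model of it in Python: a sketch of the echo logic only, with the UART replaced by hand-fed inputs. The function name `echo_step` and the dictionary-based state are made up for illustration, and the LED register is assumed to start at 0 (the Verilog leaves it uninitialized).

```python
def echo_step(state, rxd_strobe, rxd):
    """Advance the echo logic by one clock edge and return the new state.

    Non-blocking assignments all read the old register values, so the next
    state is built from a copy rather than updated in place.
    """
    nxt = dict(state)
    nxt["txd_strobe"] = 0        # default: no transmit request this cycle
    if rxd_strobe:               # the UART has just finished receiving a byte
        nxt["led"] = rxd         # show its bits on the LEDs
        nxt["txd_strobe"] = 1    # ask the UART to send...
        nxt["txd"] = rxd         # ...the same byte back
    return nxt

state = {"led": 0, "txd": 0, "txd_strobe": 0}  # assumed power-on values

state = echo_step(state, rxd_strobe=1, rxd=ord("A"))
print(format(state["led"], "08b"), state["txd_strobe"])  # 01000001 1

state = echo_step(state, rxd_strobe=0, rxd=0)
print(format(state["led"], "08b"), state["txd_strobe"])  # 01000001 0
```

The second step shows why the strobe is cleared at the top of the block: the transmit request lasts exactly one clock cycle, while the LEDs keep showing the last byte.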

When I write characters to my board they get sent through the board's USB interface into a chip called its USB bridge. Chips like this are most commonly manufactured by Future Technology Devices International so we just call them FTDI chips for short. This chip handles all the complexity of communicating over USB and translates these interdevice signals into a much simpler protocol, in this case a transistor-transistor logic (TTL) serial signal.

Thanks to this approach, the component we need to write for our board is fairly simple. It looks like this.

module uart #(parameter DIVISOR=40)(
  input clk,
  input reset,
  output serial_txd,
  input serial_rxd,
  input [7:0] txd,
  input txd_strobe,
  output txd_ready,
  output [7:0] rxd,
  output rxd_strobe
);
  // TX
  reg [15:0] tx_cnt = 0;
  reg [3:0] tx_bit = 0;
  reg [9:0] tx_shift = 10'b1111111111;
  reg tx_busy = 0;
  assign serial_txd = tx_shift[0];
  assign txd_ready = !tx_busy;

  always @(posedge clk) begin
    if (reset) begin
      tx_cnt <= 0; tx_bit <= 0; tx_busy <= 0; tx_shift <= 10'b1111111111;
    end else if (!tx_busy && txd_strobe) begin
      tx_shift <= {1'b1, txd, 1'b0};
      tx_busy <= 1;
      tx_cnt <= 0;
      tx_bit <= 0;
    end else if (tx_busy) begin
      tx_cnt <= tx_cnt + 1;
      if (tx_cnt == DIVISOR-1) begin
        tx_cnt <= 0;
        tx_shift <= {1'b1, tx_shift[9:1]};
        tx_bit <= tx_bit + 1;
        if (tx_bit == 9)
          tx_busy <= 0;
      end
    end
  end

  // RX
  reg [15:0] rx_cnt = 0;
  reg [3:0] rx_bit = 0;
  reg [7:0] rx_shift = 0;
  reg rx_reading = 0;
  reg rxd_strobe_reg = 0;
  assign rxd = rx_shift;
  assign rxd_strobe = rxd_strobe_reg;

  reg last_rxd = 1;

  always @(posedge clk) begin
    rxd_strobe_reg <= 0;
    if (reset) begin
      rx_cnt <= 0; rx_bit <= 0; rx_reading <= 0;
      rxd_strobe_reg <= 0;
    end else if (!rx_reading) begin
      // A falling edge on the idle-high line marks the start bit
      if (!serial_rxd && last_rxd) begin
        rx_cnt <= DIVISOR + (DIVISOR >> 1); // Wait a read cycle and a half to start reading
        rx_bit <= 0;
        rx_reading <= 1;
      end
    end else begin
      if (rx_cnt > 0) begin
        rx_cnt <= rx_cnt - 1;
      end else if (rx_bit < 8) begin
        // Sample one data bit; bits arrive least significant first
        rx_cnt <= DIVISOR;
        rx_shift <= {serial_rxd, rx_shift[7:1]};
        rx_bit <= rx_bit + 1;
      end else begin
        // All 8 bits received: signal that rxd holds a full byte
        rx_reading <= 0;
        rxd_strobe_reg <= 1;
      end
    end
    last_rxd <= serial_rxd;
  end
endmodule
endmodule

Fair warning: I attempted a few iterations of having ChatGPT write this UART for me and every time it failed, so I eventually just wrote it myself. Caveat emptor.
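The framing the UART uses on the wire is easy to check in software. This Python sketch mirrors what the transmit shift register loads ({stop bit, data, start bit}, sent starting from the low end) and what the receiver reassembles. The function names are made up, and it models only the bit order, not the timing.

```python
def uart_frame(byte):
    """Bits in the order they appear on the wire:
    start bit (0), 8 data bits least significant first, stop bit (1)."""
    data = [(byte >> i) & 1 for i in range(8)]
    return [0] + data + [1]

def uart_decode(bits):
    """Inverse of uart_frame; checks the start and stop bits."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("not a valid 10-bit frame")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))

frame = uart_frame(ord("A"))
print(frame)                    # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(chr(uart_decode(frame)))  # A
```

When the line is idle it sits at 1, which is why the receiver can treat a falling edge as the start of a new character.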

And with that we have all the pieces in place so we go ahead and run

$ make flash-echo

which should put our echo circuit design onto the board. We'll go ahead and run

$ screen /dev/ttyUSB0 9600

which will open a terminal session that communicates directly over the serial line of the board at 9600 baud. This matches the DIVISOR of 2604 we passed to our UART: dividing the 25 MHz clock by 2604 gives roughly 9600 bits per second, which is how frequently the UART reads in bits. And if all goes according to plan, you'll see the characters you write, written back to you! Exciting
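If you want a different baud rate, the DIVISOR parameter has to change with it. Here is a short Python sketch of the arithmetic, assuming the 25 MHz clock implied by the clk_25mhz input; the function names are made up for illustration.

```python
def uart_divisor(clock_hz, baud):
    """Clock cycles per serial bit, rounded to the nearest whole cycle."""
    return round(clock_hz / baud)

def baud_error_percent(clock_hz, baud, divisor):
    """How far the achieved bit rate is from the requested one, in percent."""
    actual = clock_hz / divisor
    return 100 * (actual - baud) / baud

CLOCK_HZ = 25_000_000  # assumed from the clk_25mhz port name

divisor = uart_divisor(CLOCK_HZ, 9600)
print(divisor)  # 2604
print(f"{baud_error_percent(CLOCK_HZ, 9600, divisor):.4f}%")
```

The rounding error at 9600 baud is well under a hundredth of a percent, far below what a serial link needs to stay in sync over a 10-bit frame.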

Ok well for me it was very exciting, but we'll be getting to some of the meatier stuff in future posts. Until then, I hope you enjoyed this deep dive into hardware tools and feel you have a few more skills in your arsenal to build cool and interesting things. If this didn't work for you right off the bat (how could it not?) I'll leave you with some debugging tips I learned along the way.

Stay curious and stay tuned everyone!

Troubleshooting

Some troubleshooting tips based on my experience:

If your USB cable seems to be working (your board lights up) but your computer doesn't discover the board, consider you may be using a power-only USB cable and need to switch with one that also has data lines (I did not know this was a thing)
Use dmesg -w to monitor kernel logs when connecting your device - this can help you debug any other communication issues between your board and your computer
The READMEs for project trellis and nextpnr (specifically for ecp5) were great for helping me get the toolchain set up. At one point I tried using a container with the toolchain and if you know me you know I'm a big fan of containers, but ultimately abandoned this and went for a host-level installation.
Once you have the toolchain and can flash a design onto the device, but are struggling to get communication working, use the on-board LEDs to diagnose the issue. I had to use these a bunch by setting values for them to see what the board was receiving from the UART. It takes some ingenuity, but by encoding debug messages to yourself by turning lights on and off you can figure out a lot of what's going on inside the board.

Additional Resources

Some things to help you along the way