Writing an Operating System for the 6502

A few months back some colleagues and I were trying to figure out which resource was bounding one of our systems. This is a typical exercise for anyone who has done devops, especially when you’re trying to scale up the system in question. You usually end up concluding that your system is either memory bound, I/O bound, or CPU bound, which got me thinking (uh oh). The first two kinds of bounds are easy enough to understand. A machine’s I/O bus has a limited throughput, which can directly impact, say, how many web requests or DB queries that system can handle per second. Likewise, each of those requests requires some memory, and if the total amount required goes beyond what is available, the machine will start to swap, effectively introducing a new I/O bound.

But the concept of a “CPU bound” was more nebulous. What does it mean for a CPU to be bounded? Doesn’t a CPU just continue to execute autonomously? Isn’t it always just fetching, decoding, and executing instructions? And if so, how exactly does the presence of more or fewer threads affect the amount of time a given request will take to be serviced?

Today we’re going to try to answer these questions, not just in the abstract, but in a way we can feel in our bones. Our project will be a new spin on Ben Eater’s excellent series in which he builds a 6502 computer system from scratch. We’re going to write the beginnings of an operating system for our little 6502. Specifically, we’re going to write a timesliced thread scheduler so that it can do multi-threading despite having only one core.

In Part 1 and Part 2 we will go deep and understand the most primitive operation that allows a computer to do two things at once and how it works. In Part 3 we will go wide, jumping out of this one operation to understand concurrency, parallelism, throughput, and latency more generally. In so doing, it is my hope that we will build a rich intuition for how CPUs handle multiple tasks at once and how their overall throughput is affected by CPU constraints.

Part 1: Assembling the System

The Basic Architecture

Let’s start by understanding our components and how they will interoperate to produce our system. This will be only the briefest run-through of Ben Eater’s series, and where necessary I’ll note where I’ve done things differently.

The 6502 itself is our CPU. It has a 16-bit address bus and an 8-bit data bus, which are the interfaces the chip uses to communicate with RAM, ROM, and I/O. ROM is where we’ll store our actual program, and RAM provides stack space for our running process. To that end we have a memory mapper component (actually just a NAND gate chip) that breaks up our full 16-bit address space into regions for accessing RAM, ROM, and I/O.

What this assembly allows us to do is have all three components (RAM, ROM, and I/O) be active on the same data bus, but only one of them will be enabled at a time depending on which address we’re accessing. The effective memory map we want from this is:

$0000 - $3FFF    RAM
$6000 - $7FFF    I/O (the 6522)
$8000 - $FFFF    ROM

Finally, the 6522 chip will not only act as the interface to our display, but also provide the timer necessary to interrupt the CPU (explained in Part 2). For this to work we have to tie the 6522’s IRQ pin (pin 21) to the 6502’s interrupt request (IRQB) input on pin 4.

This table lays out the full address wiring between the RAM, ROM, I/O, and the 6502 as a quick reference in case you want to reproduce this system.

        WE¯     CE¯ / CS1    OE¯ / CS2¯
RAM     RW¯     a15          a14
ROM     —       a15¯         —
I/O     RW¯     a13          a15¯ a14¯

Some Tricks of the Trade

To make the development loop faster while I was experimenting, I came up with a few tricks that I’ll share here if you want to reproduce these results or do your own experimenting.

First, I found that while assembling a custom clock circuit for the system gave a lot of flexibility, it came at the cost of a lot of set-up and tear-down whenever I wanted to pick this project back up. A cheap pack of square-wave generator modules turned out to be a perfectly good substitute for a variable-speed clock circuit. In fact, looking closely at these I saw they have basically the same assembly (555 timer, potentiometers, and capacitors) as our custom-rolled one.

Next, where possible I would recommend tying an LED from a pin to ground if you want to be able to debug issues between components more easily. I did this with the IRQ line from the 6522 so that I could see exactly when interrupts were supposed to happen (shown in the video above). Remember to put a resistor in series or else you may burn out your LEDs (I did this more than once 😅).

Finally, I realized that the act of taking out and reprogramming the EEPROM in this circuit every time I wanted to test a new program was extremely cumbersome and error-prone. So I worked out a way to simulate both RAM and ROM using my Arduino Mega, no dis/re-assembly necessary. Please reach out if you would like me to do a post on this.

Part 2: Writing the Scheduler

Now that we have our hardware in place it is time to program our CPU to create and jump between two threads. We’ll start by defining various constants: the special addresses at which we will write data, and some of the more common values we will write to them.


ACR = $600B			; 6522 Auxiliary Control Register
DDRA = $6003			; 6522 Data Direction Register for Port A
DDRB = $6002			; 6522 Data Direction Register for Port B
IER = $600E			; 6522 Interrupt Enable Register
IFR = $600D			; 6522 Interrupt Flag Register
PORTA = $6001			; 6522 Register for Port A
PORTB = $6000			; 6522 Register for Port B
T1CH = $6005			; 6522 Timer 1 High Order Counter
T1CL = $6004			; 6522 Timer 1 Low Order Counter

E = %10000000 			; Enable bit for sending instructions to LCD
RW = %01000000			; Read/Write bit for the LCD
RS = %00100000			; Register Select bit for the LCD (data vs. instruction)


I’m also going to define a bunch of procedures which I’ll call the “stdi/o library” of our system. These procedures basically allow you to interface with the system’s display unit without knowing too many of the internal details of how it is wired or even what it is. Just that it can print characters, strings, numbers, &c.


reset_display:	
	lda #%00000001 		; Clear the display
	jsr send_display_instruction

	rts

send_display_instruction:
	sta PORTB
        lda #0			; Clear RS/RW/E bits
        sta PORTA               
        lda #E                  ; Set E bit to send instruction
        sta PORTA
        lda #0			; Clear RS/RW/E bits
        sta PORTA               
	rts
	
print_string:
	jsr reset_display
	stx $0000		; stash the string pointer (low byte in x,
	sty $0001		;  high byte in y) in the zero page
	ldy #0
print_string_next_char:
	lda ($00),Y
        cmp #0                  ; Compare the value to zero
        beq print_string_end
	lda ($00),Y
        jsr print_char
	iny
        jmp print_string_next_char
print_string_end:
        rts

print_char:
        sta PORTB
        lda #RS
        sta PORTA               ; Set RS to select the data register; clear RW/E
        lda #(RS | E)           ; Pulse E high to send the character
        sta PORTA
        lda #RS
        sta PORTA               ; Bring E low again, keeping RS set
        rts

print_num_byte:
	and #$0f		; only print the low nibble; enough for our loop counters
	jsr print_num_nibble
	rts

print_num_nibble:
	cmp #10			; digit (0-9) or letter (a-f)?
	bcc print_num_digit
print_num_letter:
	adc #('a' - 11)		; carry is set, so this adds 'a' - 10
	jsr print_char
	rts
print_num_digit:
	adc #'0'		; carry is clear, so this adds '0'
	jsr print_char
	rts

printf:
	;; we want printing a string to be an atomic operation
	;; so disable interrupts while this executes
	sei
	pha
	jsr print_string
	pla
	jsr print_num_byte
	cli
	rts


You can dive deeper into these functions if you are curious, but they are not essential for understanding the timeslicing work we are doing here. Now on to a bit of the meaty stuff. We are going to define two procedures to initialize our system, one to initialize the display and one to initialize the timer.


init_scheduler_timer:
	lda #%01000000		; timer 1 in continuous mode, no PB7 output
	sta ACR
	lda #%11000000		; set (bit 7) the timer 1 interrupt enable (bit 6)
	sta IER
        lda #%00000000		; low half of the timer 1 counter
        sta T1CL
        lda #%00001000		; high half: bit 11 set, so 2048 cycles per tick
        sta T1CH
	rts

init_display:
        lda #%11111111          ; Set all pins on port B to output
        sta DDRB

        lda #%11100000          ; Set top 3 pins on port A to output
        sta DDRA

        lda #%00111000          ; Set 8-bit mode; 2-line display; 5x8 font
	jsr send_display_instruction

        lda #%00001110          ; Display on; cursor on; blink off
        jsr send_display_instruction

        lda #%00000110          ; Increment and shift cursor; don't shift display
	jsr send_display_instruction

	rts

The timer initialization is critical for getting the timeslicing working properly. We are basically doing three operations here.

  1. Set the second highest bit of the auxiliary control register. This tells the 6522 to run the timer in a continuous loop, firing the IRQ line with a fixed cadence. Importantly, we don’t set the highest bit, as that would make the timer also drive an output on PB7, one of our display lines, and mess with our output. (This was a huge pain to debug!)
  2. Set the highest two bits of the interrupt enable register. Bit 7 means “I am setting bits, not clearing them” and bit 6 means “specifically, set the bit that enables the timer 1 interrupt.”
  3. Finally we’ll set the actual counter for timer 1 by setting both its low half and high half. Setting bit 11 here means we will get an interrupt roughly every 2^11 = 2048 clock cycles (see the quick calculation below).
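To get a feel for how often that is in wall-clock terms, here is a quick back-of-the-envelope calculation in Python. The clock rate is an assumption on my part (ours comes from an adjustable square-wave generator), so treat the numbers as illustrative rather than gospel.

# Rough timeslice math for the 6522 timer configured above.
CLOCK_HZ = 1_000_000           # hypothetical 1 MHz system clock; plug in your own
TIMER_COUNT = 1 << 11          # %00001000 in T1CH sets bit 11 = 2048 cycles

interrupts_per_second = CLOCK_HZ / TIMER_COUNT
timeslice_ms = 1000 * TIMER_COUNT / CLOCK_HZ

print(f"~{interrupts_per_second:.0f} interrupts/sec")   # ~488
print(f"~{timeslice_ms:.2f} ms per timeslice")          # ~2.05 ms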

Now let’s talk about what it will look like when this timer fires.







[Diagram: the timer-driven context switch. When the timer fires while Thread 0 is running, the interrupt handler stores Thread 0’s registers, restores Thread 1’s registers, and returns from the interrupt into Thread 1. When the timer next fires, now in Thread 1, the handler stores Thread 1’s registers, restores Thread 0’s registers, and returns into Thread 0.]
Our CPU will be executing along in one of its two threads, thread 0 or thread 1. Our timer fires and this tells our CPU to pause whatever it is doing and start executing from the processor’s interrupt vector. From this location we have programmed the CPU to store all the context necessary to resume execution of our current thread in memory, restore the context of the other thread, and then resume as if it were on the other thread all along. The assembly for these operations looks like this.


	.org $8010
nmi:
	;; disable interrupts until we are done handling this one
	sei

	;; push a as we are about to overwrite it
	;; pull it back when storing the thread's context
	pha

	;; Reading these 6522 registers (T1CL in particular) clears the
	;; timer 1 interrupt flag and releases the IRQ line
	lda IFR
	lda IER
	lda T1CL
	
	lda $0002		; which thread are we currently running?
	cmp #0

	beq switch_to_thread_1	; thread 0 is running, so switch to thread 1

switch_to_thread_0:
	lda #0
	sta $0002		; we are now running thread 0

	pla			; recover the interrupted a register

	sta $0007		; save thread 1's a, x, y, and sp
	stx $0008
	sty $0009
	tsx
	stx $000a

	ldx $0006		; restore thread 0's sp, a, x, and y
	txs
	lda $0003
	ldx $0004
	ldy $0005

	jmp nmi_exit

switch_to_thread_1:
	lda #1
	sta $0002		; we are now running thread 1

	pla			; recover the interrupted a register
	sta $0003		; save thread 0's a, x, y, and sp
	stx $0004
	sty $0005
	tsx
	stx $0006

	ldx $000a		; restore thread 1's sp, a, x, and y
	txs
	lda $0007
	ldx $0008
	ldy $0009

	jmp nmi_exit

nmi_exit:
	;; re-enable interrupts
	cli
	rti

Note that for our purposes we have used a bunch of fixed memory locations to let our scheduler store and restore context. A real kernel would have its own dedicated stack and heap space so that it could build an arbitrary number of these structures dynamically. We also use the directive .org $8010 to tell our assembler to put this routine at memory location 0x8010, which we will later reference in our interrupt vector. The table below lays out these locations, and the short sketch after it mirrors the same store/restore bookkeeping in Python.

Memory Addresses     Function
0x0002               ID of the thread we are currently executing
0x0003 - 0x0006      Context (registers a, x, y, and sp) for thread 0
0x0007 - 0x000a      Context (registers a, x, y, and sp) for thread 1
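If the register shuffling in the handler is hard to follow, here is the same bookkeeping as a toy Python model. The names (CPU, contexts, current) are mine, purely for illustration; on the real machine this state lives at the fixed addresses in the table above.

from dataclasses import dataclass

@dataclass
class CPU:
    a: int = 0
    x: int = 0
    y: int = 0
    sp: int = 0xFF

# contexts[i] plays the role of 0x0003-0x0006 / 0x0007-0x000a,
# and current plays the role of 0x0002.
contexts = [
    {"a": 0, "x": 0, "y": 0, "sp": 0xFF},   # thread 0
    {"a": 0, "x": 0, "y": 0, "sp": 0x7D},   # thread 1 (fake rti frame already pushed)
]
current = 0

def on_timer_interrupt(cpu: CPU) -> None:
    """Store the running thread's registers, restore the other thread's."""
    global current
    other = 1 - current
    contexts[current] = {"a": cpu.a, "x": cpu.x, "y": cpu.y, "sp": cpu.sp}
    cpu.a, cpu.x, cpu.y, cpu.sp = (
        contexts[other]["a"], contexts[other]["x"],
        contexts[other]["y"], contexts[other]["sp"],
    )
    current = other
    # the rti that follows resumes wherever `other` left off, on its own stack

cpu = CPU()
on_timer_interrupt(cpu)   # now "running" thread 1
on_timer_interrupt(cpu)   # and back to thread 0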

With these foundations in place, our logic for actually creating and running our threads should be straightforward.


thread_0_string:
        .asciiz "Thread 0! i="		; null-terminated; print_string stops at the 0

thread_1_string:
        .asciiz "Thread 1! i="


main:
	lda #0
	sta $0002		; start out running thread 0

	ldx #$80
	txs			; point the stack at thread 1's half of the page

	; set up the top stack frame for thread 1 with
	; return address and status register
	lda #((thread_1 >> 8) & 255)
	pha
	lda #(thread_1 & 255)
	pha
	php

	tsx
	stx $000a		; save thread 1's initial stack pointer

	ldx #$ff
	txs			; now give thread 0 the top half of the page
	
thread_0:
	lda #0
loop_0:
	pha
	ldx #(<thread_0_string)
	ldy #(>thread_0_string)
        jsr printf
	pla
	clc			; make sure we add exactly one
	adc #1
	jmp loop_0


thread_1:
	lda #0
loop_1:
	pha
	ldx #(<thread_1_string)
	ldy #(>thread_1_string)
	jsr printf
	pla
	clc			; make sure we add exactly one
	adc #1
	jmp loop_1

Observe that each thread runs in a continuous loop using the a register to keep track of its loop index. On each iteration it runs printf to output its unique message combined with its loop index. One bit that needs some explaining is the set-up for our threads that happens in main. To ensure that the two threads don’t overlap with each other we’ve allocated different portions of our memory to each one.

Our 6502 uses the 0x01xx page for stack memory, so our stack pointer register just remembers where in that page we are (the low xx byte). For our purposes we divide that page in half so that 0x0181 - 0x01ff is thread 0’s stack and 0x0100 - 0x0180 is thread 1’s. To accomplish this we point the stack at 0x0180 while building thread 1’s initial frame, and the resulting stack pointer is what gets saved at 0x000a (remember our table above?).

We also need to set up this context so that the 6502 can treat it like a stack when returning from the interrupt. Fortunately this is pretty simple. When the 6502 enters the interrupt vector it stores only the three values necessary to rti (return from interrupt): the upper and lower halves of the return address plus the status register, which is exactly what we push onto thread 1’s stack here.
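To make that concrete, here is a little Python sketch (purely illustrative) of what thread 1’s stack looks like right after main builds its fake interrupt frame. The entry-point address is a made-up placeholder; everything else follows from the 6502’s $01xx stack page and our starting pointer of $80.

# Thread 1's stack right after main pushes its fake interrupt frame.
# The 6502 stack lives in page $01 and grows downward: a push writes to
# $0100 + SP and then decrements SP.
THREAD_1_ENTRY = 0x9000        # hypothetical address of the thread_1 label

stack = {}
sp = 0x80

def push(byte):
    global sp
    stack[0x0100 + sp] = byte & 0xFF
    sp -= 1

push(THREAD_1_ENTRY >> 8)      # pha: high byte of the return address -> $0180
push(THREAD_1_ENTRY & 0xFF)    # pha: low byte of the return address  -> $017F
push(0b00110100)               # php: whatever the status register holds -> $017E

print(f"saved stack pointer for thread 1 = ${sp:02x}")    # $7d, what lands at $000a
# rti later pulls the status register and then the return address back off.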

Finally, we have all the pieces to make our processor run and just need to call them in the right order when our processor starts from its reset vector.


        .org $8000
reset:
	sei
	jsr init_display
	jsr init_scheduler_timer
	cli

	jmp main

Et Voilà! We have just written a very basic scheduler that will preëmptively switch between two threads of execution on our processor!

Part 3: Understanding CPU Bounds

Let’s take a moment to pause and reflect on what we have done here. We have taken a single-core processor, with a single fetch-decode-execute loop, and gotten it to perform two tasks concurrently. That is pretty neat! Out of this simple primitive operation almost all of modern computing has been built. It is what lets your desktop operating system run both Chrome and Spotify at the same time and switch seamlessly between them. It is also what lets servers run both a web daemon and auxiliary services like containerd.

What’s more, it is an instance of a more general phenomenon called multiplexing, where we take any component and make it behave as if it were n independent components, each with 1/n times as many resources. Our multi-threading is an archetypal example of multiplexing, but there are dozens more. With our archetype fresh in our minds we are going to spend this last section understanding the phenomenon of multiplexing more generally and a bit more theoretically.

The Math of System QoS (Interlude)

Imagine you’re a barista at a coffee shop. Your job is to service customers which means doing three things: taking their orders, making their drinks, and bringing the drinks to the customers. When things are slow you might do all three jobs yourself in sequence. If a second customer happens to come while you’re making the first customer’s order, they will wait in line.

Now imagine your shop becomes very popular and people start lining up in dozens to get their coffee. How would you measure and improve your quality of service so that people aren’t needlessly waiting and complaining? An obvious first step would be to hire two friends and split up the work between you. One person can stand at the register, taking orders and charging for them while the second person makes the orders and the third brings them.

The reason it helps to add more people here is that each of these three “stations” is currently idle for 2/3 of the time while you’re busy doing one of the other two tasks. If you have a friend at the coffee machine while you stand at the register, you can start to take the next customer’s order while your friend starts making the drink for the first, and so on.

What we have just described is a form of concurrency known as pipelining, where a service request goes through distinct stages, each of which can be occupied by one request before that request moves on to the next stage. If we look at the latency of servicing a request, that is, the end-to-end time it takes to service a request, nothing has changed. However, our throughput, or requests serviced per second, increases.

Next imagine this still isn’t enough. We have customers lining up and getting mad because they can’t get coffee in time. So what do we do? Well, we can’t just throw more people on top of the stations we have, because the stations are not idle anymore. So instead we duplicate our whole set-up. We add a second register, a second coffee machine, and hire three more friends to operate the stations and service customers. You might have already guessed that we call this form of set-up parallelism, because we now have two parallel streams of requests being serviced.

Here we can see a visual distinction of concurrency vs parallelism. Concurrent requests follow the same stream through the same resources (literally “with the same current”), whereas parallel requests follow parallel streams. So if we want to relate all of our variables so far, we have

throughput = (concurrency × parallelism) / latency

In other words, if it takes one minute to service a customer end-to-end and we have two three-stage pipelines, then we can service six requests per minute total when we pipeline and parallelize. But there’s a catch! What if we didn’t perfectly divide up the work? What if it takes twice as long to deliver a drink as it does to take or make the order?

From this fact it follows that our throughput will be lower than we predicted, because the cashier and barista will have to slow down to not overwhelm the poor runner, who can only service two requests per minute, not three. So in fact we see that

throughput = parallelism / bottleneck

where the bottleneck is the latency of the slowest stage in the pipeline. Thus our throughput is only four drinks per minute. For homogeneous concurrent systems, though, there is no one bottleneck, so our previous equation still holds (see yet another example below if you want to dive into this). We may look at our results here and scoff. “Bah! This is easy,” we say, “we just need to double the number of runners.” Doing so would indeed solve our throughput issue. Or! What if we discover customers would actually prefer to get their drinks quicker and don’t mind waiting at the counter to do so? Now the runner only has to go half the distance and can service twice as many requests, so we achieve our ideal throughput of six per minute. What’s more, we improved throughput by cutting down latency, which shows that the two quantities are related and not independent.
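Here are the coffee-shop numbers worked out explicitly in Python, using the two formulas above (the figures are just the ones from the story, nothing measured):

# One three-stage pipeline (register, coffee machine, runner), duplicated twice,
# with a one-minute end-to-end latency per customer.
concurrency = 3          # stages per pipeline
parallelism = 2          # duplicated set-ups
latency_min = 1.0        # end-to-end minutes per customer

ideal = concurrency * parallelism / latency_min
print(f"ideal throughput: {ideal:.0f} customers/min")       # 6

# The unbalanced case: delivery takes twice as long as the other stages,
# so the runner is the bottleneck at 0.5 minutes per customer.
bottleneck_min = 0.5
actual = parallelism / bottleneck_min
print(f"actual throughput: {actual:.0f} customers/min")     # 4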

By now, you’ve probably rightly guessed that we’re not just talking about drinks. Our stations could be the stages of a CPU pipeline, or of a data pipeline, or the tiers of a multi-layer service architecture. Let’s explore one last example before we break from our analogy.

Finally imagine that we don’t get the advantage of whole number ratios for our latencies. What if it takes our barista 1.5 times as long to make the drinks as it does to take the order or deliver the drink?

Is the OCD optimizer in you tensing up in frustration here? Never fear, for here we can get a little creative and start mixing up our stations. What we do is have three coffee-machine stations for our two cashiers and two runners.

Now seven friends cooperate and hum in perfect sync as the cashiers and runners round-robin between the baristas. Or if you prefer, three layers of a service-oriented architecture such as a load-balancer, business layer, and database layer, all operate with perfect efficiency.

Back to CPUs

It is here that we return to our CPU and the requests it is trying to service. Consider that each request requires some total of n instructions to be executed in order to service it. In fact we can quantify this pretty easily using CPU profiling tools. If we write a basic Python program to service HTTP requests

from http.server import HTTPServer, BaseHTTPRequestHandler


class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()  # finish the (empty) response


def main():
    port = ("localhost", 8080)
    print(f"Listening on {port}")
    server = HTTPServer(port, RequestHandler)
    server.serve_forever()


if __name__ == "__main__":
    main()

then we can run it with a Linux profiler (sudo perf stat) to get all sorts of useful data

 Performance counter stats for 'python3 preempt_benchmark_daemon.py':

            142.72 msec task-clock                #    0.034 CPUs utilized
             1,000      context-switches          #    7.007 K/sec
                 4      cpu-migrations            #   28.028 /sec
             2,221      page-faults               #   15.562 K/sec
       526,060,434      cycles                    #    3.686 GHz                      (83.06%)
        28,064,942      stalled-cycles-frontend   #    5.33% frontend cycles idle     (82.61%)
        77,189,115      stalled-cycles-backend    #   14.67% backend cycles idle      (82.30%)
       573,715,890      instructions              #    1.09  insn per cycle
                                                  #    0.13  stalled cycles per insn  (82.03%)
       124,632,929      branches                  #  873.298 M/sec                    (86.63%)
         4,389,381      branch-misses             #    3.52% of all branches          (83.36%)

       4.164917090 seconds time elapsed

       0.115070000 seconds user
       0.028767000 seconds sys

Running it a few times with a few different sizes of request load (ab -n 10000 -c 20 "localhost:8080/"), we can also account for the overhead of starting the language runtime

Number of Requests    Number of Instructions
 1,000                570M
 2,000                950M
10,000                4B
20,000                8B
50,000                20B

and determine that it takes some 400k instructions to service a single request. We can think of cycles executing these instructions as resources available to our CPU, much like time making drinks is a resource available to our baristas which they both allocate towards their respective queues of requests. The difference with our CPU is the addition of pre-emption. That is, as we are progressing through the thread of a given request, that thread can be interrupted to make progress on another thread.
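If you want to sanity-check that ~400k figure, here is roughly how I would back it out of the table above with a least-squares fit, so the runtime’s fixed start-up cost falls out as the intercept. A sketch only; your exact numbers will wobble from run to run.

# Back out instructions-per-request from the measurements above with a
# simple least-squares line: instructions ≈ slope * requests + startup_cost.
import numpy as np

requests     = np.array([1_000, 2_000, 10_000, 20_000, 50_000])
instructions = np.array([570e6, 950e6, 4e9, 8e9, 20e9])

slope, intercept = np.polyfit(requests, instructions, 1)
print(f"~{slope / 1e3:.0f}k instructions per request")        # roughly 400k
print(f"~{intercept / 1e6:.0f}M instructions of start-up overhead")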

Pre-emption mainly exists to cut back on the latency experienced by shorter programs or ones with user I/O. Otherwise, our CPU could just construct a queue of requests and work through them in sequence. Indeed, in the long long ago, this is how computers worked.

Here we can convince ourselves of an interesting observation: multi-threading does not affect throughput. That is, no matter how many active threads we have or how frequently we switch between them (setting aside the overhead of context switching), our requests/sec stays the same. Take a moment to convince yourself of this, and notice that our instructions/request remains fixed and our CPU’s instructions/second remains fixed.

In our example it takes 400k instructions to service a request and our CPU can execute 4B instructions per second. So this works out to 10k requests per second. If we have a spike of 40k requests in the queue and are round-robining between our full set of requests once per second, it will take 4s before any one of them is fully serviced, but all of them will be serviced in those 4s. So it’s all about latency!
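Here is that arithmetic spelled out, using the same illustrative numbers as above:

# Throughput is fixed by instructions/request and instructions/second;
# scheduling only moves latency around.
INSTRUCTIONS_PER_REQUEST = 400_000
CPU_IPS = 4_000_000_000            # 4B instructions per second

throughput = CPU_IPS / INSTRUCTIONS_PER_REQUEST
print(f"{throughput:,.0f} requests/sec")                     # 10,000

queued = 40_000                    # a sudden spike of requests
# Round-robining across the whole backlog means every request finishes at
# roughly the same time: once the backlog's worth of work has been done.
time_to_drain_s = queued / throughput
print(f"all {queued:,} requests finish after ~{time_to_drain_s:.0f}s")   # ~4s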

Or more specifically it’s about evening out latency. Remember our example of the two threads running on the 6502 and how each one “wants” to increment its counter and show the result on the display. The longer we make our interrupt period, the more of a head start thread 0 gets to increment further before we switch to thread 1, but then thread 1 gets to catch up.

As we add more threads, we increase the time it takes any given one to increment and display, but the total number of increments and displays per second remains the same. You might think of it as thread 0’s instructions-per-second budget getting smaller, because the total IPS is fixed and it’s divided over a larger pool of threads.

So, what then is CPU bounding?! Well, as we’ve calculated, our system can handle 10,000 requests per second. So what happens if it gets 10,001? Our CPU will keep chugging along, servicing requests and context switching between them, but all the while our queue of active requests is growing, because input is greater than throughput. As this happens we see latency grow, because our requests spend more and more time queued before being serviced, until finally (hopefully) the load balancers and clients that sit in front of our service give up and time out the request. More realistically, our clients then begin retrying their requests, compounding the problem. This nightmare of a phenomenon is what we call resource saturation in general, or CPU bounding in particular. If you’ve stuck it through this far in the post, firstly I applaud you, and secondly I hope you come away with a deep and intuitive understanding of how this phenomenon happens.

Modern CPUs and Concurrency

I would be remiss if I didn’t explain some caveats to everything we’ve discussed. The modern paradigm for building computer systems has gone through a ton of advancements since our little 6502 started chugging away in Apple IIs and C64s, but the principles all remain the same.

Remember how we made our first improvement to our coffee-shop set-up by taking resources that were previously idle and staffing them to create a pipeline of operations? The same principle applies here, at many levels in fact, because our CPU makes use of a literal hardware pipeline that works the same way. But, as they say, “as above, so below.”

Firstly, modern CPUs are able to enter low-power idle states in which they not only do no work but also consume less power and dissipate less heat. A CPU enters such a state when none of its threads has meaningful work to do (e.g. they’re sleeping or waiting on I/O) and remains there until interrupted by a timer or a piece of I/O hardware. If you ever run htop and see a percent utilization for your CPU, that percentage basically means how much of its time your CPU is spending not idle.

Secondly, the idea that we need to spin up a whole new thread, or worse a whole process with a separate memory space, for each web request that comes in is pretty outdated. Starting in the 90s and throughout the 2000s it was common to take this approach with Common Gateway Interface (CGI) scripts, but as demands on web services increased, this paradigm was supplanted by something more efficient.

Enter green threads. As the name implies, green threads are like threads, except they are created and managed in user space by the process itself rather than by the kernel, and so are lighter weight. Importantly, green threads are non-preemptive. They work by having an outer loop check (using the select syscall) whether any of the green threads have work to do, and running each thread until, by its own agency, the thread yields control back to the scheduler.
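Here is a minimal sketch of that idea in Python, using generators as the green threads and select as the outer loop’s readiness check. Real runtimes (asyncio, gevent, Go’s scheduler) are far more involved; this is just the shape of the thing.

import select

def echo_thread(sock):
    """A 'green thread': it yields the socket it is waiting on."""
    while True:
        yield sock                      # nothing to do until sock is readable
        data = sock.recv(1024)          # should not block now
        if not data:
            return
        sock.sendall(data)

def run(threads):
    """The outer loop: ask select() who has work, then resume them."""
    waiting = {}                        # socket -> generator blocked on it
    for t in threads:
        waiting[next(t)] = t            # prime each thread up to its first yield
    while waiting:
        readable, _, _ = select.select(list(waiting), [], [])
        for sock in readable:
            t = waiting.pop(sock)
            try:
                waiting[t.send(None)] = t    # run until it yields again
            except StopIteration:
                pass                         # the thread finished; drop it

Notice that nothing here ever preempts a thread; a single echo_thread that loops without yielding would starve every other one, which is exactly the trade-off described next.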

This approach works just fine (usually) because all the green threads are part of the same system and are designed to cooperate with each other. But notice that if a malicious author got hold of the system, they could spin up one infinitely looping green thread per CPU core and it would take down the whole system 😱 An interesting tidbit here: the original UNIX kernel, from which modern Linux and Apple systems descend, was itself non-preemptive (code running in the kernel ran until it yielded). Goroutines, from the Go language Ken Thompson co-created, originally followed the same cooperative model, in combination with the idea of communicating sequential processes.

Finally, in passing I will note an assortment of advancements in hardware architecture itself which have emerged to improve CPU throughput. Obviously, CPUs these days don’t have a single core anymore, but multiple cores that can execute in parallel. What isn’t obvious is the effect this can have on your IPS. If two cores write to the same memory address, each write invalidates that line in the other core’s cache, which adds latency the next time that core reads the address. As mentioned before, CPUs are built with hardware pipelining, like our coffee shop. So the CPU may start fetching a memory address while it does other computations in its pipeline, but if the fetch takes longer than expected, the pipeline will stall. Similarly, modern CPUs have branch prediction, which lets them speculatively start executing future instructions assuming a branching operation goes one way. If the CPU guesses wrong, though, all of those computations are invalid and have to be flushed.

I hope to cover each of these topics in detail in future posts so let me know which ones interest you the most! Until then…

Conclusions

It has been a journey! In this post we wrote a basic scheduler for an ancient piece of hardware and, in so doing, built a mental model for how process scheduling works at the most basic level. From there we emerged, gasping for air, and saw how our basic primitive generalizes to a bunch of situations, along with the math that describes those situations. I hope you now feel CPU scheduling as a phenomenon in your bones! The next time you have to analyze a system and see that it is CPU bound, perhaps you will think of our 6502 and have the tools necessary to apply the right solution to improving your system QoS. Until next time!

Appendix

Yet Another Example

Still confused about concurrency vs pipelining? Is there really a difference? We have said that pipelining is a type of concurrency for heterogeneous systems, those with multiple different kinds of stages, so what is an example of a homogeneous system?

Imagine we lay a fiber line across the Atlantic, not inconceivable as this is how most internet traffic travels across the ocean. We communicate information through this medium by sending bursts of light, photons, across it. Setting aside concerns that a photon isn’t localized in space until it interacts with matter, as we discussed in Quantum Supremacy, we can imagine we send a photon for yes and no photon for no. Do we have to wait for the photon to travel the whole distance of the cable before sending its successor?

Goodness no! What a waste of resources that would be. To put it in the terms we have been using, at any given time the photon is only interacting with a small portion of the line while the rest of the line is idle. So we can fill up those idle bits by sending photons at the full throughput our line can support. That is concurrency because our photons go with the same current.

Can you guess what parallelism is? A fiber optic cable usually has more than one fiber strand. It will have many in parallel which can each independently carry a stream of photons.

A Personal Anecdote

Way back at the start of my career, one of the first production systems I worked on provided CI as a service for developers. We built this thing out with the zest and zeal of working at a start-up and shipped it with great relief… only to discover that when we hit a certain threshold of requests per second our latency would spike up to a minute!

No problem, we thought, something must be bottlenecking it so we just have to figure out what it is and throw more hardware at it. No dice! There were no CPU bounds, no I/O bounds, no memory bounds, no nothing bounds. Not a component in the system was saturated so wtf was going on?! It turns out we had a critical piece of code that was locking the same set of rows in our database on every request. Yikes.

What did this locking do? Can you imagine the scenario based on the conditions for higher system throughput we have been discussing? We increase throughput by interleaving operations through a component that otherwise remains idle, even though introducing this sort of concurrency has no effect on latency. In the case of a database, the DBMS has to write a given mutating query to disk before it can tell the client that the query succeeded (see this previous post).

Using a standard rotating-platter disk drive, this operation could take 10ms, which is an eternity for computers. It usually doesn’t matter, because your OS or the device itself can queue up requests concurrently and find an efficient path to write blocks to disk, increasing throughput. However, if you are locking the row on the client side until your transaction is done, then it’s like you’re sending one photon at a time down your fiber line, waiting for the response, delivering it, and then saying, “Ok. Let me move on to the next client. Wait, why has my client been waiting so long?”
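A rough bit of arithmetic shows how badly that hurts. The numbers here are illustrative (a ~10ms flush on a spinning disk and the 10,000 requests/sec CPU capacity from earlier), not measurements from the actual incident:

# Holding the lock across every disk flush serializes all requests behind the disk.
fsync_ms = 10.0                     # one rotating-platter write before the lock is released
locked_throughput = 1000 / fsync_ms
print(f"~{locked_throughput:.0f} requests/sec with the lock held")   # ~100

cpu_capacity = 10_000               # requests/sec the CPU alone could service
print(f"the CPU could have handled {cpu_capacity:,}/sec")            # nothing is saturated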

We might call this scenario “idle bounding” where none of our resources is saturated but we unnecessarily do operations in sequence that could be done concurrently. Thinking back to ourselves as lone baristas in our nascent coffee shop, idle bounding is like we never improved QoS at all!