Here’s a fun question you can use at parties:
Can you think of a word with 5 consecutive vowels?
If you have really interesting friends, you might get some exotic answers like ‘zoaeae’ or ‘cooeeing’ or ‘jussieuean’. Any Industrial Engineers will quickly shout out “Queueing”!
So, what is Queueing Theory? It is the study of waiting lines. As in, I go to the grocery store and wait in line to check out. That kind of waiting line. If you thought waiting in line was boring, then I bet you’ll be thrilled with an in-depth analysis of waiting in line! (Spoiler alert – it’s way more interesting than actually waiting in line 😀 )
When you want to apply queueing theory to study a system, you model your system in terms of Queues and Servers. A queue is a list of work items to do. A server is a thing that does the work. In a grocery store checkout line, each customer is a work item, and the cashier is the server. In computing, a queue may be a list of CPU tasks, and the server is a processor core. Or in the context of a web app, a queue may be a list of requests and the server is… well, the web server.
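To make the mapping concrete, here is a minimal sketch in Python (the names, like `checkout_line` and `cashier`, are invented for the example): the queue holds the work items, and the server pulls them off one at a time and does the work.

```python
import queue

# The queue: a list of work items, here customers waiting to check out.
checkout_line = queue.Queue()
for name in ["Alice", "Bob", "Carol"]:
    checkout_line.put(name)

# The server: the cashier, which takes one work item at a time and does the work.
def cashier(line):
    while not line.empty():
        customer = line.get()
        print(f"checking out {customer}")

cashier(checkout_line)
```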
Modeling a Software System as a Queueing System
Queueing Theory and Computing have a rich shared history. The Erlang programming language gets its name from the inventor of Queueing Theory, Agner Krarup Erlang. Queueing Theory is applied to many areas of computing, particularly to scheduling.
When you model your Software System in terms of Queues and Servers, it can lead to clearer designs and make it easier to reason about the performance of your system. Why?
Explicit is better than implicit.
The Zen of Python
It is very easy for the work your software actually performs to get lost in a sea of code. It is common to leave many scheduling decisions up to frameworks or language runtimes. When you do that, many important decisions that affect system performance and user experience are implicit in the system. And not just low-level scheduling decisions, but business-level scheduling decisions. Who gets served first? How long do they wait for a response? What do we do when their waiting time is in the 90th percentile or the 99th? We want to make those decisions explicit so that we can control them.
When we don’t control our systems, they control us.
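One way to make a business-level scheduling decision explicit is a priority queue. Here is a minimal Python sketch (the tiers and requests are invented for the example): premium requests are served before free-tier ones, and that policy now lives in code you can see, test, and change.

```python
import heapq

# Each entry is (priority, arrival_order, request). Lower priority numbers
# are served first; arrival_order breaks ties so equal priorities stay FIFO.
work_queue = []
requests = [("free", "GET /report"), ("premium", "GET /dashboard"), ("free", "GET /export")]
for order, (tier, request) in enumerate(requests):
    priority = 0 if tier == "premium" else 1  # the explicit business decision
    heapq.heappush(work_queue, (priority, order, request))

while work_queue:
    priority, order, request = heapq.heappop(work_queue)
    print(f"serving {request} (priority {priority})")
```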
Before we go too far, a note: many software developers are familiar with queues in the context of data structures. A queue in a queueing model is not the same as the queue data structure. The data structure is concerned with efficient storage and access. You can think of the data structure as a particular implementation of an abstract queue, which can often be useful when actually building your software system based on a queueing model.
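For example, Python’s `collections.deque` is one concrete implementation of the abstract FIFO queue that appears in a queueing model:

```python
from collections import deque

# The data structure: efficient appends and pops at both ends.
tasks = deque()
tasks.append("job 1")    # enqueue at the tail
tasks.append("job 2")
first = tasks.popleft()  # dequeue from the head: first in, first out
print(first)             # job 1
```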
Queues and Servers in your Computer
At a fundamental level, the queueing model notion of a ‘server’ in your computer is an individual processor core. When you write a computer program, all the work it does will ultimately be done on a processor core. Your program gets compiled down to machine code, and those processor instructions are executed by the processor cores. The computer I am writing this on has 4 processor cores, so in a queueing model, we would say there are 4 servers available for any programs running on my computer. This is the fundamental basis for thinking about your software system as a queueing system. Everything else is abstractions built on top of that.
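You can ask your machine how many of these servers it has. In Python, for example:

```python
import os

# Each processor core is one "server" in the queueing model of this machine.
print(f"Servers available: {os.cpu_count()} processor cores")
```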
Unfortunately, when you write software, you typically aren’t programming against that model. You are programming on top of a language runtime and on top of an operating system. You do not have direct control over processor cores. You can’t program instructions like “do this work on core 3, do that work on core 4”. The operating system handles scheduling work on the processor cores. Your language runtime may have its own set of scheduling rules on top of that. Your programming language may or may not give you any interface to scheduling. This is an example of the principle of Information Hiding, and an example of Declarative Programming. Hiding the details of how work actually gets scheduled on your processors is a design decision with trade-offs.
There are huge benefits to this. You can write high-level descriptions of what your program needs to do without needing to think about low-level details. The actual work of scheduling your code to run is handled for you. You can say “read this line from this file, extract this data, and write it to this other file”, and the OS and language runtime can figure out all the thorny details to make that happen. The OS can prioritize important system operations that need to happen so your computer runs in good order and doesn’t crash while your program is happily computing the Fibonacci sequence.
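Here is what that high-level description looks like in Python (the file names and contents are invented, and created first so the sketch runs on its own); notice that nothing in it says which core runs the work, or when:

```python
# Hypothetical input, written here so the sketch is self-contained.
with open("input.txt", "w") as f:
    f.write("alice,42\nbob,7\n")

# Read each line, extract a field, write it to another file.
# The OS and language runtime decide where and when this actually runs.
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        name = line.split(",")[0]  # extract this data
        dst.write(name + "\n")     # write it to this other file
```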
There are also downsides. It can be challenging to write very fast programs. If you don’t have direct control over how work gets scheduled, then you are at the mercy of someone else’s scheduling decisions. Your program isn’t the only thing running on your computer. There are a lot of other programs running at any given time. The operating system goes back and forth between them so every program gets a chance to run its instructions on the CPU. But if I have 8 processor cores, maybe I could say, “Alright OS, you get 7 for whatever you want. Give me 1 just for my program and nothing else.” You can do something like this with Thread Pinning. But what if this were just part of the interface provided by my high-level programming language?
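Some languages do expose a bit of this. On Linux, for instance, Python can pin the current process to a specific set of cores with `os.sched_setaffinity` (a Linux-only call, and core IDs are machine-specific, so treat this as a sketch):

```python
import os

# Linux-only: restrict this process to core 0, leaving the remaining
# cores for the OS and everything else. Core IDs vary by machine.
os.sched_setaffinity(0, {0})  # pid 0 means "the current process"
print(f"Allowed cores: {os.sched_getaffinity(0)}")
```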
Most programs don’t need the level of performance that would require you to get into the details of how the OS is scheduling work. However, it is very common for software developers to use (and misuse) threads. Threads are an abstraction that represents the idea of parallel work. If you do work on 2 separate threads, then you can do work twice as fast, because that work happens in parallel. Well… not quite.
At a fundamental level, the thing doing the work is the processor core.
Threads don’t do work. Processor cores do work. Threads get scheduled on processor cores. They can take advantage of multiple cores, but it’s not a straightforward thread-per-core situation. If I run 1000 threads on 1 core, my program is not going to run 1000 times faster. If my program is doing computational work, and not I/O, it would be faster to just run 1 thread. With 2 threads, the core spends time switching between them instead of just doing the computations. If my program is doing a lot of I/O, I can get some performance benefits out of having more threads than cores. At some point, adding more threads degrades performance, and how many threads to use depends on the details of your program and computing environment.
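You can measure the switching cost yourself. In this toy benchmark the total amount of computation is fixed and there is no I/O, so extra threads add overhead without adding speed (in CPython the GIL makes the effect even more pronounced, since pure-Python computation can’t run in parallel at all):

```python
import threading
import time

def spin(n):
    # Pure computation, no I/O: extra threads cannot help here.
    total = 0
    for i in range(n):
        total += i

def timed(num_threads, work=10_000_000):
    threads = [threading.Thread(target=spin, args=(work // num_threads,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

for n in (1, 2, 8, 100):
    print(f"{n:>3} thread(s): {timed(n):.2f}s")
```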
Let’s bring this back to queueing models. Queueing theory is about modeling your system in terms of queues and servers so that you can optimize the performance of the system. Even though you may not be modeling your program directly in terms of the processor core and its work queue, thinking about your computer as a queueing system, and thinking about your program in that context, yields valuable insights about how to write your code. In the case of thread usage, you may want explicit control over how many threads your program creates, and what kinds of work are done on those threads. This may lead you to choose a different concurrency model, like an Actor Model. When writing a web server application, you may want to use a non-blocking event model, like Node.js, instead of a thread-per-request model. You may want to separate the computational parts of your code from the I/O parts of your code so that you can use different parallelization policies, using patterns like Functional Core, Imperative Shell. The LMAX Architecture takes advantage of that kind of separation to yield a high-performance application.
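As a sketch of that separation (the function names and file paths are invented for the example): the computation stays a pure function, and a thin shell owns all the I/O along with the decision about how much parallelism each kind of work gets.

```python
from concurrent.futures import ThreadPoolExecutor

# Functional core: pure computation, no I/O, easy to test in isolation.
def summarize(text):
    return f"{len(text.split())} words"

def read_file(path):
    with open(path) as f:
        return f.read()

# Imperative shell: owns the I/O and the parallelization policy.
def run(paths):
    # A generous thread pool for the I/O-bound reads...
    with ThreadPoolExecutor(max_workers=16) as io_pool:
        texts = list(io_pool.map(read_file, paths))
    # ...and no threads at all for the CPU-bound computation.
    for path, text in zip(paths, texts):
        print(path, summarize(text))
```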
Conclusion
A computer system is just another system for doing work. We want it to be efficient, just like any other system.
Better mental models lead to better systems. Building better systems starts with using better models.