Throughout this series, we’re going to be implementing a BF compiler. This chapter introduces you to both a high-level overview of a compiler and to the BF language. As a result, we won’t be learning anything about C++ in this chapter (but everything we learn will be relevant to the next chapter). It’s okay to not get this chapter on the first go: we’ll be implementing the compiler over a great many chapters, so feel free to refer to this chapter as often as necessary.
Compilers are a fun¹ technology to implement and require a wide variety of data structures and algorithms. Although we won’t be implementing part of the compiler in every chapter, we will be using it as a common and practical motivation, so that you can observe how features are used in a context that isn’t contrived and that needs some design.
Objective (you will develop) | Outcome (you will be able to) |
---|---|
an understanding of how a compiler translates source code into target code | |
an understanding of the rules of the BF programming language | |
Thanks to Tom Kunc, Janet Cobb, David Ledger, Vagrant Gautam, Chandler Carruth, killerbee13, Kate Gregory, Luke D’Alessandro, Corentin Jabot, and John Filleau for their feedback on this chapter.
You may recall from Chapter 2 that a compiler is a program that translates a program written in one language into an equivalent program in another language. Compilers are typically composed of three parts: a front-end, an optional middle-end, and a back-end. Applied Modern C++ isn’t a book on compilers, so we’ll sadly be skipping over a lot of interesting compiler theory. This chapter will give you an overview of these three parts, and then we’ll go into more detail about each part—along with a design—prior to implementing them in C++.
A compiler must preserve the meaning of its input, otherwise it wouldn’t make for a useful tool.
The front-end is responsible for reading the input (called the source program), understanding its meaning, and diagnosing any problems with the program. Since the front-end needs knowledge of the language it’s reading, it is very source-language specific. The front-end is broken down further into three distinct phases.
The first pass is called lexical analysis, or scanning. The scanner directly reads the source, groups characters into meaningful symbols called tokens, and then feeds this information to the next pass. You can think of this step as reading individual words and punctuation, and then categorising them as nouns, adjectives, verbs, punctuation, etc. For example, the sentence “Failure is the greatest teacher.” has six tokens. A simple scanner might tokenise them as follows:
<kind = noun, spelling = "Failure", position = 1>
<kind = verb, spelling = "is", position = 2>
<kind = article, spelling = "the", position = 3>
<kind = adjective, spelling = "greatest", position = 4>
<kind = noun, spelling = "teacher", position = 5>
<kind = full stop, spelling = ".", position = 6>
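In C++, a token like this can be modelled as a small struct. Here’s a minimal sketch, with names of my own invention (`token_kind` and `token`); the scanner we actually build in the next chapter will define its own representation.

```cpp
#include <string>

// A hypothetical token for our English "scanner": a category tag, the
// original spelling, and the token's position in the sentence.
enum class token_kind { noun, verb, article, adjective, full_stop };

struct token {
    token_kind kind;       // which category the word falls into
    std::string spelling;  // the characters as they appeared in the source
    int position;          // where in the sentence the token occurs
};
```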
The next chapter will cover lexical analysis in more detail as we will be writing a scanner to motivate the remaining C++ chapters for this module.
The second pass is called syntax analysis, or parsing. The parser is responsible for checking that the program’s structure is correct according to a grammar. Unlike natural languages, which have flexible grammars, programming languages need to have a rigid grammar to help ensure that there aren’t any ambiguities in a program: it won’t be feasible for the compiler to ask for clarification. Here’s an incredibly simplified grammar for a sentence in English².
sentence = subject verb object endmark
endmark = "." | "!"
subject = article? adjective? noun
object = article? adjective? noun
article = "the" | "an" | "a"
`adjective`, `noun`, and `verb` have their usual meanings. `endmark` uses `|` to indicate alternatives: it’s either a full stop or an exclamation mark. `subject` and `object` use `?` to indicate something is optional: an article may be present, but it doesn’t need to be. This grammar is compatible with the token stream from our hypothetical scanner. Now suppose we tokenise “The greatest teacher failure is.” Will this be a valid sentence?
<kind = article, spelling = "The", position = 1>
<kind = adjective, spelling = "greatest", position = 2>
<kind = noun, spelling = "teacher", position = 3>
<kind = noun, spelling = "failure", position = 4>
<kind = verb, spelling = "is", position = 5>
<kind = full stop, spelling = ".", position = 6>
If we expand `sentence`, we can identify on inspection why “The greatest teacher failure is” is considered an invalid sentence by our parser.
// subject expanded object expanded
sentence = (article? adjective? noun) verb (article? adjective? noun) endmark
Notice that we have two consecutive nouns in our second token stream, and that our grammar requires nouns to be separated by a verb. As a result, the parser will issue a syntax error.
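To make the parsing step a little more concrete, here’s one hypothetical way to match the `subject` rule using recursive descent, reusing the `token` type from the scanner sketch earlier. This is illustrative only: a real parser has much more to do (error messages, building a tree, and so on).

```cpp
#include <span>

// Tries to match `subject = article? adjective? noun` at the front of
// `tokens`. On success, advances `tokens` past what was consumed.
bool match_subject(std::span<token const>& tokens) {
    auto cursor = tokens;
    if (!cursor.empty() && cursor.front().kind == token_kind::article) {
        cursor = cursor.subspan(1); // optional article
    }
    if (!cursor.empty() && cursor.front().kind == token_kind::adjective) {
        cursor = cursor.subspan(1); // optional adjective
    }
    if (cursor.empty() || cursor.front().kind != token_kind::noun) {
        return false; // the noun is mandatory
    }
    tokens = cursor.subspan(1);
    return true;
}
```

Applied to “The greatest teacher failure is.”, `match_subject` consumes “The greatest teacher”, and the parser then finds the noun “failure” where the grammar demands a verb: that mismatch is the syntax error.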
Once this stage is completed, the parser will hand over to the third pass, for semantic analysis. The semantic analyser checks that the program’s meaning is correct. This is often concerned with checking things that aren’t expressible in the grammar. For example, the sentence “The colourless idea is a wooden happiness” is a valid `sentence` according to the parser, but it fails semantic analysis because ideas don’t have colour, and because ideas don’t express emotions such as happiness³.
The semantic analyser is also responsible for canonicalisation, which transforms the program so that equivalent expressions are all expressed in the same way. For example, there are many ways to ask about the colour of a ball. Here are just a few:
Is the ball red or blue?
Is the ball red, or is it blue?
Is the ball blue, or is it red?
Is the ball red in colour, or blue in colour?
These are all asking the same question, so we pick one of the phrasings, and turn all other phrasings with the same meaning into that one. This is important for keeping intermediate representations simple, which we’ll look at later on.
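As a toy illustration, here’s a hedged sketch of canonicalising those ball questions: we keep only the subject and the colour alternatives, then order the alternatives alphabetically, so every phrasing above collapses to one representation. The `question` type and the choice of canonical form are invented for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A hypothetical pre-parsed question: "Is the <subject> <colours[0]> or <colours[1]>?"
struct question {
    std::string subject;               // e.g. "ball"
    std::vector<std::string> colours;  // e.g. {"red", "blue"}
};

// Sorting the alternatives means "red or blue" and "blue or red" become
// the very same value, so later passes only ever see one form.
question canonicalise(question q) {
    std::sort(q.colours.begin(), q.colours.end());
    return q;
}
```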
Once the compiler has established that the source program is correct and translatable, it needs to actually perform said translation. The back-end is responsible for producing an output (called the target program). Similar to how the source program was written in a source language, the target program is written in a target language. For the purposes of this chapter, we’ll split the back-end into two main phases.
Source code is a tool used to communicate with humans and computers, and we should always be striving to write code that is correct first, easy-to-understand second, and efficient third. The optimiser’s role in translation is to take our readable code and make it more “efficient”. This could mean that the code is made faster, smaller, more power-efficient, or some combination of the three. For example, the following code is a questionable—but valid—way to set an integer to 5:
auto x = zero(); // x == 0
++x; // x == 1
++x; // x == 2
++x; // x == 3
++x; // x == 4
++x; // x == 5
If the compiler were to faithfully translate this code, it would be quite slow due to the repeated increments. The optimiser would step in and change it to `auto x = zero() + 5;`. If it can prove that `zero()` always returns `0`, then it is allowed to simplify it further, to simply `auto x = 5;`.
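To sketch how an optimiser might justify that rewrite, here’s a toy constant-folding pass over an invented two-instruction program representation. It “executes” the program at compile time and returns the constant the variable must hold, which is exactly the reasoning that lets the compiler emit `auto x = 5;`.

```cpp
#include <vector>

// A toy program representation invented for this example: a variable is
// either reset to zero or incremented.
enum class op { set_zero, increment };

// Abstractly executes the program at compile time. If we can prove the
// starting value (here, that zero() returns 0), the whole sequence folds
// down to a single constant.
int fold_constant(std::vector<op> const& program) {
    int value = 0;
    for (auto instruction : program) {
        value = (instruction == op::set_zero) ? 0 : value + 1;
    }
    return value;
}
```

For the snippet above, `fold_constant({op::set_zero, op::increment, op::increment, op::increment, op::increment, op::increment})` yields `5`.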
The last thing that the back-end will need to do is to produce the target program itself. We call this code generation. Compilers are often associated with generating assembly as their target program, though many also generate high-level source code. The code generator has expert knowledge about its target language, and uses this expertise to generate the best possible target program given the information it’s provided with.
Linux is designed to work on many different kinds of platforms: various distributions can run on x86-64 (desktops/laptops/servers), ARMv7 (mobile phones/some laptops/servers), ARMv8 (ditto), and MIPS (older gaming consoles), just to name a few. If we were to write a C++ compiler for each of these platforms with the workflow described above, we’d have to write four C++ compilers (one for each platform). If we also want a Rust compiler, we’ll need to write another four, and another four to add a Swift compiler, and another four to add a C compiler (we require sixteen separate codebases now). If we plan to support the RISC-V architecture, we’ll need to do that for all of the languages, bringing us to twenty codebases. This results in a combinatorial explosion: if we wish to support L languages and P platforms, we’re going to need to write L ⋅ P compilers. That’s untenable!
Good software engineering practices allow us to avoid this problem by decoupling the front-end and the back-end. Recall that the front-end is an expert in the source language, and the back-end is an expert in the target language. Neither of them has any knowledge of the other, so we can put them into completely different modules. This means that there will be one front-end for each language and one back-end for each of the platforms we support. That reduces the number of components that we need to write from L ⋅ P down to L + P. In our hypothetical scenario, this means that we’d write a total of four front-ends and five back-ends: then we plug them together to create our set of twenty compilers.
In order to facilitate this separation of concerns, we need a way for the two components to talk with one another. This is called the intermediate representation. An intermediate representation is a data structure that serves as the glue between two different components, by offering an interface that both parties understand. There are many different kinds of intermediate representations that compilers use. We’ll explore this in some detail later on, but won’t need to get into the specifics terribly much for our own purposes.
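To give a flavour of what that glue can look like, here’s a minimal, hypothetical three-address-style instruction. A back-end only needs to understand this small vocabulary, not the source language it came from; real IRs (LLVM IR, for instance) are considerably richer.

```cpp
#include <cstdint>

// One invented IR instruction: an operation plus virtual-register operands.
enum class opcode : std::uint8_t { add, sub, load, store, jump };

struct ir_instruction {
    opcode op;
    std::int32_t destination; // virtual register receiving the result
    std::int32_t lhs;         // first operand
    std::int32_t rhs;         // second operand (unused by some opcodes)
};
```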
This may seem like a tangent, but the idea of an intermediate representation comes up a lot in C++, and so it’s worth taking some time out to get an understanding of the concept.
Since we’re not going to be exploring compilers in detail, you may wish to read *Engineering a Compiler* if you find this project interesting.
We’re about to take a hard context switch into learning about the toy language we’ll be implementing, so now is probably a good time to stretch your legs, grab a glass of water, and then read on.
BF⁴ is an esoteric programming language that’s extremely simple to implement, with only eight instructions. Due to this simplicity, BF is a popular language for implementing simple language translators.
No! You don’t need to know how to write programs in BF in order to complete this project. We’ll be using examples online (such as those found on Wikipedia) as our test programs. All you need to learn is how the memory model and the instruction set work, so that you can translate the source code to target code.
BF programs have an array of at least 30kB of readable and writable memory at their disposal, and an index to tell us where in the array we’re currently reading/writing. At program start, all of the values in the array are zero. Since that’s quite a lot of cells, we won’t visually represent it like we usually do: we’ll instead refer to the memory as `data[data_index]`.
As previously mentioned, BF has only eight instructions, which reside in an instruction cache that is also indexed. We’ll call this `instructions[instruction_index]`. After executing the instruction at `instructions[instruction_index]`, we increment the instruction index.
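In C++ terms, we might model that state along these lines. The 30,000-cell array matches the minimum above; this is a sketch, and our actual implementation may well choose differently.

```cpp
#include <array>
#include <cstddef>
#include <string>

// The BF machine state described above. Every cell starts at zero.
std::array<unsigned char, 30'000> data{};
std::size_t data_index = 0;

// The program's instructions, and where we're currently up to.
std::string instructions;
std::size_t instruction_index = 0;
```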
Character | Instruction |
---|---|
`>` | `++data_index` |
`<` | `--data_index` |
`+` | `++data[data_index]` |
`-` | `--data[data_index]` |
`.` | `fmt::print(data[data_index])` |
`,` | Reads input from the user. |
`[` | If `data[data_index] == 0`: set `instruction_index` to the index of the matching `]` + 1 (see below). |
`]` | If `data[data_index] != 0`: set `instruction_index` to the index of the matching `[` + 1 (see below). |
Any character that isn’t an instruction is a comment.
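Putting the table together with the comment rule, the heart of an interpreter for BF can be a single switch over the current character, using the state sketched above; the `default` case silently skips comments. Bracket handling is deliberately omitted here, since it depends on the matching rules described below, and I’ve reached for `std::getchar` purely for illustration (the EOF behaviour of `,` is a design decision we’re glossing over).

```cpp
#include <cstdio>
#include <fmt/core.h>

// Executes the instruction at instructions[instruction_index], then
// advances to the next one. '[' and ']' are handled separately.
void execute_one() {
    switch (instructions[instruction_index]) {
    case '>': ++data_index; break;
    case '<': --data_index; break;
    case '+': ++data[data_index]; break;
    case '-': --data[data_index]; break;
    case '.': fmt::print("{}", static_cast<char>(data[data_index])); break;
    case ',': data[data_index] = static_cast<unsigned char>(std::getchar()); break;
    default: break; // any other character is a comment
    }
    ++instruction_index;
}
```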
Every `[` must be matched with a following `]`, and vice versa. This means that `[`, `[[]`, `]`, `[]]`, and `][` are all invalid programs, because sometimes the brackets aren’t balanced, and sometimes they’re in the wrong order. `[[]]`, on the other hand, would be valid.
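Here’s a hedged sketch of how we might check that balancing rule in one pass: track the bracket depth, reject if it ever goes negative (a `]` before its `[`), and reject if anything is left open at the end.

```cpp
#include <string_view>

// Returns true if every '[' is matched by a later ']', and vice versa.
bool brackets_balanced(std::string_view program) {
    int depth = 0;
    for (char const c : program) {
        if (c == '[') {
            ++depth;
        } else if (c == ']' && --depth < 0) {
            return false; // ']' with no '[' to match it
        }
    }
    return depth == 0; // false if any '[' was left unmatched
}
```

Running it over the examples above: each of `[`, `[[]`, `]`, `[]]`, and `][` is rejected, and `[[]]` is accepted.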
This is everything you need to know about BF in order to complete the project!
If you’d like to provide feedback regarding this series, please file an issue on GitHub.
If you’re interested in reading future chapters, subscribe to my RSS feed to receive a notification at the time of publication. If you’d previously subscribed to my feed on my old website (www.cjdb.com.au), please be sure to note the new domain!
In this chapter, we learnt about the five main phases that compilers enact, how they’re categorised, and what is used to facilitate their separation. Finally, we learnt about our project’s source language. As teased earlier on, we’re going to look at lexical analysis more closely in the next chapter, since it will be directly relevant to everything left in the module. After that, it’s back to C++!
As a compiler engineer, I’m slightly biased on the fun part, but the broadness is fairly objective.↩︎
I really need to stress that this is a simplified grammar, and that anyone with a background in linguistics will notice that it allows for many invalid sentences in a real-world setting (e.g. “C++ is fun language”). Vagrant—one of my reviewers—is a linguist and was able to identify many problems with the syntax (some of which I wasn’t even aware of). Writing a true set of rules for natural languages is not the done thing anymore, and that’s way beyond the scope of this book (when I asked for further reading material, I was told that it can probably fill an entire course).↩︎
There’s also an argument that happiness can’t be “wooden”, although having “wooden enthusiasm” implies that someone is feigning their emotion, and I think that might apply to happiness too.↩︎
The full name of BF contains strong language, which may cause some folks distress to the point of not reading the text. In order to be as inclusive as possible, while still using this very simple language, I’m using the shorthand “BF” instead. If you’d like to read more about the language, you can do so on Wikipedia, but please be aware that it uses the unabbreviated name in the article.↩︎