Throughout this series, we’re going to be implementing a BF compiler. This chapter introduces you to both a high-level overview of a compiler and to the BF language. As a result, we won’t be learning anything about C++ in this chapter (but everything we learn will be relevant to the next chapter). It’s okay to not get this chapter on the first go: we’ll be implementing the compiler over a great many chapters, so feel free to refer to this chapter as often as necessary.
Compilers are a fun¹ technology to implement and require a wide variety of data structures and algorithms. Although we won’t be implementing part of the compiler in every chapter, we will be using it as a common and practical motivation, so that you can observe how features are used in a context that isn’t contrived and that needs some design.
Objective (you will develop) | Outcome (you will be able to) |
---|---|
an understanding of how a compiler translates source code into target code | |
an understanding of the rules of the BF programming language | |
Thanks to Tom Kunc, Janet Cobb, David Ledger, Vagrant Gautam, Chandler Carruth, killerbee13, Kate Gregory, Luke D’Alessandro, Corentin Jabot, and John Filleau for their feedback on this chapter.
You may recall from Chapter 2 that a compiler is a program that translates a program written in one language into an equivalent program in another language. Compilers are typically composed of three parts: a front-end, an optional middle-end, and a back-end. Applied Modern C++ isn’t a book on compilers, so we’ll sadly be skipping over a lot of interesting compiler theory. This chapter will give you an overview of these three parts, and then we’ll go into more detail about each part—along with a design—prior to implementing them in C++.
A compiler must preserve the meaning of its input, otherwise it wouldn’t make for a useful tool.
The front-end is responsible for reading the input (called the source program), understanding its meaning, and diagnosing any problems with the program. Since the front-end needs knowledge of the language it’s reading, it is very source-language specific. The front-end is broken down further into three distinct phases.
The first pass is called lexical analysis, or scanning. The scanner directly reads the source, groups characters into meaningful symbols called tokens, and then feeds this information to the next pass. You can think of this step as reading individual words and punctuation, and then categorising them as nouns, adjectives, verbs, punctuation, etc. For example, the sentence “Failure is the greatest teacher.” has six tokens. A simple scanner might tokenise them as follows:
<kind = noun, spelling = "Failure", position = 1>
<kind = verb, spelling = "is", position = 2>
<kind = article, spelling = "the", position = 3>
<kind = adjective, spelling = "greatest", position = 4>
<kind = noun, spelling = "teacher", position = 5>
<kind = full stop, spelling = ".", position = 6>
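In C++, a token like this can be modelled as a small struct. Here’s a minimal sketch, with names of my own invention (`token_kind` and `token`); the scanner we actually build in the next chapter will define its own representation.

```cpp
#include <string>

// A hypothetical token for our English "scanner": a category tag, the
// original spelling, and the token's position in the sentence.
enum class token_kind { noun, verb, article, adjective, full_stop };

struct token {
    token_kind kind;       // which category the word falls into
    std::string spelling;  // the characters as they appeared in the source
    int position;          // where in the sentence the token occurs
};
```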
The next chapter will cover lexical analysis in more detail as we will be writing a scanner to motivate the remaining C++ chapters for this module.
The second pass is called syntax analysis, or parsing. The parser is responsible for checking that the program’s structure is correct according to a grammar. Unlike natural languages, which have flexible grammars, programming languages need to have a rigid grammar to help ensure that there aren’t any ambiguities in a program: it won’t be feasible for the compiler to ask for clarification. Here’s an incredibly simplified grammar for a sentence in English².
sentence = subject verb object endmark
endmark = "." | "!"
subject = article? adjective? noun
object = article? adjective? noun
article = "the" | "an" | "a"
`adjective`, `noun`, and `verb` have their usual meanings. `endmark` uses `|` to indicate alternatives: it’s either a full stop or an exclamation mark. `subject` and `object` use `?` to indicate something is optional: an article may be present, but it doesn’t need to be. This grammar is compatible with the token stream from our hypothetical scanner. Now suppose we tokenise “The greatest teacher failure is.” Will this be a valid sentence?
<kind = article, spelling = "The", position = 1>
<kind = adjective, spelling = "greatest", position = 2>
<kind = noun, spelling = "teacher", position = 3>
<kind = noun, spelling = "failure", position = 4>
<kind = verb, spelling = "is", position = 5>
<kind = full stop, spelling = ".", position = 6>
If we expand `sentence`, we can identify on inspection why “The greatest teacher failure is” is considered an invalid sentence by our parser.
// subject expanded object expanded
sentence = (article? adjective? noun) verb (article? adjective? noun) endmark
Notice that we have two consecutive nouns in our second token stream, and that our grammar requires nouns to be separated by a verb. As a result, the parser will issue a syntax error.
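To make the parsing step a little more concrete, here’s one hypothetical way to match the `subject` rule using recursive descent, reusing the `token` type from the scanner sketch earlier. This is illustrative only: a real parser has much more to do (error messages, building a tree, and so on).

```cpp
#include <span>

// Tries to match `subject = article? adjective? noun` at the front of
// `tokens`. On success, advances `tokens` past what was consumed.
bool match_subject(std::span<token const>& tokens) {
    auto cursor = tokens;
    if (!cursor.empty() && cursor.front().kind == token_kind::article) {
        cursor = cursor.subspan(1); // optional article
    }
    if (!cursor.empty() && cursor.front().kind == token_kind::adjective) {
        cursor = cursor.subspan(1); // optional adjective
    }
    if (cursor.empty() || cursor.front().kind != token_kind::noun) {
        return false; // the noun is mandatory
    }
    tokens = cursor.subspan(1);
    return true;
}
```

Applied to “The greatest teacher failure is.”, `match_subject` consumes “The greatest teacher”, and the parser then finds the noun “failure” where the grammar demands a verb: that mismatch is the syntax error.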
Once this stage is completed, the parser will hand over to the third pass, for semantic analysis. The semantic analyser checks that the program’s meaning is correct. This is often concerned with checking things that aren’t expressible in the grammar. For example, the sentence “The colourless idea is a wooden happiness” is a valid `sentence` according to the parser, but it fails semantic analysis because ideas don’t have colour, and because ideas don’t express emotions such as happiness³.
The semantic analyser is also responsible for canonicalisation, which transforms the program so that equivalent expressions are all expressed in the same way. For example, there are many ways to ask about the colour of a ball. Here are just a few:
Is the ball red or blue?
Is the ball red, or is it blue?
Is the ball blue, or is it red?
Is the ball red in colour, or blue in colour?
These are all asking the same question, so we pick one of the phrasings, and turn all other phrasings with the same meaning into that one. This is important for keeping intermediate representations simple, which we’ll look at later on.
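As a toy illustration, here’s a hedged sketch of canonicalising those ball questions: we keep only the subject and the colour alternatives, then order the alternatives alphabetically, so every phrasing above collapses to one representation. The `question` type and the choice of canonical form are invented for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A hypothetical pre-parsed question: "Is the <subject> <colours[0]> or <colours[1]>?"
struct question {
    std::string subject;               // e.g. "ball"
    std::vector<std::string> colours;  // e.g. {"red", "blue"}
};

// Sorting the alternatives means "red or blue" and "blue or red" become
// the very same value, so later passes only ever see one form.
question canonicalise(question q) {
    std::sort(q.colours.begin(), q.colours.end());
    return q;
}
```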
Once the compiler has established that the source program is correct and translatable, it needs to actually perform said translation. The back-end is responsible for producing an output (called the target program). Similar to how the source program was written in a source language, the target program is written in a target language. For the purposes of this chapter, we’ll split the back-end into two main phases.
Source code is a tool used to communicate with humans and computers, and we should always be striving to write code that is correct first, easy-to-understand second, and efficient third. The optimiser’s role in translation is to take our readable code and make it more “efficient”. This could mean that the code is made faster, smaller, more power-efficient, or some combination of the three. For example, the following code is a questionable—but valid—way to set an integer to 5:
auto x = zero(); // x == 0
++x; // x == 1
++x; // x == 2
++x; // x == 3
++x; // x == 4
++x; // x == 5
If the compiler were to faithfully translate this code, it would be quite slow due to the repeated increments. The optimiser would step in and change it to `auto x = zero() + 5;`. If it can prove that `zero()` always returns `0`, then it is allowed to simplify it further, to simply `auto x = 5;`.
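To sketch how an optimiser might justify that rewrite, here’s a toy constant-folding pass over an invented two-instruction program representation. It “executes” the program at compile time and returns the constant the variable must hold, which is exactly the reasoning that lets the compiler emit `auto x = 5;`.

```cpp
#include <vector>

// A toy program representation invented for this example: a variable is
// either reset to zero or incremented.
enum class op { set_zero, increment };

// Abstractly executes the program at compile time. If we can prove the
// starting value (here, that zero() returns 0), the whole sequence folds
// down to a single constant.
int fold_constant(std::vector<op> const& program) {
    int value = 0;
    for (auto instruction : program) {
        value = (instruction == op::set_zero) ? 0 : value + 1;
    }
    return value;
}
```

For the snippet above, `fold_constant({op::set_zero, op::increment, op::increment, op::increment, op::increment, op::increment})` yields `5`.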
The last thing that the back-end will need to do is to produce the target program itself. We call this code generation. Compilers are often associated with generating assembly as their target program, though many also generate high-level source code. The code generator has expert knowledge about its target language, and uses this expertise to generate the best possible target program given the information it’s provided with.
Linux is designed to work on many different kinds of platforms: various distributions can run on x86-64 (desktops/laptops/servers), ARMv7 (mobile phones/some laptops/servers), ARMv8 (ditto), and MIPS (older gaming consoles), just to name a few. If we were to write a C++ compiler for each of these platforms with the workflow described above, we’d have to write four C++ compilers (one for each platform). If we also want a Rust compiler, we’ll need to write another four, and another four to add a Swift compiler, and another four to add a C compiler (we require sixteen separate codebases now). If we plan to support the RISC-V architecture, we’ll need to do that for all of the languages, bringing us to twenty codebases. This results in a combinatorial explosion: if we wish to support L languages and P platforms, we’re going to need to write L ⋅ P compilers. That’s untenable!
Good software engineering practices allow us to avoid this problem by decoupling the front-end and the back-end. Recall that the front-end is an expert in the source language, and the back-end is an expert in the target language. Neither of them has any knowledge of the other, so we can put them into completely different modules. This means that there will be one front-end for each language and one back-end for each of the platforms we support. That reduces the number of components that we need to write from L ⋅ P down to L + P. In our hypothetical scenario, this means that we’d write a total of four front-ends and five back-ends: then we plug them together to create our set of twenty compilers.
In order to facilitate this separation of concerns, we need a way for the two components to talk with one another. This is called the intermediate representation. An intermediate representation is a data structure that serves as the glue between two different components, by offering an interface that both parties understand. There are many different kinds of intermediate representations that compilers use. We’ll explore this in some detail later on, but won’t need to get into the specifics terribly much for our own purposes.
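To give a flavour of what that glue can look like, here’s a minimal, hypothetical three-address-style instruction. A back-end only needs to understand this small vocabulary, not the source language it came from; real IRs (LLVM IR, for instance) are considerably richer.

```cpp
#include <cstdint>

// One invented IR instruction: an operation plus virtual-register operands.
enum class opcode : std::uint8_t { add, sub, load, store, jump };

struct ir_instruction {
    opcode op;
    std::int32_t destination; // virtual register receiving the result
    std::int32_t lhs;         // first operand
    std::int32_t rhs;         // second operand (unused by some opcodes)
};
```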
This may seem like a tangent, but the idea of an intermediate representation comes up a lot in C++, and so it’s worth taking some time out to get an understanding of the concept.
Since we’re not going to be exploring compilers in detail, you may wish to read *Engineering a Compiler* if you find this project interesting.
We’re about to take a hard context switch into learning about the toy language we’ll be implementing, so now is probably a good time to stretch your legs, grab a glass of water, and then read on.
BF⁴ is an esoteric programming language that’s extremely simple to implement, with only eight instructions. Due to this simplicity, BF is a popular language for implementing simple language translators.
No! You don’t need to know how to write programs in BF in order to complete this project. We’ll be using examples online (such as those found on Wikipedia) as our test programs. All you need to learn is how the memory model and the instruction set work, so that you can translate the source code to target code.
BF programs have an array of at least 30kB of readable and writable memory at their disposal, and an index to tell us where in the array we’re currently reading/writing. At program start, all of the values in the array are zero. Since that’s quite a lot of cells, we won’t visually represent it like we usually do: we’ll instead refer to the memory as `data[data_index]`.
As previously mentioned, BF has only eight instructions, which reside in an instruction cache that is also indexed. We’ll call this `instructions[instruction_index]`. After executing the instruction at `instructions[instruction_index]`, we increment the instruction index.
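In C++ terms, we might model that state along these lines. The 30,000-cell array matches the minimum above; this is a sketch, and our actual implementation may well choose differently.

```cpp
#include <array>
#include <cstddef>
#include <string>

// The BF machine state described above. Every cell starts at zero.
std::array<unsigned char, 30'000> data{};
std::size_t data_index = 0;

// The program's instructions, and where we're currently up to.
std::string instructions;
std::size_t instruction_index = 0;
```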
Character | Instruction |
---|---|
`>` | `++data_index` |
`<` | `--data_index` |
`+` | `++data[data_index]` |
`-` | `--data[data_index]` |
`.` | `fmt::print(data[data_index])` |
`,` | Reads input from the user. |
`[` | If `data[data_index] == 0`: set `instruction_index` to the index of the matching `]` + 1 (see below). |
`]` | If `data[data_index] != 0`: set `instruction_index` to the index of the matching `[` + 1 (see below). |
Any character that isn’t an instruction is a comment.
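Putting the table together with the comment rule, the heart of an interpreter for BF can be a single switch over the current character, using the state sketched above; the `default` case silently skips comments. Bracket handling is deliberately omitted here, since it depends on the matching rules described below, and I’ve reached for `std::getchar` purely for illustration (the EOF behaviour of `,` is a design decision we’re glossing over).

```cpp
#include <cstdio>
#include <fmt/core.h>

// Executes the instruction at instructions[instruction_index], then
// advances to the next one. '[' and ']' are handled separately.
void execute_one() {
    switch (instructions[instruction_index]) {
    case '>': ++data_index; break;
    case '<': --data_index; break;
    case '+': ++data[data_index]; break;
    case '-': --data[data_index]; break;
    case '.': fmt::print("{}", static_cast<char>(data[data_index])); break;
    case ',': data[data_index] = static_cast<unsigned char>(std::getchar()); break;
    default: break; // any other character is a comment
    }
    ++instruction_index;
}
```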
Every `[` must be matched with a following `]`, and vice versa. This means that `[`, `[[]`, `]`, `[]]`, and `][` are all invalid programs, because sometimes the brackets aren’t balanced, and sometimes they’re in the wrong order. `[[]]`, on the other hand, would be valid.
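Here’s a hedged sketch of how we might check that balancing rule in one pass: track the bracket depth, reject if it ever goes negative (a `]` before its `[`), and reject if anything is left open at the end.

```cpp
#include <string_view>

// Returns true if every '[' is matched by a later ']', and vice versa.
bool brackets_balanced(std::string_view program) {
    int depth = 0;
    for (char const c : program) {
        if (c == '[') {
            ++depth;
        } else if (c == ']' && --depth < 0) {
            return false; // ']' with no '[' to match it
        }
    }
    return depth == 0; // false if any '[' was left unmatched
}
```

Running it over the examples above: each of `[`, `[[]`, `]`, `[]]`, and `][` is rejected, and `[[]]` is accepted.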
This is everything you need to know about BF in order to complete the project!
If you’d like to provide feedback regarding this series, please file an issue on GitHub.
If you’re interested in reading future chapters, subscribe to my RSS feed to receive a notification at the time of publication. If you’d previously subscribed to my feed on my old website (www.cjdb.com.au), please be sure to note the new domain!
In this chapter, we learnt about the five main phases that compilers enact, how they’re categorised, and what is used to facilitate their separation. Finally, we learnt about our project’s source language. As teased earlier on, we’re going to look at lexical analysis more closely in the next chapter, since it will be directly relevant to everything left in the module. After that, it’s back to C++!
As a compiler engineer, I’m slightly biased on the fun part, but the broadness is fairly objective.↩︎
I really need to stress that this is a simplified grammar, and that anyone with a background in linguistics will notice that it allows for many invalid sentences in a real-world setting (e.g. “C++ is fun language”). Vagrant—one of my reviewers—is a linguist and was able to identify many problems with the syntax (some of which I wasn’t even aware of). Writing a true set of rules for natural languages is not the done thing anymore, and that’s way beyond the scope of this book (when I asked for further reading material, I was told that it can probably fill an entire course).↩︎
There’s also an argument that happiness can’t be “wooden”, although having “wooden enthusiasm” implies that someone is feigning their emotion, and I think that might apply to happiness too.↩︎
The full name of BF contains strong language, which may cause some folks distress to the point of not reading the text. In order to be as inclusive as possible, while still using this very simple language, I’m using the shorthand “BF” instead. If you’d like to read more about the language, you can do so on Wikipedia, but please be aware that it uses the unabbreviated name in the article.↩︎