Basics of compiler design — Essays (high school), Compiler Construction

Introduction to Compiler Design

2014/2015, uploaded 09/22/2015 by usha.ambikapathy


Basics of Compiler Design

Chapter 1

Introduction

1.1 What is a compiler?

A compiler translates (or compiles) a program written in a high-level programming language that is suitable for human programmers into the low-level machine language that is required by computers. During this process, the compiler will also attempt to spot and report obvious programmer mistakes.

1.2 The phases of a compiler

Since writing a compiler is a nontrivial task, it is a good idea to structure the work. A typical way of doing this is to split the compilation into several phases with well-defined interfaces. Conceptually, these phases operate in sequence (though in practice, they are often interleaved), each phase (except the first) taking the output from the previous phase as its input. It is common to let each phase be handled by a separate module. Some of these modules are written by hand, while others may be generated from specifications. Often, some of the modules can be shared between several compilers.

Lexical analysis This is the initial part of reading and analysing the program text: The text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword or number.

Syntax analysis This phase takes the list of tokens produced by the lexical analysis and arranges these in a tree-structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.

Type checking This phase analyses the syntax tree to determine if the program violates certain consistency requirements, e.g., if a variable is used but not declared or if it is used in a context that doesn't make sense given the type of the variable, such as trying to use a boolean value as a function pointer.

Intermediate code generation The program is translated to a simple machine-independent intermediate language.

Register allocation The symbolic variable names used in the intermediate code are translated to numbers, each of which corresponds to a register in the target machine code.

Machine code generation The intermediate language is translated to assembly language (a textual representation of machine code) for a specific machine architecture.

Assembly and linking The assembly-language code is translated into binary representation and addresses of variables, functions, etc., are determined.

The first three phases are collectively called the frontend of the compiler and the last three phases are collectively called the backend. The middle part of the compiler is in this context only the intermediate code generation, but this often includes various optimisations and transformations on the intermediate code.

Each phase, through checking and transformation, establishes stronger invariants on the things it passes on to the next, so that writing each subsequent phase is easier than if these have to take all the preceding into account. For example, the type checker can assume absence of syntax errors and the code generation can assume absence of type errors.
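As a toy illustration of this phase structure (not from the book), the following Python sketch runs a trivial invented language of single-digit additions through three of the phases: lexical analysis, syntax analysis and intermediate code generation. All names and the mini-language are made up for the example.

```python
# Toy pipeline: each phase consumes the previous phase's output.
# The "language" is single-digit addition, e.g. "1+2+3".

def lex(text):
    """Lexical analysis: divide the text into tokens."""
    tokens = []
    for ch in text:
        if ch.isdigit():
            tokens.append(("NUM", int(ch)))
        elif ch == "+":
            tokens.append(("PLUS", None))
        elif not ch.isspace():
            raise SyntaxError("unexpected character " + repr(ch))
    return tokens

def parse(tokens):
    """Syntax analysis: arrange tokens in a (left-leaning) syntax tree."""
    tree = tokens[0]
    for i in range(1, len(tokens), 2):
        assert tokens[i][0] == "PLUS"
        tree = ("PLUS", tree, tokens[i + 1])
    return tree

def to_ir(tree):
    """Intermediate code generation: flatten the tree to three-address code."""
    code, counter = [], [0]
    def walk(node):
        if node[0] == "NUM":
            return str(node[1])
        left, right = walk(node[1]), walk(node[2])
        tmp = "t%d" % counter[0]
        counter[0] += 1
        code.append("%s = %s + %s" % (tmp, left, right))
        return tmp
    walk(tree)
    return code

print(to_ir(parse(lex("1+2+3"))))   # prints ['t0 = 1 + 2', 't1 = t0 + 3']
```

A real compiler would of course insert type checking between parsing and code generation and continue into the backend phases, but the shape — each phase a module consuming the previous phase's output — is the point here.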

1.4 Why learn about compilers?

Few people will ever be required to write a compiler for a general-purpose language like C, Pascal or SML. So why do most computer science institutions offer compiler courses and often make these mandatory? Some typical reasons are:

a) It is considered a topic that you should know in order to be "well-cultured" in computer science.

b) A good craftsman should know his tools, and compilers are important tools for programmers and computer scientists.

c) The techniques used for constructing a compiler are useful for other purposes as well.

d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language.

The first of these reasons is somewhat dubious, though something can be said for "knowing your roots", even in such a hastily changing field as computer science.

etc. Such words are traditionally called tokens. A lexical analyser, or lexer for short, will as its input take a string of individual letters and divide this string into tokens. Additionally, it will filter out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments.

The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis phase. In theory, the work that is done during lexical analysis can be made an integral part of syntax analysis, and in simple systems this is indeed often done. However, there are reasons for keeping the phases separate:

• Efficiency: A lexer may do the simple parts of the work faster than the more general parser can. Furthermore, the size of a system that is split in two may be smaller than a combined system. This may seem paradoxical but, as we shall see, there is a non-linear factor involved which may make a separated system smaller than a combined system.

• Modularity: The syntactical description of the language need not be cluttered with small lexical details such as white-space and comments.

• Tradition: Languages are often designed with separate lexical and syntactical phases in mind, and the standard documents of such languages typically separate lexical and syntactical elements of the languages.

It is usually not terribly difficult to write a lexer by hand: You first read past initial white-space, then you, in sequence, test to see if the next token is a keyword, a number, a variable or whatnot. However, this is not a very good way of handling the problem: You may read the same part of the input repeatedly while testing each possible token and in some cases it may not be clear where the next token ends. Furthermore, a handwritten lexer may be complex and difficult to maintain. Hence, lexers are normally constructed by lexer generators, which transform human-readable specifications of tokens and white-space into efficient programs.

We will see the same general strategy in the chapter about syntax analysis:

Specifications in a well-defined human-readable notation are transformed into efficient programs. For lexical analysis, specifications are traditionally written using regular expressions: An algebraic notation for describing sets of strings. The generated lexers are in a class of extremely simple programs called finite automata. This chapter will describe regular expressions and finite automata, their properties and how regular expressions can be converted to finite automata. Finally, we discuss some practical aspects of lexer generators.
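The hand-written approach described above (skip white-space, then test token classes in sequence at the current position) might look like the following Python sketch; the token set and all names are invented for the example.

```python
# A minimal hand-written lexer: skip white-space, then try each
# token class in turn at the current position.
KEYWORDS = {"if", "then", "else"}   # hypothetical keyword set

def tokenize(text):
    pos, tokens, n = 0, [], len(text)
    while pos < n:
        ch = text[pos]
        if ch.isspace():                      # filter out white-space
            pos += 1
        elif ch.isalpha():                    # keyword or variable name
            start = pos
            while pos < n and text[pos].isalnum():
                pos += 1
            word = text[start:pos]
            tokens.append(("KEYWORD" if word in KEYWORDS else "NAME", word))
        elif ch.isdigit():                    # number
            start = pos
            while pos < n and text[pos].isdigit():
                pos += 1
            tokens.append(("NUM", text[start:pos]))
        else:
            raise SyntaxError("cannot tokenize at position %d" % pos)
    return tokens

print(tokenize("if x1 then 42"))
# prints [('KEYWORD', 'if'), ('NAME', 'x1'), ('KEYWORD', 'then'), ('NUM', '42')]
```

Even in this tiny example the maintenance problem shows: every new token class means another hand-coded branch, which is exactly what lexer generators automate away.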

2.2 Regular expressions

The set of all integer constants or the set of all variable names are sets of strings, where the individual letters are taken from a particular alphabet. Such a set of strings is called a language. For integers, the alphabet consists of the digits 0-9 and for variable names the alphabet contains both letters and digits (and perhaps a few other characters, such as underscore).

Given an alphabet, we will describe sets of strings by regular expressions, an algebraic notation that is compact and easy for humans to use and understand. The idea is that regular expressions that describe simple sets of strings can be combined to form regular expressions that describe more complex sets of strings.

When talking about regular expressions, we will use the letters r, s and t in italics to denote unspecified regular expressions. When letters stand for themselves (i.e., in regular expressions that describe strings using these letters) we will use typewriter font, e.g., a or b. Hence, when we say, e.g., "The regular expression s" (in typewriter font) we mean the regular expression that describes a single one-letter string "s", but when we say "The regular expression s" (in italics), we mean a regular expression of any form which we just happen to call s. We use the notation L(s) to denote the language (i.e., set of strings) described by the regular expression s. For example, L(a) is the set {"a"}. Figure 2.1 shows the constructions used to build regular expressions and the languages they describe:

L(s*) consists of strings that can be obtained by concatenating zero or more (possibly different) strings from L(s). If, for example, L(s) is {"a", "b"} then L(s*) is {"", "a", "b", "aa", "ab", "ba", "bb", "aaa", ... }, i.e., any string (including the empty) that consists entirely of as and bs.

Note that while we use the same notation for concrete strings and regular expressions denoting one-string languages, the context will make it clear which is meant. We will often show strings and sets of strings without using quotation marks, e.g., write {a, bb} instead of {"a", "bb"}. When doing so, we will use e to denote the empty string, so the example from L(s*) above is written as {e, a, b, aa, ab, ba, bb, aaa, ... }. The letters u, v and w in italics will be used to denote unspecified single strings, i.e., members of some language. As an example, abw denotes any string starting with ab.

Precedence rules

When we combine different constructor symbols, e.g., in the regular expression a|ab*, it isn't a priori clear how the different subexpressions are grouped.

We can use parentheses to make the grouping of symbols clear. Additionally, we use precedence rules, similar to the algebraic convention that 3 + 4 * 5 means 3 added to the product of 4 and 5 and not multiplying the sum of 3 and 4 by 5. For regular expressions, we use the following conventions: * binds tighter than concatenation, which binds tighter than alternative (|). The example a|ab* from above, hence, is equivalent to a|(a(b*)).
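Python's re module happens to use the same precedence conventions, so the grouping can be checked mechanically. This is just a sanity check added for illustration, not part of the book's notation:

```python
import re

# a|ab* groups as a|(a(b*)): * binds tighter than concatenation,
# which binds tighter than |.
samples = ["", "a", "b", "ab", "abbb", "ba"]
for s in samples:
    implicit = bool(re.fullmatch("a|ab*", s))
    explicit = bool(re.fullmatch("a|(a(b*))", s))
    assert implicit == explicit          # same language either way

print([s for s in samples if re.fullmatch("a|ab*", s)])
# prints ['a', 'ab', 'abbb']
```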

The | operator is associative and commutative (as it is based on set union, which has these properties). Concatenation is associative (but obviously not commutative) and distributes over |. Figure 2.2 shows these and other algebraic properties of regular expressions, including definitions of some shorthands introduced below.

2.2.1 Shorthands

While the constructions in figure 2.1 suffice to describe e.g., number strings and variable names, we will often use extra shorthands for convenience. For example, if we want to describe non-negative integer constants, we can do so by saying that it is one or more digits, which is expressed by the regular expression (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*. The large number of different digits makes this expression rather verbose. It gets even worse when we get to variable names, where we must enumerate all alphabetic letters (in both upper and lower case).

Hence, we introduce a shorthand for sets of letters. Sequences of letters within square brackets represent the set of these letters. For example, we use [ab01] as a shorthand for a|b|0|1. Additionally, we can use interval notation to abbreviate [0123456789] to [0-9]. We can combine several intervals within one bracket and for example write [a-zA-Z] to denote all alphabetic letters in both lower and upper case.

When using intervals, we must be aware of the ordering for the symbols involved. For the digits and letters used above, there is usually no confusion. However, if we write, e.g., [0-z] it is not immediately clear what is meant. When using such notation in lexer generators, standard ASCII or ISO 8859-1 character sets are usually used, with the hereby implied ordering of symbols. To avoid confusion, we will use the interval notation only for intervals of digits or alphabetic letters.

Getting back to the example of integer constants above, we can now write this much shorter as [0-9][0-9]*.
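The bracket shorthands map directly onto Python's re syntax, so the integer-constant example can be checked like this (a sanity check added for illustration):

```python
import re

integer = re.compile(r"[0-9][0-9]*")   # one or more digits

assert integer.fullmatch("2015")
assert integer.fullmatch("7")
assert not integer.fullmatch("")       # zero digits: not an integer constant
assert not integer.fullmatch("x9")

# [ab01] is shorthand for a|b|0|1:
for s in ["a", "b", "0", "1"]:
    assert re.fullmatch("[ab01]", s)
assert not re.fullmatch("[ab01]", "c")
```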

Since s* denotes zero or more occurrences of s, we needed to write the set of digits twice to describe that one or more digits are allowed. Such non-zero repetition is common enough that we introduce the shorthand s+ for ss*, as defined in figure 2.2.

(rs)t = rst = r(st)
se = s = es
r(s|t) = rs|rt
(r|s)t = rt|st
(s*)* = s*
s*s* = s*
ss* = s+ = s*s

Figure 2.2: Some algebraic properties of regular expressions

Floats. A floating-point constant can have an optional sign. After this, the mantissa part is described as a sequence of digits followed by a decimal point and then another sequence of digits. Either one (but not both) of the digit sequences can be empty. Finally, there is an optional exponent part, which is the letter e (in upper or lower case) followed by an (optionally signed) integer constant. If there is an exponent part to the constant, the mantissa part can be written as an integer constant (i.e., without the decimal point). This rather involved format can be described by the following regular expression:

[+-]?((([0-9]+.[0-9]*|.[0-9]+)([eE][+-]?[0-9]+)?)|[0-9]+[eE][+-]?[0-9]+)

This regular expression is complicated by the fact that the exponent is optional if the mantissa contains a decimal point, but not if it doesn't (as that would make the number an integer constant). We can make the description simpler if we make the regular expression for floats include integers, and instead use other means of distinguishing these (see section 2.9 for details). If we do this, the regular expression can be simplified to

[+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?)
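Transcribed into Python's re syntax (where the literal decimal point must be escaped as \.), the simplified expression can be tested directly; note that, as stated above, this version also matches plain integer constants. The check is added for illustration:

```python
import re

# [+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?) in Python syntax:
floats = re.compile(r"[+-]?(([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?)")

for s in ["3.14", "-.5", "2.", "1e10", "+2.5E-3", "42"]:
    assert floats.fullmatch(s), s        # all accepted ("42" as an integer)

for s in [".", "e5", "1e", "--1", "1.2.3"]:
    assert not floats.fullmatch(s), s    # all rejected
```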

String constants. A string constant starts with a quotation mark followed by a sequence of symbols and finally another quotation mark. There are usually some restrictions on the symbols allowed between the quotation marks. For example, line-feed characters or quotes are typically not allowed, though these may be represented by special sequences of other characters. As a (much simplified) example, we can by the following regular expression describe string constants where the allowed symbols are alphanumeric characters and sequences consisting of the backslash symbol followed by a letter (where each such pair is intended to represent a non-alphanumeric symbol):

"([a-zA-Z0-9]|\[a-zA-Z])*"
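In Python's re syntax the backslash in the character-pair alternative must itself be escaped as \\ (written here in a raw string). A quick check of the simplified string-constant expression, added for illustration:

```python
import re

# "([a-zA-Z0-9]|\[a-zA-Z])*" in Python syntax:
strings = re.compile(r'"([a-zA-Z0-9]|\\[a-zA-Z])*"')

assert strings.fullmatch('"abc123"')
assert strings.fullmatch('"a\\nb"')     # contains the pair backslash-n
assert strings.fullmatch('""')          # the empty string constant
assert not strings.fullmatch('"a b"')   # space not allowed
assert not strings.fullmatch('"a\\1"')  # backslash must be followed by a letter
assert not strings.fullmatch('"abc')    # unterminated
```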

2.3 Nondeterministic finite automata

In our quest to transform regular expressions into efficient programs, we use a stepping stone: Nondeterministic finite automata. By their nondeterministic nature, these are not quite as close to "real machines" as we would like, so we will later see how these can be transformed into deterministic finite automata, which are easily and efficiently executable on normal hardware.

A finite automaton is, in the abstract sense, a machine that has a finite number of states and a finite number of transitions between these. A transition between states is usually labelled by a character from the input alphabet, but we will also use transitions marked with e, the so-called epsilon transitions.

A finite automaton can be used to decide if an input string is a member in some particular set of strings. To do this, we select one of the states of the automaton as the starting state. We start in this state and in each step, we can do one of the following:

• Follow an epsilon transition to another state, or

• Read a character from the input and follow a transition labelled by that character.

When all characters from the input are read, we see if the current state is marked as being accepting. If so, the string we have read from the input is in the language defined by the automaton. We may have a choice of several actions at each step: We can choose between

This name or number has, however, no operational significance, it is solely used for identification purposes. Accepting states are denoted by using a double circle instead of a single circle. The initial state is marked by an arrow pointing to it from outside the automaton.

A transition is denoted by an arrow connecting two states. Near its midpoint, the arrow is labelled by the symbol (possibly e) that triggers the transition. Note that the arrow that marks the initial state is not a transition and is, hence, not marked by a symbol.

Repeating the maze analogue, the circles (states) are rooms and the arrows (transitions) are one-way corridors. The double circles (accepting states) are exits, while the unmarked arrow to the starting state is the entrance to the maze.

Figure 2.3 shows an example of a nondeterministic finite automaton having three states. State 1 is the starting state and state 3 is accepting. There is an epsilon-transition from state 1 to state 2, transitions on the symbol a from state 2 to states 1 and 3 and a transition on the symbol b from state 1 to state 3. This NFA recognises the language described by the regular expression a*(a|b). As an example, the string aab is recognised by the following sequence of transitions:

from  to  by
  1    2   e
  2    1   a
  1    2   e
  2    1   a
  1    3   b

At the end of the input we are in state 3, which is accepting. Hence, the string is accepted by the NFA. You can check this by placing a coin at the starting state and follow the transitions by moving the coin.

Note that we sometimes have a choice of several transitions. If we are in state 2 and the next symbol is an a, we can, when reading this, either go to state 1 or to state 3. Likewise, if we are in state 1 and the next symbol is a b, we can either read this and go to state 3 or we can use the epsilon transition to go directly to state 2 without reading anything. If we in the example above had chosen to follow the a-transition to state 3 instead of state 1, we would have been stuck: We would have no legal transition and yet we would not be at the end of the input. But, as previously stated, it is enough that there exists a path leading to acceptance, so the string aab is still accepted.

A program that decides if a string is accepted by a given NFA will have to check all possible paths to see if any of these accepts the string. This requires either backtracking until a successful path is found or simultaneously following all possible paths, both of which are too time-consuming to make NFAs suitable for efficient recognisers. We will, hence, use NFAs only as a stepping stone between regular expressions and the more efficient DFAs. We use this stepping stone because it makes the construction simpler than direct construction of a DFA from a regular expression.
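The NFA of figure 2.3 is small enough to simulate directly. The sketch below (illustrative code, not from the book) follows all possible paths at once by working with sets of states, which also previews the DFA construction of section 2.6:

```python
# The NFA of figure 2.3 as data: start state 1, accepting state 3,
# an epsilon-transition 1->2, a-transitions 2->1 and 2->3, b-transition 1->3.
EPS = "eps"
TRANS = {(1, EPS): {2}, (2, "a"): {1, 3}, (1, "b"): {3}}
START, ACCEPTING = 1, {3}

def eps_close(states):
    """Add every state reachable through epsilon-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in TRANS.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(string):
    """Follow all possible paths simultaneously, as described above."""
    current = eps_close({START})
    for ch in string:
        nxt = set()
        for s in current:
            nxt |= TRANS.get((s, ch), set())
        current = eps_close(nxt)
    return bool(current & ACCEPTING)

assert accepts("aab")        # the worked example above
assert accepts("b") and accepts("ab")
assert not accepts("") and not accepts("ba")
```

For a fixed small NFA this set-based simulation is fine; the point of sections 2.5-2.6 is that the same sets can be computed once and for all, turning the NFA into a DFA.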

Figure 2.3: Example of an NFA

2.4 Converting a regular expression to an NFA

We will construct an NFA compositionally from a regular expression, i.e., we will construct the NFA for a composite regular expression from the NFAs constructed for its subexpressions.

In principle, the shorthands from section 2.2.1 could be handled by expanding out all shorthand, e.g. converting s+ to ss*, [0-9] to 0|1|2|∙∙∙|9 and s? to s|e,

etc. However, this will result in very large NFAs for some expressions, so we use a few optimised constructions for the shorthands. Additionally, we show an alternative construction for the regular expression e. This construction doesn't quite follow the formula used in figure 2.4, as it doesn't have two half-transitions. Rather, the line-segment notation is intended to indicate that the NFA fragment for e just connects the half-transitions of the NFA fragments that it is combined with. In the construction for [0-9], the vertical ellipsis is meant to indicate that there is a transition for each of the digits in [0-9]. This construction generalises in the obvious way to other sets of characters, e.g., [a-zA-Z0-9]. We have not shown a special construction for s? as s|e will do fine if we use the optimised construction for e.

Figure 2.4: Constructing NFA fragments from regular expressions

Figure 2.5: NFA for the regular expression (a|b)*ac
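A Thompson-style version of the compositional construction can be sketched in Python. The details differ from the book's figures (no half-transitions, and the star case uses a single-state loop in the spirit of the optimised constructions), so treat this as an illustration of the idea rather than a transcription:

```python
import itertools

EPS = "eps"
_ids = itertools.count()

def _new():
    return next(_ids)

def _add(tr, s, a, t):
    tr.setdefault((s, a), set()).add(t)

def _merge(f, g):
    # States are globally unique, so fragments never share keys; copy the
    # edge sets so later additions don't mutate the input fragments.
    tr = {k: set(v) for k, v in f[2].items()}
    for k, v in g[2].items():
        tr.setdefault(k, set()).update(v)
    return tr

def lit(a):
    """Fragment for a single symbol a: (start, end, transitions)."""
    s, t = _new(), _new()
    tr = {}
    _add(tr, s, a, t)
    return (s, t, tr)

def concat(f, g):
    """Fragment for st: epsilon-edge from f's end to g's start."""
    tr = _merge(f, g)
    _add(tr, f[1], EPS, g[0])
    return (f[0], g[1], tr)

def alt(f, g):
    """Fragment for s|t: fresh start/end with epsilon-edges to both branches."""
    s, t = _new(), _new()
    tr = _merge(f, g)
    for frag in (f, g):
        _add(tr, s, EPS, frag[0])
        _add(tr, frag[1], EPS, t)
    return (s, t, tr)

def star(f):
    """Fragment for s*: one state looping through the body."""
    s = _new()
    tr = {k: set(v) for k, v in f[2].items()}
    _add(tr, s, EPS, f[0])
    _add(tr, f[1], EPS, s)
    return (s, s, tr)

def accepts(frag, string):
    """Set-of-states simulation, as for the NFA of figure 2.3."""
    start, end, tr = frag
    def close(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in tr.get((s, EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    cur = close({start})
    for ch in string:
        nxt = set()
        for s in cur:
            nxt |= tr.get((s, ch), set())
        cur = close(nxt)
    return end in cur

# The NFA for (a|b)*ac, as in figure 2.5:
nfa = concat(concat(star(alt(lit("a"), lit("b"))), lit("a")), lit("c"))
assert accepts(nfa, "ac") and accepts(nfa, "bac") and accepts(nfa, "abaac")
assert not accepts(nfa, "ab") and not accepts(nfa, "c") and not accepts(nfa, "")
```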

Figure 2.6: Optimised NFA construction for regular expression shorthands

Figure 2.7: Optimised NFA for [0-9]+

transition), essentially implementing the move function by table lookup. Another (one-dimensional) table can indicate which states are accepting.

DFAs have the same expressive power as NFAs: A DFA is a special case of NFA and any NFA can (as we shall shortly see) be converted to an equivalent DFA. However, this comes at a cost: The resulting DFA can be exponentially larger than the NFA (see section 2.10). In practice (i.e., when describing tokens for a programming language) the increase in size is usually modest, which is why most lexical analysers are based on DFAs.
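The table-driven execution described above can be sketched as follows; the particular DFA (for the language [0-9]+) and all names are invented for the example:

```python
# Transition table indexed by (state, character class); missing entries
# lead to the dead state 0. A one-dimensional table marks accepting states.
TABLE = {
    (1, "digit"): 2,   # first digit seen
    (2, "digit"): 2,   # further digits
}
ACCEPTING = [False, False, True]   # indexed by state number

def classify(ch):
    return "digit" if ch.isdigit() else "other"

def run(string):
    state = 1                                         # start state
    for ch in string:
        state = TABLE.get((state, classify(ch)), 0)   # one table lookup per move
    return ACCEPTING[state]

assert run("2015")
assert not run("")         # zero digits: rejected
assert not run("20a15")    # the dead state is a trap
```

Each input character costs exactly one lookup, which is why DFAs (and not NFAs) are the execution model of generated lexers.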

2.6 Converting an NFA to a DFA

As promised, we will show how NFAs can be converted to DFAs such that we, by combining this with the conversion of regular expressions to NFAs shown in section 2.4, can convert any regular expression to a DFA. The conversion is done by simulating all possible paths in an NFA at once.

This means that we operate with sets of NFA states: When we have several choices of a next state, we take all of the choices simultaneously and form a set of the possible next-states. The idea is that such a set of NFA states will become a single DFA state. For any given symbol we form the set of all possible next-states in the NFA, so we get a single transition (labelled by that symbol) going from one set of NFA states to another set. Hence, the transition becomes deterministic in the DFA that is formed from the sets of NFA states.

Epsilon-transitions complicate the construction a bit: Whenever we are in an NFA state we can always choose to follow an epsilon-transition without reading any symbol. Hence, given a symbol, a next-state can be found by either following a transition with that symbol or by first doing any number of epsilon-transitions and then a transition with the symbol. We handle this in the construction by first closing the set of NFA states under epsilon-transitions and then following transitions with input symbols. We define the epsilon-closure of a set of states as the set extended with all states that can be reached from these using any number of epsilon-transitions. More formally:

Definition 2.2 Given a set M of NFA states, we define e-closure(M) to be the least (in terms of the subset relation) solution to the set equation

e-closure(M) = M ∪ { t | s ∈ e-closure(M) and s -e-> t ∈ T }

where s -e-> t denotes an epsilon-transition from s to t, and T is the set of transitions in the NFA. We will later on see several examples of set equations like the one above, so we use some time to discuss how such equations can be solved.
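The least solution of definition 2.2 can be computed by fixed-point iteration: start from M and keep adding targets of epsilon-transitions out of the set until nothing more changes. A small sketch (the transition representation is invented for the example), tested on the NFA of figure 2.3:

```python
def eps_closure(M, transitions):
    """Least solution of e-closure(M) = M u {t | s in e-closure(M), s -e-> t in T},
    with transitions given as (source, label, target) triples."""
    closure = set(M)
    changed = True
    while changed:                        # iterate until a fixed point is reached
        changed = False
        for (s, label, t) in transitions:
            if label == "eps" and s in closure and t not in closure:
                closure.add(t)
                changed = True
    return closure

# Transitions of the NFA in figure 2.3:
T = {(1, "eps", 2), (2, "a", 1), (2, "a", 3), (1, "b", 3)}
assert eps_closure({1}, T) == {1, 2}      # state 2 reachable via epsilon
assert eps_closure({2}, T) == {2}
assert eps_closure({3}, T) == {3}
```

Termination is guaranteed because the closure only grows and is bounded by the finite set of NFA states — the same monotonicity argument that section 2.6.1 uses for set equations in general.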

2.6.1 Solving set equations

In general, a set equation over a single set-valued variable X has the form X = F(X) where F is a function from sets to sets. Not all such equations are solvable, so we will restrict ourselves to special cases, which we will describe below. We will use calculation of epsilon-closure as the driving example. In definition 2.2, e-closure(M) is the value we have to find, so we replace this