









CS143                                                          Handout 07
Autumn 2007                                                    October 3, 2007
Handout written by Maggie Johnson and revised by Julie Zelenski.

Possible Approaches
The syntax analysis phase of a compiler verifies that the sequence of tokens extracted by the scanner represents a valid sentence in the grammar of the programming language. There are two major parsing approaches: top-down and bottom-up. In top-down parsing, you start with the start symbol and apply the productions until you arrive at the desired string. In bottom-up parsing, you start with the string and reduce it to the start symbol by applying the productions backwards.

As an example, let's trace through the two approaches on this simple grammar that recognizes strings consisting of any number of a's followed by at least one (and possibly more) b's:

    S –> AB
    A –> aA | ε
    B –> b | bB

Here is a top-down parse of aaab. We begin with the start symbol and at each step, expand one of the remaining nonterminals by replacing it with the right side of one of its productions. We repeat until only terminals remain. The top-down parse produces a leftmost derivation of the sentence.

    S
    AB           S –> AB
    aAB          A –> aA
    aaAB         A –> aA
    aaaAB        A –> aA
    aaaεB        A –> ε
    aaab         B –> b

A bottom-up parse works in reverse. We begin with the sentence of terminals, and each step applies a production in reverse, replacing a substring that matches the right side with the nonterminal on the left. We continue until we have substituted our way back to the start symbol. If you read the bottom-up parse from bottom to top, it traces out a rightmost derivation of the sentence.

    aaab
    aaaεb        (insert ε)
    aaaAb        A –> ε
    aaAb         A –> aA
    aAb          A –> aA
    Ab           A –> aA
    AB           B –> b
    S            S –> AB
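The top-down strategy maps directly onto code. Below is a minimal, hypothetical C recognizer for the same grammar, choosing each production by peeking at the next input character (the one-symbol-of-lookahead idea developed below). The function names and representation are ours, not part of the handout.

```c
#include <assert.h>

/* Hypothetical sketch: a predictive recognizer for the grammar
   S -> AB, A -> aA | epsilon, B -> b | bB, deciding each
   production with one symbol of lookahead. */

static const char *input;       /* remaining input */

static int parseA(void) {
    if (*input == 'a') {        /* predict A -> aA */
        input++;
        return parseA();
    }
    return 1;                   /* predict A -> epsilon */
}

static int parseB(void) {
    if (*input != 'b') return 0;   /* both alternatives start with b */
    input++;
    if (*input == 'b')          /* predict B -> bB */
        return parseB();
    return 1;                   /* predict B -> b */
}

static int parseS(const char *s) {
    input = s;
    return parseA() && parseB() && *input == '\0';
}
```

With input aaab the recognizer consumes the a's in parseA and the b in parseB, mirroring the leftmost derivation above.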
In creating a parser for a compiler, we normally have to place some restrictions on how we process the input. In the above example, it was easy for us to see which productions were appropriate because we could see the entire string aaab. In a compiler's parser, however, we don't have long-distance vision. We are usually limited to just one symbol of lookahead. The lookahead symbol is the next symbol coming up in the input. This restriction certainly makes parsing more challenging. Using the same grammar from above, if the parser sees only a single b in the input and cannot look any further ahead, it can't know whether to use the production B –> b or B –> bB.

Backtracking
One solution would be to implement backtracking. Based on the information the parser currently has about the input, a decision is made to go with one particular production. If this choice leads to a dead end, the parser backtracks to that decision point, moving backwards through the input, and makes a different choice, and so on, until it either finds a production that works or runs out of choices. For example, consider this simple grammar:

    S –> bab | bA
    A –> d | cA

Let's follow parsing the input bcd. In the trace below, the column on the left is the expansion thus far, the middle is the remaining input, and the right is the action attempted at each step:

    S       bcd     Try S –> bab
    bab     bcd     match b
    ab      cd      dead-end, backtrack
    S       bcd     Try S –> bA
    bA      bcd     match b
    A       cd      Try A –> d
    d       cd      dead-end, backtrack
    A       cd      Try A –> cA
    cA      cd      match c
    A       d       Try A –> d
    d       d       match d
                    Success!

As you can see, each time we hit a dead end, we back up to the last decision point, unmake that decision, and try another alternative. If all alternatives have been exhausted, we back up to the preceding decision point, and so on. This continues until we either find a working parse or have exhaustively tried all combinations without success.
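The trace above can be sketched in C. In this hypothetical version (names and representation are ours), each parse function takes a position in the input and returns the position after a successful match, or -1 on failure; trying the next alternative after a failure is the backtrack.

```c
#include <assert.h>

/* Hypothetical backtracking recognizer for S -> bab | bA, A -> d | cA. */

static const char *text;

static int match(int pos, char c) {
    return (text[pos] == c) ? pos + 1 : -1;
}

static int parseA(int pos) {
    int p;
    if ((p = match(pos, 'd')) >= 0) return p;          /* try A -> d */
    if ((p = match(pos, 'c')) >= 0) return parseA(p);  /* try A -> cA */
    return -1;
}

static int parseS(int pos) {
    int p;
    /* try S -> bab */
    if ((p = match(pos, 'b')) >= 0 &&
        (p = match(p, 'a')) >= 0 &&
        (p = match(p, 'b')) >= 0)
        return p;
    /* dead end: backtrack to pos and try S -> bA */
    if ((p = match(pos, 'b')) >= 0)
        return parseA(p);
    return -1;
}

static int accepts(const char *s) {
    text = s;
    int p = parseS(0);
    return p >= 0 && s[p] == '\0';
}
```

On input bcd, the S –> bab attempt dies at the second character, the parse restarts from the saved position with S –> bA, and A –> cA followed by A –> d succeeds, exactly as in the trace.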
A number of authors have described backtracking parsers; the appeal is that they can be used for a variety of grammars without requiring them to fit any specific form. For a
    void ParseFunction()
    {
        if (lookahead != T_FUNC) {      // anything not FUNC here is wrong
            printf("syntax error \n");
            exit(0);
        } else
            lookahead = yylex();        // global 'lookahead' holds next token
        ParseIdentifier();
        if (lookahead != T_LPAREN) {
            printf("syntax error \n");
            exit(0);
        } else
            lookahead = yylex();
        ParseParameterList();
        if (lookahead != T_RPAREN) {
            printf("syntax error \n");
            exit(0);
        } else
            lookahead = yylex();
        ParseStatements();
    }

To make things a little cleaner, let's introduce a utility function that verifies that the next token is what is expected, and errors and exits otherwise. We will need this again and again in writing the parsing routines.

    void MatchToken(int expected)
    {
        if (lookahead != expected) {
            printf("syntax error, expected %d, got %d\n", expected, lookahead);
            exit(0);
        } else                          // if match, consume token and move on
            lookahead = yylex();
    }

Now we can tidy up the ParseFunction routine and make it clearer what it does:

    void ParseFunction()
    {
        MatchToken(T_FUNC);
        ParseIdentifier();
        MatchToken(T_LPAREN);
        ParseParameterList();
        MatchToken(T_RPAREN);
        ParseStatements();
    }
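These routines assume the global lookahead has already been primed with the first token before any parse function runs. Here is a hypothetical, self-contained sketch of that setup: yylex() is faked with a token array, the token codes are made up for illustration, and MatchToken reports failure by returning 0 rather than calling exit() so the sketch is easy to test.

```c
#include <assert.h>
#include <stdio.h>

/* Made-up token codes, for illustration only. */
enum { T_FUNC = 1, T_IDENT, T_LPAREN, T_RPAREN, T_EOF };

static const int *tokens;   /* fake scanner input */
static int lookahead;       /* current token: one symbol of lookahead */

static int yylex(void) { return *tokens++; }

static int MatchToken(int expected) {
    if (lookahead != expected) {
        printf("syntax error, expected %d, got %d\n", expected, lookahead);
        return 0;
    }
    lookahead = yylex();    /* on match, consume token and move on */
    return 1;
}

/* Parse just the header of: function -> FUNC identifier ( ... ) */
static int ParseFunctionHeader(const int *input) {
    tokens = input;
    lookahead = yylex();    /* prime the lookahead before parsing begins */
    return MatchToken(T_FUNC) && MatchToken(T_IDENT) &&
           MatchToken(T_LPAREN) && MatchToken(T_RPAREN);
}
```

The important detail is the priming call: every parse routine relies on lookahead already holding the next unconsumed token when it is entered.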
The following diagram illustrates how the parse tree is built:

    [Diagram: the parse tree for program –> function_list –> function. The
    FUNC token begins the function; the identifier subtree is parsed by the
    call to ParseIdentifier, the parameter_list subtree by the call to
    ParseParameterList, and the statements subtree by the call to
    ParseStatements.]

Here is the production for an if-statement in this language:

    if_statement –> IF expression THEN statement ENDIF
                  | IF expression THEN statement ELSE statement ENDIF

To prepare this grammar for recursive descent, we must left-factor to share the common parts:

    if_statement –> IF expression THEN statement close_if
    close_if –> ENDIF
              | ELSE statement ENDIF

Now, let's look at the recursive-descent functions to parse an if statement:

    void ParseIfStatement()
    {
        MatchToken(T_IF);
        ParseExpression();
        MatchToken(T_THEN);
        ParseStatement();
        ParseCloseIf();
    }

    void ParseCloseIf()
    {
        if (lookahead == T_ENDIF)   // if we immediately find ENDIF
            lookahead = yylex();    // predict close_if –> ENDIF
        else {
            MatchToken(T_ELSE);     // otherwise we look for ELSE
            ParseStatement();       // predict close_if –> ELSE stmt ENDIF
            MatchToken(T_ENDIF);
        }
    }

When parsing the closing portion of the if, we have to decide which of the two right-hand side options to expand. In this case, it isn't too difficult. We first try to match the token against ENDIF; on a non-match, we try to match the ELSE clause, and if that doesn't match either, an error is reported.
languages, it is usually possible to re-structure the productions or embed certain rules into the parser to resolve conflicts, but this constraint is one of the weaknesses of the top-down non-backtracking approach.

It is a bit trickier if the nonterminal we are trying to recognize is nullable. A nonterminal A is nullable if there is a derivation of A that results in ε (i.e., the nonterminal would completely disappear in the parsed string), i.e., ε ∈ First(A). In this case A could be replaced by nothing, and the next token would be the first token of the symbol following A in the sentence being parsed. Thus if A is nullable, our predictive parser also needs to consider the possibility that the path to choose is the one corresponding to A =>* ε. To deal with this we define the following:

The follow set of a nonterminal A is the set of terminal symbols that can appear immediately to the right of A in a valid sentence. A bit more formally: for every valid sentence S =>* uAv, where v begins with some terminal, that terminal is in Follow(A).

Informally, you can think about the follow set like this: A can appear in various places within a valid sentence. The follow set describes what terminals could follow the sentential form that was expanded from A. We will detail how to calculate the follow set a bit later. For now, realize that follow sets are useful because they define the right context consistent with a given nonterminal and provide the lookahead that might signal that a nullable nonterminal should be expanded to ε.

With these two definitions, we can now generalize how to handle A –> u1 | u2 | ... in a recursive-descent parser. In all situations, we need a case to handle each member in First(ui). In addition, if there is a derivation from any ui that could yield ε (i.e., if it is nullable), then we also need to handle the members in Follow(A).
    void ParseA()
    {
        switch (lookahead) {
            case First(u1):
                /* code to recognize u1 */
                return;
            case First(u2):
                /* code to recognize u2 */
                return;
            ...
            case Follow(A):     // predict production A –> ε if A is nullable
                /* usually do nothing here */
            default:
                printf("syntax error \n");
                exit(0);
        }
    }
What about left-recursive productions? Now we see why these are such a problem in a predictive parser. Consider this left-recursive production that matches a list of one or more functions:

    function_list –> function_list function
                   | function
    function –> FUNC identifier ( parameter_list ) statement

    void ParseFunctionList()
    {
        ParseFunctionList();
        ParseFunction();
    }

Such a production will send a recursive-descent parser into an infinite loop! We need to remove the left recursion in order to be able to write the parsing function for a function_list.

    function_list –> function_list function
                   | function

becomes

    function_list –> function function_list
                   | function

then we must left-factor the common parts:

    function_list –> function more_functions
    more_functions –> function more_functions | ε

And now the parsing function looks like this:

    void ParseFunctionList()
    {
        ParseFunction();
        ParseMoreFunctions();   // may be empty (i.e., expand to ε)
    }

Computing first and follow
These are the algorithms used to compute the first and follow sets:

Calculating first sets. To calculate First(u) where u has the form X1X2...Xn, do the following:

a) If X1 is a terminal, add X1 to First(u) and you're finished.
b) Else X1 is a nonterminal, so add First(X1) - ε to First(u).
   a. If X1 is a nullable nonterminal, i.e., X1 =>* ε, add First(X2) - ε to First(u). Furthermore, if X2 can also go to ε, then add First(X3) - ε, and so on, through all Xn until the first non-nullable symbol is encountered.
   b. If X1X2...Xn =>* ε, add ε to the first set.
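The first-set rules above lend themselves to a fixed-point computation: keep applying the rules to every production until no set changes. Here is a hypothetical C sketch, applied to the a's-then-b's grammar from the start of the handout; the encoding (uppercase nonterminals, lowercase terminals, bitmask sets) is ours.

```c
#include <assert.h>

/* Fixed-point First-set computation for S -> AB, A -> aA | eps, B -> b | bB.
   Productions are strings "LHS:RHS"; an empty RHS means epsilon.
   first[X] is a bitmask over 'a'..'z'; nullable[X] records whether X =>* eps. */

#define NT(c) ((c) - 'A')

static const char *prods[] = { "S:AB", "A:aA", "A:", "B:b", "B:bB" };
static unsigned first[26];
static int nullable[26];

static void compute_first(int nprods) {
    int changed = 1;
    while (changed) {                       /* repeat until nothing changes */
        changed = 0;
        for (int i = 0; i < nprods; i++) {
            int lhs = NT(prods[i][0]);
            const char *rhs = prods[i] + 2;
            unsigned add = 0;
            int all_nullable = 1;
            for (const char *p = rhs; *p && all_nullable; p++) {
                if (*p >= 'a' && *p <= 'z') {   /* terminal: rule (a), stop */
                    add |= 1u << (*p - 'a');
                    all_nullable = 0;
                } else {                        /* nonterminal: rule (b) */
                    add |= first[NT(*p)];
                    all_nullable = nullable[NT(*p)];
                }
            }
            unsigned before = first[lhs];
            first[lhs] |= add;
            if (first[lhs] != before) changed = 1;
            if (all_nullable && !nullable[lhs]) {   /* X1..Xn =>* eps */
                nullable[lhs] = 1;
                changed = 1;
            }
        }
    }
}
```

Running this yields First(A) = { a } with A nullable, First(B) = { b }, and First(S) = { a b }, since A's nullability lets First(B) flow into First(S).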
Start with First(A) - ε, then add First(B) (since A is nullable). We don't add ε, since S itself is not nullable: A can go away, but B cannot.

It is usually convenient to compute the first sets for the nonterminals that appear toward the bottom of the parse tree and work your way upward, since the nonterminals toward the top may need to incorporate the first sets of the nonterminals that appear beneath them in the tree.

To compute the follow sets, take each nonterminal and go through all the right-side productions that the nonterminal is in, matching to the steps given earlier:

    Follow(S) = { $ }
        S doesn't appear on the right-hand side of any production. We put $ in
        the follow set because S is the start symbol.

    Follow(B) = { $ }
        B appears on the right-hand side of the S –> AB production. Its follow
        set is the same as S's.

    Follow(B') = { $ }
        B' appears on the right-hand side of two productions. The B' –> aACB'
        production tells us its follow set includes the follow set of B', which
        is tautological. From B –> cB', we learn its follow set is the same as B's.

    Follow(C) = { a $ }
        C appears on the right-hand side of two productions. The production
        A –> Ca tells us a is in the follow set. From B' –> aACB', we add
        First(B'), which is just a again. Because B' is nullable, we must also
        add Follow(B'), which is $.

    Follow(A) = { c b a $ }
        A appears on the right-hand side of two productions. From S –> AB we
        add First(B), which is just c (B is not nullable). From B' –> aACB',
        we add First(C), which is b. Since C is nullable, we also include
        First(B'), which is a. B' is also nullable, so we include Follow(B'),
        which adds $.

It can be convenient to compute the follow sets for the nonterminals that appear toward the top of the parse tree and work your way down, but sometimes you have to circle around computing the follow sets of other nonterminals in order to complete the one you're on.
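The follow-set rules also amount to a fixed point: for each occurrence of a nonterminal B in a production A –> uBv, add First(v) - ε to Follow(B), and if v is nullable (or empty), add Follow(A) as well. A hypothetical C sketch, applied to the simpler a's-then-b's grammar (the encoding and hand-entered first sets are ours):

```c
#include <assert.h>
#include <string.h>

/* Follow-set fixed point for S -> AB, A -> aA | eps, B -> b | bB.
   Sets are bitmasks over the terminal alphabet TERMS. */

static const char TERMS[] = "ab$";
#define BIT(c) (1u << (int)(strchr(TERMS, (c)) - TERMS))
#define NT(c)  ((c) - 'A')

static const char *prods[] = { "S:AB", "A:aA", "A:", "B:b", "B:bB" };
static unsigned first[26], follow[26];
static int nullable[26];

static int is_nonterm(char c) { return c >= 'A' && c <= 'Z'; }

static void init_first(void) {          /* first sets, worked out by hand */
    first[NT('A')] = BIT('a'); nullable[NT('A')] = 1;
    first[NT('B')] = BIT('b');
    first[NT('S')] = BIT('a') | BIT('b');
}

/* First(v) for a suffix v of a right-hand side; *eps is set if v =>* eps */
static unsigned first_of(const char *v, int *eps) {
    unsigned s = 0;
    for (; *v; v++) {
        if (!is_nonterm(*v)) { *eps = 0; return s | BIT(*v); }
        s |= first[NT(*v)];
        if (!nullable[NT(*v)]) { *eps = 0; return s; }
    }
    *eps = 1;
    return s;
}

static void compute_follow(int nprods) {
    follow[NT('S')] |= BIT('$');        /* $ follows the start symbol */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int i = 0; i < nprods; i++) {
            int lhs = NT(prods[i][0]);
            for (const char *p = prods[i] + 2; *p; p++) {
                if (!is_nonterm(*p)) continue;
                int eps;
                unsigned add = first_of(p + 1, &eps);
                if (eps) add |= follow[lhs];    /* rest nullable: inherit Follow(lhs) */
                unsigned before = follow[NT(*p)];
                follow[NT(*p)] |= add;
                if (follow[NT(*p)] != before) changed = 1;
            }
        }
    }
}
```

For this grammar the fixed point gives Follow(S) = { $ }, Follow(A) = { b } (from First(B) in S –> AB), and Follow(B) = { $ } (inherited from Follow(S) because B ends that production).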
The calculation of the first and follow sets follows mechanical algorithms, but it is very easy to get tripped up in the details and make mistakes even when you know the rules. Be careful!
Table-driven LL(1) Parsing
In a recursive-descent parser, the production information is embedded in the individual parse functions for each nonterminal, and the run-time execution stack keeps track of our progress through the parse. There is another method for implementing a predictive parser that uses a table to store the production information, along with an explicit stack to keep track of where we are in the parse. This grammar for add/multiply expressions is already set up to handle precedence and associativity:

    E –> E + T | T
    T –> T * F | F
    F –> (E) | int

After removal of left recursion, we get:

    E –> TE'
    E' –> + TE' | ε
    T –> FT'
    T' –> * FT' | ε
    F –> (E) | int

One way to illustrate the process is to study some transition graphs that represent the grammar:

    [Diagram: transition graphs for E, E', T, T', and F.]
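The table-plus-stack mechanism itself can be sketched in C. In this hypothetical version (the representation is ours), E' and T' are renamed P and Q so every grammar symbol is a single character, 'i' stands for the int token, and the parse table the handout constructs below is hard-coded; empty table cells are errors.

```c
#include <assert.h>
#include <string.h>

/* Table-driven LL(1) parse loop for the expression grammar, with
   E' and T' renamed P and Q, and 'i' standing for int. */

static const char *table_lookup(char nt, char term) {
    static const struct { char nt, t; const char *rhs; } M[] = {
        {'E','i',"TP"},  {'E','(',"TP"},
        {'P','+',"+TP"}, {'P',')',""},   {'P','$',""},
        {'T','i',"FQ"},  {'T','(',"FQ"},
        {'Q','*',"*FQ"}, {'Q','+',""},   {'Q',')',""}, {'Q','$',""},
        {'F','i',"i"},   {'F','(',"(E)"},
    };
    for (size_t i = 0; i < sizeof M / sizeof M[0]; i++)
        if (M[i].nt == nt && M[i].t == term) return M[i].rhs;
    return NULL;    /* empty cell: syntax error */
}

static int ll1_parse(const char *input) {
    char stack[64] = { '$', 'E' };   /* $ on the bottom, start symbol on top */
    int top = 1;                     /* fixed-size stack is fine for a sketch */
    while (top >= 0) {
        char X = stack[top], a = *input;
        if (X == a) {                /* top matches lookahead: pop and advance */
            top--;
            input++;
        } else if (X >= 'A' && X <= 'Z') {
            const char *rhs = table_lookup(X, a);
            if (!rhs) return 0;      /* no prediction: error */
            top--;                   /* pop the nonterminal ... */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[++top] = rhs[i];   /* ... and push its RHS reversed */
        } else {
            return 0;                /* terminal mismatch */
        }
    }
    return 1;                        /* matched through the bottom $: success */
}
```

Running ll1_parse on "i+i*i$" reproduces the predict/match trace shown below, and "+$" fails immediately because the M[E, +] cell is empty.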
A predictive parser behaves as follows. Let's assume the input string is 3 + 4 * 5. Parsing begins in the start state of the symbol E and moves to the next state. This transition is marked with a T, which sends us to the start state for T. This, in turn, sends us to the start state for F. F has only terminals, so we read a token from the input string. It must be either an open parenthesis or an integer in order for this parse to be valid. We consume the integer token, and thus we have hit a final state in the F transition
    Parse stack     Remaining input         Parser action
    E$              int + int * int$        Predict E –> TE'
    TE'$            int + int * int$        Predict T –> FT'
    FT'E'$          int + int * int$        Predict F –> int
    intT'E'$        int + int * int$        Match int, pop from stack, move ahead in input
    T'E'$           + int * int$            Predict T' –> ε
    E'$             + int * int$            Predict E' –> + TE'
    +TE'$           + int * int$            Match +, pop
    TE'$            int * int$              Predict T –> FT'
    FT'E'$          int * int$              Predict F –> int
    intT'E'$        int * int$              Match int, pop
    T'E'$           * int$                  Predict T' –> * FT'
    *FT'E'$         * int$                  Match *, pop
    FT'E'$          int$                    Predict F –> int
    intT'E'$        int$                    Match int, pop
    T'E'$           $                       Predict T' –> ε
    E'$             $                       Predict E' –> ε
    $               $                       Match $, pop, success!

Suppose, instead, that we were trying to parse the input +$. The first step of the parse would give an error because there is no entry at M[E, +].

Constructing The Parse Table
The next task is to figure out how we build the table. The construction of the table is somewhat involved and tedious (the perfect task for a computer, but error-prone for humans). The first thing we need to do is compute the first and follow sets for the grammar:

    E –> TE'
    E' –> + TE' | ε
    T –> FT'
    T' –> * FT' | ε
    F –> (E) | int

    First(E) = First(T) = First(F) = { ( int }
    First(T') = { * ε }
    First(E') = { + ε }

    Follow(E) = Follow(E') = { $ ) }
    Follow(T) = Follow(T') = { + $ ) }
    Follow(F) = { * + $ ) }

Once we have the first and follow sets, we build a table M with the leftmost column labeled with all the nonterminals in the grammar and the top row labeled with all the terminals in the grammar, along with $. The following algorithm fills in the table cells:
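The standard filling rule is: for each production A –> u, set M[A, t] = u for every terminal t in First(u), and, if u is nullable, also for every t in Follow(A). A hypothetical C sketch of this rule applied to the expression grammar (with E' and T' renamed P and Q, 'i' for int, and the first/follow sets above entered by hand):

```c
#include <assert.h>
#include <string.h>

/* Build the LL(1) table M for the expression grammar.  For each
   production A -> u: M[A, t] = u for t in First(u); if u is nullable,
   M[A, t] = u for t in Follow(A) as well. */

static const char TERMS[] = "+*()i$";
#define T(c)  ((int)(strchr(TERMS, (c)) - TERMS))
#define NT(c) ((c) - 'A')

static const char *M[26][6];    /* M[nonterminal][terminal]; NULL = error */

static void add_entries(char A, const char *u,
                        const char *firsts, const char *follows) {
    for (const char *p = firsts; *p; p++)       /* t in First(u) */
        M[NT(A)][T(*p)] = u;
    if (follows)                                /* u nullable */
        for (const char *p = follows; *p; p++)  /* t in Follow(A) */
            M[NT(A)][T(*p)] = u;
}

static void build_table(void) {
    /* First and Follow sets from the handout (P = E', Q = T') */
    add_entries('E', "TP",  "(i", NULL);
    add_entries('P', "+TP", "+",  NULL);
    add_entries('P', "",    "",   ")$");    /* E' -> eps on Follow(E') */
    add_entries('T', "FQ",  "(i", NULL);
    add_entries('Q', "*FQ", "*",  NULL);
    add_entries('Q', "",    "",   "+)$");   /* T' -> eps on Follow(T') */
    add_entries('F', "(E)", "(",  NULL);
    add_entries('F', "i",   "i",  NULL);
}
```

After build_table runs, M[E, int] holds TE', M[E', $] holds ε, and M[E, +] is still empty, which is exactly why the input +$ fails on its first step.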
cascading errors when the variable is later used, but it might get through the trouble spot. We could also use the symbols in First(A) as a synchronizing set for re-starting the parse of A. This would allow an input like "junk double d;" to parse as a valid variable declaration, skipping past the junk.