Agent Work: Parsing and Abstract Syntax
Claude Opus 4.6 · COMP 411: Principles of Programming Languages
Assignment 1: Parsing and Abstract Syntax
Overview
Testing: Use bin/grade or the grade tool to run the test suite.
Preparation
This assignment requires Java 8 (Amazon Corretto 8 or Oracle JDK8). The solution must support the same API as the starter files. Fill in the elided code marked by ... in the stub files. You are free to add any additional classes. If you add JUnit test classes, ensure the class name ends in "Test" and that test methods begin with the prefix "test" (JUnit 3.8 conventions).
Testing You are expected to write unit tests for all of your non-trivial methods or functions. You can run the sample test suite in Assign1Test.java to confirm your program supports the required API.
Task Your task is to write a parser in Java 8+ that recognizes expressions in the pedagogic functional language Jam, which will be defined subsequently. A lexer is provided. This lexer returns a larger set of symbols (tokens) than what appears in the Jam Syntax Definition below. These extra symbols anticipate future extensions to Jam in later projects. At this point, your parser should simply reject any program containing one of these extra symbols as ill-formed, just like it would for any other syntax error.
Java Support Library In the provided support files, the parser's input token stream is generated by a Lexer object supporting the 0-ary method readToken(). The readToken() method produces a stream of Token objects belonging to various subclasses of the Token interface. The Lexer class supports two constructors: (i) a zero-ary constructor that converts standard input to a Lexer and (ii) a unary constructor Lexer(String fileName) that converts the specified file to a Lexer. The file Lexer.java contains the definition of the Lexer class and the collection of classes representing tokens (the Java interface type Token) and abstract syntax trees (the Java interface type AST).
The API That Your Program Must Support
The API provided in the starter repository exclusively uses must be in the default (top-level) package. All of the classes that you create to augment those in the starter repository should be in the empty package.
The starter repository contains a folder (directory) called src containing of the provided source files. We recommend that you create a parallel directory named classes to hold all of the class files generated by the javac compiler. You can easily do this in DrJava by creating a project with src as the root of the source folder and classes as the build folder.
In the src folder, the stub file Parser.java includes constructors Parser(Reader stream) and Parser(String filename) that a create parser to read the specified stream or file. The Reader variant of these constructors is very convenient when you are testing your parser from the DrJava read-eval-print loop or from unit test methods. The String variant supports parsing Jam programs that reside in files.
The stub Parser class contains a public method AST parse() with a stubbed out body that returns the abstract syntax tree for the Jam expression in the input stream. It also contains one parse method private AST parseExp() with a stubbed out body that is the primary helper method for parse() along with a contract specifying what parseExp does. The grading program will print the output using the method System.out.println, which invokes the toString method on its argument. All of the requisite abstract syntax classes including toString methods are defined in the file AST.java in the starter repository. All of these support files are located in the folder src in the starter repository for Assignment 1.
If your parser encounters an erroneous input (including illegal tokens), simply throw a ParseException containing a message explanation explaining the error. This operation will abort parsing the program. The parser in this assignment does not attempt to recover from syntax errors to continue parsing. This behavior requires a complex specification including many special cases, which is beyond the scope of this assignment.
The starter repository should compile without any errors but the unit test suite in Assign1Test.java will fail since it tries to test unwritten code. This test suite is grossly incomplete; it includes a few sample tests as examples to show you how should write the remaining tests. One measure of the thoroughness of your test suite is how well it covers the statements and branches in your program. Such coverage will almost certainly be incomplete (particularly the branches) since some points in the code may be unreachable and some branches may be impossible to traverse.
Jam Syntax Definition
The following paragraphs define the parser's input language and the subset that your parser should recognize as legal Jam programs.
The Meta-Language
The parser's input language is described using Extended Backus-Naur Form (EBNF). When reading the notation, keep the following in mind:
- A symbol in the meta-language always starts with a capital letter. Therefore,
Symbolis a symbol in the-meta language, whereasliteralis the actual text "literal". Note that this convention depends on the fact that no literals in Jam begin with a capital letter. - A phrase *P* enclosed in braces **{ *P* }** is optional, i.e. both *P* and the empty string match **{ *P* }**.
- A phrase *P* followed by \* is iterated zero or more times, e.g. Def\* means the concatenation of zero or more Def phrases:
* Def * Def Def * ... * and so on, including the empty string.
- A phrase *P* followed by + is iterated one or more times, e.g. Def+ means the concatenation of one or more Def phrases:
* Def * Def Def * ... * and so on, but not the empty string.
- If a phrase *P* enclosed in brackets **{ *P* } is followed by a \* or +, the braces simply denote grouping: the enclosed phrase is iterated as specified by the \* or + symbol. For example, { A B }\* is matched by zero or more concatenations of A B**:
* A B * A B A B * and so on, as well as the empty string, but not A or A B A.
- If something that is interpreted as part of the meta language should be interpreted literally as part of the language we are describing, it will be enclosed in *single* quotes. Therefore, 'A' is the actual letter "A", not the symbol A in the meta language, and **P '\*' is the phrase P followed by an actual asterisk, and does not mean that the phase P** should be repeated zero or more times, as described above.
The Input Language
The parser's input language (modulo a few changes described in the section Left-associative Binary Operators) is Input where:
Input ::= Token*
Token ::= AlphaOther AlphaOtherNumeric* | Delimiter | Operator | IntThe lexer recognizes keywords, delimiters (parentheses, commas, semicolons), primitive operations, identifiers, and numbers.
Adjacent tokens must be separated by whitespace (a nonempty sequence of spaces, tabs, and newlines) unless one of the tokens is a delimiter or operator. In identifying operators, the lexer chooses the longest possible match. Hence, <= is interpreted as a single token rather than < followed by =.
Lower ::= a | b | c | d | ... | z
Upper ::= 'A' | 'B' | 'C' | 'D' | ... | 'Z'
Other ::= ? | _
Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Alpha ::= Upper | Lower
AlphaOther ::= Alpha | Other
AlphaOtherNumeric ::= AlphaOther | Digit
Delimiter ::= ( | ) | [ | ] | , | ;
Operator ::= '+' | - | ~ | '*' | / | = | != | < | > | <= | >= | & | '|' | :=
Int ::= Digit+Note: The terminal symbols 'A', 'B', ... are enclosed in single quotation marks because A, B, ... would automatically be the names of non-terminal symbols in our notation system. For essentially the same reason, we enclose the symbols '+', '\*', and '|' in single quotation marks; these symbols without quotation marks are metasymbols in the notation scheme described above.
Jam The set of legitimate Jam phrases is the subset of the potential inputs consisting the expressions *Exp* defined by the following grammar:
Expressions
Exp ::= Term { Binop Exp }
| if Exp then Exp else Exp
| let Def+ in Exp
| map IdList to Exp
Term ::= Unop Term
| Factor { ( ExpList ) }
| Empty
| Int
| Bool
Factor ::= ( Exp ) | Prim | Id
ExpList ::= { PropExpList }
PropExpList ::= Exp { , Exp }*
IdList ::= { PropIdList }
PropIdList ::= Id { , Id }*Definitions
Def ::= Id := Exp ;Primitive Constants, Operators, and Operations
Empty ::= empty
Bool ::= true | false
Unop ::= Sign | ~
Sign ::= '+' | -
Binop ::= Sign | '*' | / | = | != | < | > | <= | >= | & | '|'
Prim ::= number? | function? | list? | empty? | cons? | cons | first | rest | arityIdentifiers
Id ::= AlphaOther {AlphaOther | Digit}*Please note that *Id* does not contain anything that matches symbols in *Prim* or the keywords if, then, else, map, to, let, in, empty, true, and false.
Numbers
Int ::= Digit+The preceding grammar requires one symbol look-ahead in a few situations. The Java lexer in our support code includes a method Token peek() that behaves exactly like the method Token readToken() except for the fact that it leaves the scanned token at the front of the input stream (in contrast readToken() removes the scanned token from the input stream).
The Java file lexer.java defines all of the abstract syntax classes required to represent Jam programs using the composite design pattern. There is one constructor for each fundamentally different form of expression in the definition of Jam syntax given above. Some primitive (non-recursive) abstract syntax classes are the same the corresponding token classes.
For example, suppose the Jam program under consideration is
f(x) + (x * 12)The abstract syntax representation for this program phrase is:
new BinOpApp(
new Op("+"),
new App(new Variable("f"), new AST[] { new Variable("x") }),
new BinOpApp(new Op("*"), new Variable("x"), new IntConstant(12))
)This construction does not tell the whole story. The lexer is guaranteed to always return the *same* object for a given operator. Similarly, there is only copy of each variable encountered by the lexer. Hence, there is only one copy of Op("+"), one copy of the **Op("\*"), one copy of the Variable("x")**, *etc*. As a result, the == operator in Java can be used to compare operators and variables. Note that the operators Op("+") and Op("-") can be used either as unary or binary operators.
Left-Associative Binary Operators
The preceding grammar for Jam has an annoying flaw. To make the language easy to parse using recursive descent (top-down) parsing, we used right recursion in the definition of arithmetic expressions (Exp). (Left recursion forces infinite looping in top-down parsing.) But right recursion yields right-associative ASTs (where computation must be performed right to left), while the usual mathematical convention in arithmetic expressions is that binary operators are left-associative (where computation must be performed left to right). There is a widely-used trick in writing grammars for top-down parsing using syntax diagrams that avoids this problem. The trick is replace right recursion in a simple production (only one alternative on the right hand side) by iteration: an edge looping back to the input edge of the diagram. We used this trick in generating the syntax diagrams for ExpList and IdList for our original grammar. Given our original Jam grammar, we can convert the production
Exp ::= Term { Binop Exp } | ...to
Exp ::= BinaryExp | ...
BinaryExp ::= Term { Binop BinaryExp }at the cost of excluding chains of binary operations terminating in an expression other than a term. Then this revised grammar can be represented using syntax diagrams that define BinaryExp iteratively (just like PropExpList in the original grammar).
The preceding definition of BinaryExp generates the same set of strings as the extended rule:
BinaryExp ::= Term { BinOp Term }*which is ambiguous about left or right associativity. When iteration is used to recognize a BinaryExp, the parsing process can construct a left-associative tree since the tree is constructed left-to-right (each new Binop adds a new root to the AST). This convention make the syntax diagram formulation of our Jam grammar deterministic and unambiguous.
The file JamSyntaxDiagrams file in Assignment 1 folder contains iterative syntax diagrams corresponding to the original left associative version of the Jam language. These diagrams precisely describe the syntax of Jam and how we want the parser to behave. Of course, this language eliminates a few quirky inputs from the language based on our original grammar. Your parser should treat these quirky inputs as erroneous.
It is possible to write syntax diagrams that treat binary expressions iteratively and still allow arbitrary expressions at the end of the binary expression. But the corresponding grammar and syntax diagrams are much more complex. So we have not followed this design choice in creating the revised syntax for Jam. In fact, we view our Jam grammar as a case study showing that pure context-free grammars are not quite the right formalism for expressing easily parsed program syntax. Syntax diagrams where iteration is always left-associative (equivalent to extended context-free-grammars where iteration always expands into left-associative parse trees) are much better. I am not aware of a rigorous exposition of such extended BNF grammars in the literature; I suspect that most theoretical computer scientists would not find such an exposition very interesting even though the standard notion of parsing language specified by a context-free grammar is an imperfect formalization of practical parsing.
Write your parser to produce left-associative ASTs for arithmetic expressions. If your parser implements either of the "revised" syntax diagram definitions given above (which are equivalent) using while loops for iteration and constructs the corresponding abstract syntax trees in a functional style (no mutation of trees in the construction process), your parser will naturally build syntax trees that are left-associative.
Testing Your Program
The provided code includes some JUnit4 test classes which are not comprehensive. You need to add more test cases; you need to check every production in the grammar. If a production involves iteration or recursion, you need to test the base cases as well as simple inductive constructions.
Each method in your program should include a short comment stating precisely what it does. For routines that parse a particular form of program phrase, the comment should provide the grammar rules describing that form.
Implementation Hints
The toString() methods for each AST class effectively provide an unparser for ASTs. You can directly compare test input strings with output ASTs by unparsing the ASTs (using toString()). The two strings must match exactly; differences in whitespace, new lines, and parentheses used for grouping may force you to slightly reformat your input strings to match the conventions embedded in the implementation of toString().
--- *COMP 411: Principles of Programming Languages, Rice University*
BSCS Bench