Parts of this assignment are recycled from last semester's CSC 271 course, but I don't think any of you actually did that particular assignment, so let's give it another try.
Download, compile, and run the Lex examples from last semester. (Nothing to turn in, but make sure you understand how these programs work and how to compile them before going on.)
Modify the Roman numerals example as follows:
In addition to IV, IX, XL, XC, CD, and CM, it should also be possible to use IL, IC, ID, IM, XD, and XM with the obvious meanings. (This should be easy.)
Modify the Roman numerals example as follows (this doesn't depend on the previous problem, but it's more difficult so you should probably do the previous problem first):
Make the scanner reject (by returning a special value like -1) anything with parts out of order, or too many of the same thing in a row. For example, in standard (modern) Roman numerals, IIII is illegal (it should be written IV instead). IIC is illegal (the hundreds should come before the ones). IVIV is illegal (you can't use IV twice; this should instead be written VIII). IXI is illegal (if you're using IX, you shouldn't also be using I later on; this should be written X instead). And so on.
Write, in flex
, a filter that eliminates vowels (a,
e, i, o, u) and passes everything else through unchanged.
Write, in flex
, a filter that counts how many
occurrences there are of each vowel and prints out the results. It
shouldn't actually print anything from the input itself.
Write, in flex
, a simple scanner for our AE language
(a whole lot easier than a scanner for a Java-like language!): it
should distinguishes among integer literals, identifiers, left curly-braces, and right
curly-braces. An integer literal comprises a sequence of one or more digits,
optionally preceded by a - or + sign; its numeric value should be stored with the token.
An identifier comprises a sequence of one or more letters and other punctuation marks
(but not curly-braces); its name should be stored with the token. One or more spaces
indicate that you've gotten to the end of an integer literal or an identifier, but
should otherwise be ignored. A semicolon and anything after it to the next newline
character should be treated as white space.
The yylex
function should return a 1, 2, 3, or 4 to indicate “integer
literal”, “identifier”, “left curly-brace”, or “right curly-brace”; for integer
literals and identifiers, the extra information should be stored in a global variable
yylval
, declared in a shared header file like the following:
#ifndef __YYS_HEADER__ #define __YYS_HEADER__ typedef union YYSTYPE { int number; char *str; }; extern YYSTYPE yylval; #endifYour main program which tests the scanner should include this header file, declare
yylval
(without an extern
,
so memory is actually allocated for it) and examine the value of
yylval
after each call to yylex
and make
sure it's what it's supposed to be.
Note that following the above instructions should produce a scanner which
can be used easily by a parser written in bison
. And writing such
a parser wouldn't be particularly difficult, except... what should it return?
A parse tree, naturally, and we don't have such a type already defined in C++,
so we would have to define one, which I think would take longer than the time
remaining in this course. Maybe we'll do that in Data Structures next fall....
Write, in bison
, a filter that reads an HTML or XML file
and tells whether the tags are legally nested. For example,
<em>This is emphasized <font size="big">big</font> text</em>is correctly nested, but
<em>This is emphasized <font size="big">big</em> text</font>is not, because the outer tag ends before the inner one, and
<em>This is emphasized <font size="big">big</em> textis not, because one of the tags isn't closed at all, and
<em>This is emphasized big</em> text</font>is not, because one of the tags is closed without ever being opened.
It would be neat if you could handle any HTML or XML tag, but if you can handle two or three kinds (say, <font size="...">, <em>, and <em>), that's enough. Don't worry about what the tags or their parameters mean; your only concern is whether they're nested legally.
Write, in bison
, an interpreter for our AE language.
It should use the scanner from problem 6 above, and include the same header file as
above (which I've written to be compatible with bison
).
It should return the numeric value of an expression, or an error message
if the expression isn't syntactically legal. Since we don't have “with”
in the language, there will be no variables with values, so the only identifiers
appearing anywhere should be +, -, *, and /, each appearing only just after a
left-curly-brace; your bison
program should decide which of these
four identifiers it saw and compute the answer appropriately from the values
of the operands.
(You may get some useful ideas from ch301.y, one of the examples from the lex/yacc book.)