CSC 272
Homework 7
Assigned Apr 18, due Apr 30

Parts of this assignment are recycled from last semester's CSC 271 course, but I don't think any of you actually did that particular assignment, so let's give it another try.

    Exercises in flex

  1. Download, compile, and run the Lex examples from last semester. (Nothing to turn in, but make sure you understand how these programs work and how to compile them before going on.)

  2. Modify the Roman numerals example as follows:

    In addition to IV, IX, XL, XC, CD, and CM, it should also be possible to use IL, IC, ID, IM, XD, and XM with the obvious meanings. (This should be easy.)

  3. Modify the Roman numerals example as follows (this doesn't depend on the previous problem, but it's more difficult so you should probably do the previous problem first):

    Make the scanner reject (by returning a special value like -1) anything with parts out of order, or too many of the same thing in a row. For example, in standard (modern) Roman numerals, IIII is illegal (it should be written IV instead). IIC is illegal (the hundreds should come before the ones). IVIV is illegal (you can't use IV twice; this should instead be written VIII). IXI is illegal (if you're using IX, you shouldn't also be using I later on; this should be written X instead). And so on.

  4. Write, in flex, a filter that eliminates vowels (a, e, i, o, u) and passes everything else through unchanged.

  5. Write, in flex, a filter that counts how many occurrences there are of each vowel and prints out the results. It shouldn't actually print anything from the input itself.

  6. Write, in flex, a simple scanner for our AE language (a whole lot easier than a scanner for a Java-like language!): it should distinguishes among integer literals, identifiers, left curly-braces, and right curly-braces. An integer literal comprises a sequence of one or more digits, optionally preceded by a - or + sign; its numeric value should be stored with the token. An identifier comprises a sequence of one or more letters and other punctuation marks (but not curly-braces); its name should be stored with the token. One or more spaces indicate that you've gotten to the end of an integer literal or an identifier, but should otherwise be ignored. A semicolon and anything after it to the next newline character should be treated as white space.

    The yylex function should return a 1, 2, 3, or 4 to indicate “integer literal”, “identifier”, “left curly-brace”, or “right curly-brace”; for integer literals and identifiers, the extra information should be stored in a global variable yylval, declared in a shared header file like the following:

    #ifndef __YYS_HEADER__
    #define __YYS_HEADER__
    typedef union YYSTYPE
    {	int number;
    	char *str;
    };
    extern YYSTYPE yylval;
    #endif
    
    Your main program which tests the scanner should include this header file, declare yylval (without an extern, so memory is actually allocated for it) and examine the value of yylval after each call to yylex and make sure it's what it's supposed to be.

    Note that following the above instructions should produce a scanner which can be used easily by a parser written in bison. And writing such a parser wouldn't be particularly difficult, except... what should it return? A parse tree, naturally, and we don't have such a type already defined in C++, so we would have to define one, which I think would take longer than the time remaining in this course. Maybe we'll do that in Data Structures next fall....

  7. Exercises in bison

  8. Write, in bison, a filter that reads an HTML or XML file and tells whether the tags are legally nested. For example,

    <em>This is emphasized <font size="big">big</font> text</em>
    is correctly nested, but
    <em>This is emphasized <font size="big">big</em> text</font>
    is not, because the outer tag ends before the inner one, and
    <em>This is emphasized <font size="big">big</em> text
    is not, because one of the tags isn't closed at all, and
    <em>This is emphasized big</em> text</font>
    is not, because one of the tags is closed without ever being opened.

    It would be neat if you could handle any HTML or XML tag, but if you can handle two or three kinds (say, <font size="...">, <em>, and <em>), that's enough. Don't worry about what the tags or their parameters mean; your only concern is whether they're nested legally.

  9. Write, in bison, an interpreter for our AE language. It should use the scanner from problem 6 above, and include the same header file as above (which I've written to be compatible with bison). It should return the numeric value of an expression, or an error message if the expression isn't syntactically legal. Since we don't have “with” in the language, there will be no variables with values, so the only identifiers appearing anywhere should be +, -, *, and /, each appearing only just after a left-curly-brace; your bison program should decide which of these four identifiers it saw and compute the answer appropriately from the values of the operands.

    (You may get some useful ideas from ch301.y, one of the examples from the lex/yacc book.)


    Last modified: Wed Apr 18 11:36:01 EDT 2012
    Stephen Bloch / sbloch@adelphi.edu