Title: JavaCC
1JavaCC
2What is a parser generator
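[Figure: the character stream "T o t a l = p r i c e + t a x" enters the Scanner, which emits the tokens Total, =, price, +, tax. The Parser turns the tokens into a tree: an assignment whose target is Total and whose Expr is id + id (price, tax). A parser generator (JavaCC) produces the scanner and the parser from a lexical + grammar specification.]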
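The scanner stage in the figure can be sketched by hand. This is a minimal illustration, not the code JavaCC generates; the class name TinyScanner and the input string are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; JavaCC would generate equivalent code
// from a lexical specification.
class TinyScanner {
    // Splits the input into identifier tokens and single-character symbol tokens.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                    // skip blanks
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));  // identifier lexeme
            } else {
                tokens.add(String.valueOf(c));          // operator or punctuation
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("Total = price + tax"));
    }
}
```

The parser then works on this token list rather than on individual characters.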
3JavaCC
- JavaCC (Java Compiler Compiler) is a scanner and parser generator; it is unusual in this regard.
  - It produces a scanner and/or a parser written in Java, and is itself written in Java.
- There are many parser generators.
  - yacc (Yet Another Compiler-Compiler) for the C programming language (see Dragon book, chapter 4.9)
  - Bison from gnu.org
- There are also many parser generators written in Java.
  - JavaCUP (we'll look at this one later)
  - ANTLR
  - SableCC
4More on classification of java parser generators
- Bottom-up parser generator tools
  - JavaCUP
  - jay, YACC for Java: www.inf.uos.de/bernd/jay
  - SableCC, the Sable Compiler Compiler: www.sablecc.org
- Top-down parser generator tools
  - ANTLR, Another Tool for Language Recognition: www.antlr.org
  - JavaCC, Java Compiler Compiler: www.webgain.com/java_cc
5Features of JavaCC
- Top-down LL(k) parser generator
- Lexical and grammar specifications in one file
- Tree-building preprocessor
  - with JJTree
- Extremely customizable
  - many different options selectable
- Document generation
  - by using JJDoc
- Internationalized
  - can handle full Unicode
- Syntactic and semantic lookahead
6Features of JavaCC (contd)
- Permits extended BNF specifications
  - can use | * + ? () on the right-hand side
- Lexical states and lexical actions
- Case-insensitive lexical analysis
- Extensive debugging capability
- Special tokens
- Very good error reporting
7JavaCC Installation
- Download the file javacc-3.X.zip from https://javacc.dev.java.net/
  - Follow the link that says Download, or go directly to https://javacc.dev.java.net/servlets/ProjectDocumentList
- Unzip javacc-3.X.zip to a directory JCC_HOME.
- Add the JCC_HOME\bin directory to your path.
  - javacc, jjtree, and jjdoc may now be invoked directly from the command line.
8Steps to use JavaCC
- Write a JavaCC specification (.jj file)
  - Defines the grammar and actions in a file (say, calc.jj)
- Run JavaCC to generate a scanner and a parser
  - javacc calc.jj
  - Will generate the parser, scanner, token, ... Java sources
- Write your program that uses the parser
  - For example, UseParser.java
- Compile and run your program
  - javac -classpath . *.java
  - java -cp . mainpackage.MainClass
9Example 1
Parse a specification of a regular expression and match it against input strings.
- Grammar: re.jj
- Example input
  - (line 1, a comment) all strings ending in "ab"
  - (a|b)*ab;
  - aba;
  - ababb;
- Our task
  - For each input string (lines 3, 4), determine whether it matches the regular expression (line 2).
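As a cross-check of what the answers should be, the same test can be sketched with java.util.regex instead of the NFA/DFA construction that re.jj drives. This swaps in a different technique purely for illustration; the class name EndsInAb and the extra test string "abab" are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks strings against (a|b)*ab using the JDK regex engine.
class EndsInAb {
    // (a|b)*ab : any string over {a, b} that ends in "ab"
    static final Pattern RE = Pattern.compile("(a|b)*ab");

    static boolean matches(String input) {
        return RE.matcher(input).matches();   // whole-string match, not a substring search
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aba", "ababb", "abab"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

Neither "aba" nor "ababb" ends in "ab", so both are rejected; "abab" is accepted.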
10the overall picture
[Figure: the overall picture. Labels recovered from the slide: comment, (a|b)*ab, a, ab.]
11Format of a JavaCC input Grammar
    javacc_options
    "PARSER_BEGIN" "(" <IDENTIFIER>1 ")"
    java_compilation_unit
    "PARSER_END" "(" <IDENTIFIER>2 ")"
    ( production )*
12the input spec file (re.jj)
    options {
      USER_TOKEN_MANAGER=false;
      BUILD_TOKEN_MANAGER=true;
      OUTPUT_DIRECTORY="./reparser";
      STATIC=false;
    }
13re.jj
    PARSER_BEGIN(REParser)
    package reparser;
    import java.lang.*;
    import dfa.*;
    public class REParser {
      public FA tg = new FA();
      // output error message with current line number
      public static void msg(String s) {
        System.out.println("ERROR" + s);
      }
      public static void main(String[] args) throws Exception {
        REParser reparser = new REParser(System.in);
        reparser.S();
      }
    }
    PARSER_END(REParser)
14re.jj (Token definition)
    TOKEN : {
        <SYMBOL: ["0"-"9","a"-"z","A"-"Z"]>
      | <EPSILON: "epsilon">
      | <LPAREN: "(">
      | <RPAREN: ")">
      | <OR: "|">
      | <STAR: "*">
      | <SEMI: ";">
    }
    SKIP : {
        < ( [" ","\t","\n","\r","\f"] )+ >
        // comment-start character was lost from the slide; "%" is assumed here
      | < "%" ( ~["\n"] )* "\n" > { System.out.println(image); }
    }
15re.jj (productions)
    void S() : { FA d1; }
    {
      d1 = R() <SEMI>
      { tg = d1; System.out.println("------NFA"); tg.print();
        System.out.println("------DFA");
        tg = tg.NFAtoDFA(); tg.print();
        System.out.println("------Minimize");
        tg = tg.minimize(); tg.print();
        System.out.println("------Renumber");
        tg = tg.renumber(); tg.print();
        System.out.println("------Execute");
      }
      testCases()
    }
16re.jj
    void testCases() : {}
    { ( testCase() )+ }

    void testCase() : { String testInput; }
    {
      testInput = symbols()
      <SEMI>
      { tg.execute( testInput ); }
    }

    String symbols() :
    { Token token = null; StringBuffer result = new StringBuffer(); }
    {
      (
        token = <SYMBOL>
        { result.append( token.image ); }
      )*
      { return result.toString(); }
    }
17re.jj (regular expression)
    // R --> RUnit | RConcat | RChoice
    FA R() : { FA result; }
    { result = RChoice() { return result; } }

    FA RUnit() : { FA result; Token d1; }
    {
      (
          <LPAREN> result = RChoice() <RPAREN>
        | <EPSILON> { result = tg.epsilon(); }
        | d1 = <SYMBOL> { result = tg.symbol( d1.image ); }
      )
      { return result; }
    }
18re.jj
    FA RChoice() : { FA result, temp; }
    {
      result = RConcat()
      ( <OR> temp = RConcat() { result = result.choice( temp ); } )*
      { return result; }
    }

    FA RConcat() : { FA result, temp; }
    {
      result = RStar()
      ( temp = RStar() { result = result.concat( temp ); } )*
      { return result; }
    }

    FA RStar() : { FA result; }
    {
      result = RUnit()
      ( <STAR> { result = result.closure(); } )*
      { return result; }
    }
19Format of a JavaCC input Grammar
    javacc_input ::= javacc_options
                     "PARSER_BEGIN" "(" <IDENTIFIER>1 ")"
                     java_compilation_unit
                     "PARSER_END" "(" <IDENTIFIER>2 ")"
                     ( production )*
                     <EOF>
- color usage
  - blue: nonterminal
  - <orange>: a token type
  - purple: token lexeme (reserved word, i.e., consisting of the literal itself)
  - black: meta symbols
20Notes
- <IDENTIFIER> means any Java identifier, like var, class2, ...
  - IDENTIFIER means IDENTIFIER only.
- <IDENTIFIER>1 must equal <IDENTIFIER>2.
- java_compilation_unit is any Java code that as a whole can appear legally in a file.
  - It must contain a main class declaration with the same name as <IDENTIFIER>1.
- Ex:
    PARSER_BEGIN ( MyParser )
    package mypackage;
    import myotherpackage.*;
    public class MyParser { ... }
    class MyOtherUsefulClass { ... }
    PARSER_END ( MyParser )
21The input and output of javacc
[Figure: the input file MyLangSpec.jj (containing the PARSER_BEGIN ( MyParser ) ... PARSER_END ( MyParser ) section shown on the previous slide) is fed to javacc, which outputs Token.java, ParseError.java, MyParser.java, MyParserTokenManager.java, and MyParserConstants.java.]
22Notes
- Token.java and ParseError.java are the same for all inputs and can be reused.
- The package declaration in the .jj file is copied to all 3 outputs.
- The import declarations in the .jj file are copied to the parser and token manager files.
- The parser file is assigned the file name <IDENTIFIER>1.java.
- The parser file has the contents:
    ...
    class MyParser { ...
      // generated parser is inserted here.
    ... }
- The generated token manager provides one public method:
    Token getNextToken() throws ParseError;
23Lexical Specification with JavaCC
24javacc options
    javacc_options ::= [ "options" "{" ( option_binding )* "}" ]
- An option_binding is of the form
    <IDENTIFIER>3 "=" <java_literal> ";"
  - where <IDENTIFIER>3 is not case-sensitive.
- Ex:
    options {
      USER_TOKEN_MANAGER=true;
      BUILD_TOKEN_MANAGER=false;
      OUTPUT_DIRECTORY="./sax2jcc/personnel";
      STATIC=false;
    }
25More Options
Each option is followed by its default value in parentheses.
- LOOKAHEAD
  - java_integer_literal (1)
- CHOICE_AMBIGUITY_CHECK
  - java_integer_literal (2), for A | B | ... | C
- OTHER_AMBIGUITY_CHECK
  - java_integer_literal (1), for (A)*, (A)+ and (A)?
- STATIC (true)
- DEBUG_PARSER (false)
- DEBUG_LOOKAHEAD (false)
- DEBUG_TOKEN_MANAGER (false)
- OPTIMIZE_TOKEN_MANAGER
  - java_boolean_literal (false)
- OUTPUT_DIRECTORY (current directory)
- ERROR_REPORTING (true)
26More Options
- JAVA_UNICODE_ESCAPE (false)
  - replace an escape such as \u2245 with the actual Unicode character (6 chars -> 1 char)
- UNICODE_INPUT (false)
  - the input stream is in Unicode form
- IGNORE_CASE (false)
- USER_TOKEN_MANAGER (false)
  - generate a TokenManager interface for the user's own scanner
- USER_CHAR_STREAM (false)
  - generate a CharStream.java interface for the user's own input stream
- BUILD_PARSER (true)
  - java_boolean_literal
- BUILD_TOKEN_MANAGER (true)
- SANITY_CHECK (true)
- FORCE_LA_CHECK (false)
- COMMON_TOKEN_ACTION (false)
  - invoke void CommonTokenAction(Token t) after every getNextToken()
- CACHE_TOKENS (false)
27Example Figure 2.2
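    Regular expression                          Token
    if                                          IF
    [a-z][a-z0-9]*                              ID
    [0-9]+                                      NUM
    ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)       REAL
    ("--"[a-z]*"\n") | (" " | "\n" | "\t")+     nonToken, WS
    .                                           error
- javacc notations:
  - "if" or "i" "f" or ["i"]["f"]
  - ["a"-"z"] ( ["a"-"z","0"-"9"] )*
  - ( ["0"-"9"] )+
  - ( ["0"-"9"] )+ "." ( ["0"-"9"] )*
  - ( ["0"-"9"] )* "." ( ["0"-"9"] )+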
28JavaCC Spec for Some Tokens
    PARSER_BEGIN(MyParser) class MyParser {} PARSER_END(MyParser)
    /* For the regular expression on the right, the token on the left will be returned */
    TOKEN : {
        < IF: "if" >
      | < #DIGIT: ["0"-"9"] >
      | < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
      | < NUM: ( <DIGIT> )+ >
      | < REAL: ( ( <DIGIT> )+ "." ( <DIGIT> )* )
              | ( ( <DIGIT> )* "." ( <DIGIT> )+ ) >
    }
29Continued
    /* The regular expressions here will be skipped during lexical analysis */
    SKIP : { < " " > | < "\t" > | < "\n" > }
    /* like SKIP, but the skipped text is accessible from parser actions */
    SPECIAL_TOKEN : {
      < "--" ( ["a"-"z"] )* ( "\n" | "\r" | "\n\r" ) >
    }
    /* . For any substring not matching the lexical spec, javacc will throw an error */
    /* main rule */
    void start() : {}
    { ( <IF> | <ID> | <NUM> | <REAL> )* }
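To see which token kind a given lexeme falls under, the four TOKEN definitions can be mimicked with java.util.regex. This is only a sketch: it classifies one complete lexeme at a time rather than scanning a stream, and the class name TokenKinds is an assumption. Testing "if" before the ID pattern mirrors the rule that, for matches of equal length, the earlier definition wins.

```java
import java.util.regex.Pattern;

// Hypothetical classifier for single lexemes; a real token manager scans a
// character stream and applies the longest-match rule.
class TokenKinds {
    static String classify(String lexeme) {
        if (lexeme.equals("if")) return "IF";                          // listed first, so it beats ID
        if (Pattern.matches("[a-z][a-z0-9]*", lexeme)) return "ID";
        if (Pattern.matches("[0-9]+", lexeme)) return "NUM";
        if (Pattern.matches("([0-9]+\\.[0-9]*)|([0-9]*\\.[0-9]+)", lexeme)) return "REAL";
        return "error";                                                 // no definition matches
    }

    public static void main(String[] args) {
        for (String s : new String[] {"if", "if8", "42", "3.14", ".5", "."}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

A lone "." is an error because REAL requires at least one digit on one side of the point.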
30Grammar Specification with JavaCC
31The Form of a Production
    java_return_type java_identifier "(" java_parameter_list ")" ":"
    java_block
    "{" expansion_choices "}"
- Ex:
    void XMLDocument(Logger logger) : { int msg = 0; }
    {
      <StartDoc> { print(token); }
      Element(logger)
      <EndDoc> { print(token); }
      | else()
    }
32Example ( Grammar 3.30 )
    1. P -> L
    2. S -> id := id
    3. S -> while id do S
    4. S -> begin L end
    5. S -> if id then S
    6. S -> if id then S else S
    7. L -> S
    8. L -> L ; S
- Rules 1, 7, 8 combine into: P -> S ( ; S )*
33JavaCC Version of Grammar 3.30
    PARSER_BEGIN(MyParser)
    public class MyParser {}
    PARSER_END(MyParser)
    SKIP : { " " | "\t" | "\n" }
    TOKEN : {
        <WHILE: "while"> | <BEGIN: "begin"> | <END: "end">
      | <DO: "do"> | <IF: "if"> | <THEN: "then">
      | <ELSE: "else"> | <SEMI: ";"> | <ASSIGN: ":=">
      | <#LETTER: ["a"-"z"]>
      | <ID: <LETTER> ( <LETTER> | ["0"-"9"] )* >
    }
34JavaCC Version of Grammar 3.30 (contd)
    void Prog() : {} { StmList() <EOF> }
    void StmList() : {}
    { Stm() ( <SEMI> Stm() )* }
    void Stm() : {}
    {
        <ID> <ASSIGN> <ID>
      | "while" <ID> "do" Stm()
      | <BEGIN> StmList() <END>
      | "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
    }
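The productions above correspond closely to a recursive-descent recognizer with one method per nonterminal. The following hand-written sketch is not the code JavaCC emits; it assumes tokens are separated by blanks, that ":=" is the assignment symbol, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-written recognizer for Grammar 3.30.
class Grammar330 {
    private static final List<String> KEYWORDS =
        Arrays.asList("while", "do", "begin", "end", "if", "then", "else");
    private final List<String> toks;
    private int pos = 0;

    Grammar330(String input) {
        toks = Arrays.asList(input.trim().split("\\s+"));
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    private void expect(String t) {
        if (!peek().equals(t)) throw new IllegalStateException("expected " + t + " but saw " + peek());
        pos++;
    }

    private void id() {
        String t = peek();
        if (!t.matches("[a-z][a-z0-9]*") || KEYWORDS.contains(t))
            throw new IllegalStateException("expected id but saw " + t);
        pos++;
    }

    // Prog -> StmList <EOF>
    void prog() {
        stmList();
        if (pos != toks.size()) throw new IllegalStateException("unexpected " + peek());
    }

    // StmList -> Stm ( ";" Stm )*
    void stmList() {
        stm();
        while (peek().equals(";")) { pos++; stm(); }
    }

    void stm() {
        switch (peek()) {
            case "while": pos++; id(); expect("do"); stm(); break;
            case "begin": pos++; stmList(); expect("end"); break;
            case "if":
                pos++; id(); expect("then"); stm();
                // One token of lookahead: an "else" binds to the nearest "if".
                if (peek().equals("else")) { pos++; stm(); }
                break;
            default: id(); expect(":="); id();
        }
    }

    static boolean accepts(String input) {
        try { new Grammar330(input).prog(); return true; }
        catch (IllegalStateException e) { return false; }
    }
}
```

For example, Grammar330.accepts("begin a := b ; while c do d := e end") returns true, while Grammar330.accepts("while do a := b") returns false because "do" is not an identifier.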
35Types of productions
    production ::= javacode_production
                 | regular_expr_production
                 | bnf_production
                 | token_manager_decls
- Note
  - 1 and 3 are used to define the grammar.
  - 2 is used to define tokens.
  - 4 is used to embed code into the token manager.
36JAVACODE production
    javacode_production ::= "JAVACODE"
                            java_return_type java_identifier "(" java_parameter_list ")"
                            java_block
- Note
  - Used to define nonterminals for recognizing something that is hard to parse using normal productions.
37Example JAVACODE
    JAVACODE void skip_to_matching_brace() {
      Token tok;
      int nesting = 1;
      while (true) {
        tok = getToken(1);
        if (tok.kind == LBRACE) nesting++;
        if (tok.kind == RBRACE) {
          nesting--;
          if (nesting == 0) break;
        }
        tok = getNextToken();
      }
    }
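The nesting counter can be exercised outside a generated parser by running the same loop over a plain list of token strings. This is a sketch under assumptions: the class name BraceSkipper is invented, tokens are plain strings instead of Token objects, and the opening brace is taken to be already consumed, as the initial nesting of 1 implies.

```java
import java.util.List;

// Hypothetical stand-alone version of the brace-skipping loop.
class BraceSkipper {
    // Returns the index of the "}" that closes the block whose "{" was already
    // consumed. As in the JAVACODE production, the matching brace itself is
    // peeked at but not consumed. Throws IndexOutOfBoundsException if no
    // matching brace exists.
    static int skipToMatchingBrace(List<String> tokens, int pos) {
        int nesting = 1;
        while (true) {
            String tok = tokens.get(pos);      // like getToken(1): look at the next token
            if (tok.equals("{")) nesting++;
            if (tok.equals("}")) {
                nesting--;
                if (nesting == 0) break;       // matching brace found
            }
            pos++;                             // like getNextToken(): consume the token
        }
        return pos;
    }
}
```

For the token list a { b } c } d, starting at index 0, the inner pair is skipped and the method returns 5, the index of the second closing brace.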
38Note
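- Do not use a nonterminal defined by JAVACODE at a choice point without giving a LOOKAHEAD.
    // problematic: the parser cannot decide when to choose the JAVACODE production
    void NT() : {}
    {
        skip_to_matching_brace()
      | some_other_production()
    }

    // fine: the JAVACODE production is not at the choice point
    void NT() : {}
    {
        "{" skip_to_matching_brace()
      | "(" parameter_list() ")"
    }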
39TOKEN_MANAGER_DECLS
    token_manager_decls ::= "TOKEN_MGR_DECLS" ":" java_block
- The token manager declarations start with the reserved word "TOKEN_MGR_DECLS", followed by a ":" and then a set of Java declarations and statements (the Java block).
- These declarations and statements are written into the generated token manager (MyParserTokenManager.java) and are accessible from within lexical actions.
- There can only be one token manager declaration in a JavaCC grammar file.
40regular_expression_production
    regular_expr_production ::=
        [ lexical_state_list ]
        regexpr_kind [ "[" "IGNORE_CASE" "]" ] ":"
        "{" regexpr_spec ( "|" regexpr_spec )* "}"
    regexpr_kind ::= "TOKEN" | "SPECIAL_TOKEN" | "SKIP" | "MORE"
- TOKEN is used to define normal tokens.
- SKIP is used to define skipped tokens (not passed on to the parser).
- MORE is used to define semi-tokens (i.e., only part of a token).
- SPECIAL_TOKEN lies between TOKEN and SKIP: it is passed on to the parser and is accessible to parser actions, but is ignored by production rules (not counted as a token). Useful for representing comments.
41lexical_state_list
    lexical_state_list ::=
        "<" "*" ">"
      | "<" java_identifier ( "," java_identifier )* ">"
- The lexical state list describes the set of lexical states for which the corresponding regular expression production applies.
- If this is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angle brackets.
- If omitted, the DEFAULT lexical state is assumed.
42regexpr_spec
    regexpr_spec ::=
        regular_expression1 [ java_block ] [ ":" java_identifier ]
- Meaning
  - When regular_expression1 is matched, then
    - if java_block exists, execute it;
    - if java_identifier appears, transition to that lexical state.
43regular_expression
    regular_expression ::=
        java_string_literal
      | "<" [ [ "#" ] java_identifier ":" ] complex_regular_expression_choices ">"
      | "<" java_identifier ">"
      | "<" "EOF" ">"
- <EOF> is matched by the end-of-file character only.
- (3) <java_identifier> is a reference to another labeled regular_expression.
  - used in bnf_production
- java_string_literal is matched only by the string denoted by itself.
- (2) is used to define a labeled regular expression; if "#" occurs, the label is not visible outside the current TOKEN section.
- (1) is for unnamed tokens.
44Example
    <DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
      < FLOATING_POINT_LITERAL:
          (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])?
        | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])?
        | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
        | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"] >
        {
          // do Something
        } : LEX_ST1
    | < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
    }
- Note: if "#" is omitted, E123 will be recognized erroneously as a token of kind EXPONENT.
45Structure of complex_regular_expression
    complex_regular_expression_choices ::=
        complex_regular_expression ( "|" complex_regular_expression )*
    complex_regular_expression ::=
        ( complex_regular_expression_unit )*
    complex_regular_expression_unit ::=
        java_string_literal
      | "<" java_identifier ">"
      | character_list
      | "(" complex_regular_expression_choices ")" [ "+" | "*" | "?" ]
- Note
  - unit --(concatenation, i.e. juxtaposition)--> complex_regular_expression --(choice, |)--> complex_regular_expression_choices --( (.) with optional +, *, ? )--> unit
46character_list
    character_list ::=
        [ "~" ] "[" [ character_descriptor ( "," character_descriptor )* ] "]"
    character_descriptor ::=
        java_string_literal [ "-" java_string_literal ]
    java_string_literal ::= // reference to Java grammar
        "singleCharString"
- Note: java_string_literal here is restricted to length 1.
- Ex:
  - ~["a","b"]: all chars but a and b.
  - ["a"-"f", "0"-"9", "A","B","C","D","E","F"]: a hexadecimal digit.
  - ["a","b"]+ is not a regular_expression_unit. Why?
    - It should be written ( ["a","b"] )+ instead.
47bnf_production
    bnf_production ::=
        java_return_type java_identifier "(" java_parameter_list ")" ":"
        java_block
        "{" expansion_choices "}"
    expansion_choices ::= expansion ( "|" expansion )*
    expansion ::= ( expansion_unit )*
48expansion_unit
    expansion_unit ::=
        local_lookahead
      | java_block
      | "(" expansion_choices ")" [ "+" | "*" | "?" ]
      | "[" expansion_choices "]"
      | [ java_assignment_lhs "=" ] regular_expression
      | [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")"
- Notes
  - 1 is for lookahead; 2 is for semantic actions.
  - 4 is equivalent to ( ... )?
  - 5 is for a token match.
  - 6 is for a match of another nonterminal.
49lookahead
    local_lookahead ::= "LOOKAHEAD" "("
        [ java_integer_literal ] [ "," ]
        [ expansion_choices ] [ "," ]
        [ "{" java_expression "}" ] ")"
- Notes
  - 3 components: maximum lookahead, syntactic lookahead, semantic lookahead
  - examples
    - LOOKAHEAD(3)
    - LOOKAHEAD(5, Expr() <INT> | <REAL>, { true })
- More on LOOKAHEAD
  - see the minitutorial
50JavaCC API
- Nonterminals in the input grammar
  - If NT is a nonterminal, then the method
      returntype NT(parameters) throws ParseError;
    is generated in the parser class.
- API for parser actions
  - Token token;
    - This variable always holds the last token and can be used in parser actions.
    - It is exactly the same as the token returned by getToken(0).
  - Two other methods, getToken(int i) and getNextToken(), can also be used in actions to traverse the token list.
51Token class
- public int kind;
  - 0 for <EOF>
- public int beginLine, beginColumn, endLine, endColumn;
- public String image;
- public Token next;
- public Token specialToken;
- public String toString()
  - { return image; }
- public static final Token newToken(int ofKind)
52Error reporting and recovery
- It is not user-friendly to throw an exception and stop parsing on the first syntax error.
- Two exceptions
  - ParseException: can be recovered from
  - TokenMgrError: not expected to be recovered from
- Error reporting
  - modify ParseException.java or TokenMgrError.java
  - the generateParseException method can always be invoked in a parser action to report an error
53Error Recovery in JavaCC
- Two kinds: shallow error recovery and deep error recovery
- Shallow error recovery
  - Ex:
      void Stm() : {}
      {
          IfStm()
        | WhileStm()
      }
  - If getToken(1) is neither "if" nor "while", this is a shallow error.
54Shallow recovery
- A shallow error can be recovered from by adding a choice:
    void Stm() : {}
    {
        IfStm()
      | WhileStm()
      | error_skipto(SEMICOLON)
    }
- where
    JAVACODE
    void error_skipto(int kind) {
      ParseException e = generateParseException(); // generate the exception object.
      System.out.println(e.toString());            // print the error message
      Token t;
      do {
        t = getNextToken();
      } while (t.kind != kind);
    }
55Deep Error Recovery
- Same example: void Stm() : {} { IfStm() | WhileStm() }
- But this time the error occurs during parsing inside IfStm() or WhileStm(), instead of at the lookahead entry.
- The approach: use the Java try-catch construct.
    void Stm() : {}
    {
      try {
        ( IfStm() | WhileStm() )
      } catch (ParseException e) {
        error_skipto(SEMICOLON);
      }
    }
- Note the new syntax for a javacc bnf_production.
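The try-catch pattern can be tried out in a hand-written parser. The sketch below is only an illustration under assumptions: the toy statement form x = y ;, the class RecoveringParser, and its nested ParseError exception are all invented here and stand in for the generated parser and ParseException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser showing deep error recovery: catch the error raised
// inside a statement, skip to the next ";" and continue with the next statement.
class RecoveringParser {
    static class ParseError extends Exception {
        ParseError(String msg) { super(msg); }
    }

    private final List<String> toks;
    private int pos = 0;
    final List<String> errors = new ArrayList<>();

    RecoveringParser(String input) {
        toks = Arrays.asList(input.trim().split("\\s+"));
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    private void expect(String t) throws ParseError {
        if (!peek().equals(t)) throw new ParseError("expected " + t + " but saw " + peek());
        pos++;
    }

    // Stm -> "x" "=" "y" ";"   (toy statement form, chosen only for this sketch)
    private void stm() throws ParseError {
        expect("x"); expect("="); expect("y"); expect(";");
    }

    // Consumes tokens up to and including the next token equal to kind.
    private void errorSkipTo(String kind) {
        while (pos < toks.size()) {
            String t = toks.get(pos++);
            if (t.equals(kind)) break;
        }
    }

    // Parses statements until the input is exhausted; returns how many parsed cleanly.
    int parseAll() {
        int ok = 0;
        while (pos < toks.size()) {
            try {
                stm();
                ok++;
            } catch (ParseError e) {
                errors.add(e.getMessage());   // report, then resynchronize
                errorSkipTo(";");
            }
        }
        return ok;
    }
}
```

With the input x = y ; x y ; x = y ; the second statement fails inside stm(), one error is recorded, and parsing resumes so that the third statement is still accepted.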
56More Examples
- There are plenty of examples on the net:
  - http://www.vorlesungen.uni-osnabrueck.de/informatik/compilerbau98/code/JavaCC/examples/
- JavaCC Grammar Repository
  - http://www.cobase.cs.ucla.edu/pub/javacc/
57References
- http://xml.cs.nccu.edu.tw/courses/compiler/cp2003Fall/slides/javaCC.ppt
- Compilers: Principles, Techniques, and Tools, Aho, Sethi, and Ullman