Scanning and Parsing - PowerPoint PPT Presentation

Provided by: MarkGu

1
Scanning and Parsing
  • CS2340

2
Scanning and Parsing in Squeak
  • Defining Lexical and Syntactic Analysis
  • Scanning/Tokenizing
  • Parsing
  • Easy ways to do it
  • State Transition Tables
  • Recursive Descent Parsing
  • More sophisticated way
  • T-Gen: Lex and YACC for Squeak
  • SmaCC
  • XML: SAX, DOM
  • SIXX
  • Examples from Squeak
  • Smalltalk parser
  • HTML parser

3
Challenge of Compiling
  • How do you go from source code to object code?
  • Lexical analysis: Figure out the pieces (tokens)
    of the program: constants, variables, keywords.
  • Syntactic analysis: Figure out (and check) the
    structure (declarations, statements, etc.); also
    called parsing
  • Interpret meaning (semantics) of recognized
    structures, based on declarations
  • Backend: Generate object code

4
Lexical Analysis
  • Given a bunch of characters, how do you recognize
    the key things and their types?
  • Simplest way: Parse by white space
  • 'This is
    a test
    with returns in it.' findTokens: (Character cr
    asString), (Character space asString).
  • OrderedCollection ('This' 'is' 'a' 'test' 'with'
    'returns' 'in' 'it.' )

5
But
  • 'if(x>y)
    x=y' findTokens: (Character cr
    asString), (Character space asString).
  • OrderedCollection ('if(x>y)' 'x=y')
  • Not what we want!
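The same contrast can be reproduced in Python (the slides use Squeak). This is a minimal sketch: `str.split()` stands in for whitespace tokenizing, and the code fragment is a hypothetical input chosen for illustration.

```python
# Whitespace splitting works for prose...
text = "This is\na test\nwith returns in it."
words = text.split()
print(words)

# ...but it cannot separate keywords, parentheses, and operators.
code = "if(x>y) x=y"  # hypothetical source fragment
pieces = code.split()
print(pieces)
```

The second result keeps `if(x>y)` as one piece, which is why a real scanner has to look at individual characters.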

6
Scanning: Doing It Right
  • Read in characters one-at-a-time
  • Recognize when an important token has arrived
  • Return the type and the value of the token

7
A Theoretical Tool for Scanning: FSAs
  • Finite State Automata (FSA)
  • One model of computation that can scan well
  • We can make them fast and efficient
  • FSAs are:
  • A collection of states
  • Arcs between states, labeled with input symbols

8
Example FSA
  • State 1 is start state
  • Incoming arrow
  • "Incomplete state": can't end there
  • State 2 is terminal or end state: can stop there,
    recognizing a token
  • Consume A's in 1, end with a B in 2
  • Valid: AB, AAB, AAAB
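A two-state machine like this one can be sketched in Python as a dictionary of arcs. The figure itself is not in the transcript, so the arcs below are an assumption: state 1 loops on A and moves to state 2 on B. Under that reading a lone B would be accepted as well. The names `TRANSITIONS` and `accepts` are hypothetical.

```python
# Arcs of the assumed FSA: (current state, input symbol) -> next state.
TRANSITIONS = {
    (1, "A"): 1,  # stay in the start state while consuming A's
    (1, "B"): 2,  # a B moves to the end state
}
START = 1
END_STATES = {2}

def accepts(text):
    """Return True if the FSA ends in an end state after reading text."""
    state = START
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no arc for this symbol: reject
            return False
    return state in END_STATES

for candidate in ["AB", "AAB", "AAAB", "AA", "ABA"]:
    print(candidate, accepts(candidate))
```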

9
General FSA Processing
  • Enter the Start state
  • Read input
  • Go to the state for that input
  • If an End state, can stop
  • But may not want to, since we must find the
    longest possible token (consider scanning an
    identifier)

10
Implementing FSAs
  • Easiest way: State Transition Tables
  • Read a character
  • Append character to VALUE
  • Using a table indexed by states and characters,
    find a new state to move to given current STATE
    and input CHARACTER
  • If end state and no more transitions possible,
    return VALUE and STATE
  • (Sometimes need to do a lookahead. Could I grab
    the next character and be in another end state?)
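The table-driven loop above might look like this in Python. It is a sketch under assumptions, not the course's scanner: the character classes, the state names (`START`, `IDENT`, `NUMBER`), and the rule that any other non-space character becomes a one-character `SYMBOL` token are all invented for illustration.

```python
def char_class(ch):
    """Map a character to the column used in the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Transition table indexed by (state, character class).
TABLE = {
    ("START", "letter"): "IDENT",
    ("START", "digit"): "NUMBER",
    ("IDENT", "letter"): "IDENT",
    ("IDENT", "digit"): "IDENT",
    ("NUMBER", "digit"): "NUMBER",
}
END_STATES = {"IDENT", "NUMBER"}

def scan(text):
    """Return a list of (STATE, VALUE) tokens, taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        state, value = "START", ""
        # Peek at the next character and keep going while a transition exists.
        while pos < len(text):
            next_state = TABLE.get((state, char_class(text[pos])))
            if next_state is None:
                break
            state = next_state
            value += text[pos]
            pos += 1
        if state in END_STATES:
            tokens.append((state, value))
        else:
            # Still in START: treat the character as a one-character symbol.
            tokens.append(("SYMBOL", text[pos]))
            pos += 1
    return tokens

print(scan("if(x>y) x=y"))
```

Stopping only when no further transition exists is what gives the longest possible token, so `count12` scans as one identifier rather than `count` followed by `12`.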

11
Syntactic Analysis
  • Given the tokens, can we recognize the language?
  • Parsing
  • Structure for describing relationship between
    tokens is called a grammar
  • A grammar describes how tokens can be assembled
    into an acceptable sentence in a language
  • We're going to study a kind called context-free
    grammars

12
Context-free grammars
  • Made up of a set of rules
  • Each rule consists of a left-hand side
    non-terminal which maps to a right-hand side
    expression
  • Expressions are made up of other non-terminals
    and terminals
  • Rules can be used as replacements
  • Either side can be replaced with the other

13
Example grammar
  • Expression ::= Factor + Expression
  • Expression ::= Factor
  • Factor ::= Term * Factor
  • Factor ::= Term
  • Term ::= Number
  • Term ::= Identifier (variable)
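Using rules as replacements can be illustrated with a short Python sketch. It assumes the operators missing from the extracted slide text are + and *, and it stores the grammar in a hypothetical `GRAMMAR` dictionary. Each step replaces the leftmost occurrence of a non-terminal with the right-hand side of one of its rules.

```python
# Each non-terminal maps to its list of right-hand sides.
GRAMMAR = {
    "Expression": [["Factor", "+", "Expression"], ["Factor"]],
    "Factor": [["Term", "*", "Factor"], ["Term"]],
    "Term": [["Number"], ["Identifier"]],
}

def replace_leftmost(sentence, nonterminal, rule_index):
    """Replace the leftmost occurrence of nonterminal using one of its rules."""
    i = sentence.index(nonterminal)
    return sentence[:i] + GRAMMAR[nonterminal][rule_index] + sentence[i + 1:]

steps = [["Expression"]]
for nonterminal, rule_index in [
    ("Expression", 0),  # Factor + Expression
    ("Factor", 1),      # Term + Expression
    ("Term", 0),        # Number + Expression
    ("Expression", 1),  # Number + Factor
    ("Factor", 0),      # Number + Term * Factor
    ("Term", 0),        # Number + Number * Factor
    ("Factor", 1),      # Number + Number * Term
    ("Term", 0),        # Number + Number * Number
]:
    steps.append(replace_leftmost(steps[-1], nonterminal, rule_index))

for step in steps:
    print(" ".join(step))
```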

14
Derivation tree using grammar for 3 * 4 + 5
15
Implementing Parsing
  • Simplest way: Recursive descent parsing
  • Each non-terminal maps to a method/function/procedure
    in the language
  • The m/f/p is responsible for recognizing the
    related non-terminal
  • Including calling another m/f/p as needed
  • Use your scanner to supply tokens
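The one-method-per-non-terminal idea can be sketched in Python (the slides use Squeak). This is a minimal illustration under assumptions: the grammar is Expression ::= Factor + Expression | Factor, Factor ::= Term * Factor | Term, Term ::= Number, with + and * assumed as the operators; tokens arrive pre-split on spaces; and `Scanner`, `Parser`, and `parse` are hypothetical names.

```python
class Scanner:
    """Toy scanner: hands out tokens from a pre-split list."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def peek(self):
        return self.tokens[0] if self.tokens else None

    def advance(self):
        self.tokens = self.tokens[1:]

class Parser:
    """One method per non-terminal; trace records what was recognized."""
    def __init__(self, scanner):
        self.scanner = scanner
        self.trace = []

    def expression(self):
        # Expression ::= Factor + Expression | Factor
        self.trace.append("Expression")
        self.factor()
        if self.scanner.peek() == "+":
            self.trace.append("+")
            self.scanner.advance()
            self.expression()

    def factor(self):
        # Factor ::= Term * Factor | Term
        self.trace.append("Factor")
        self.term()
        if self.scanner.peek() == "*":
            self.trace.append("*")
            self.scanner.advance()
            self.factor()

    def term(self):
        # Term ::= Number
        self.trace.append("Term")
        token = self.scanner.peek()
        if token is not None and token.isdigit():
            self.trace.append("Number " + token)
            self.scanner.advance()
        else:
            raise SyntaxError("Number expected")

def parse(text):
    parser = Parser(Scanner(text.split()))
    parser.expression()
    return parser.trace

print("\n".join(parse("3 * 4 + 5")))
```

The sketch does not check for leftover tokens after the expression; a fuller parser would report them as an error.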

16
A Simple Equation Recursive Descent Parser
  • Expression ::= Factor + Expression
  • Expression ::= Factor
  • expression
        Transcript show: 'Expression'; cr.
        self factor.
        (scanner peek = '+')
            ifTrue: [Transcript show: '+'; cr.
                scanner advance.
                self expression].

17
Factor and Term: Simple RD Parsing
  • Factor ::= Term * Factor
  • Factor ::= Term
  • Term ::= Number
  • factor
        Transcript show: 'Factor'; cr.
        self term.
        (scanner peek = '*') ifTrue: [
            Transcript show: '*'; cr.
            scanner advance.
            self factor].
  • term
        Transcript show: 'Term'; cr.
        (scanner nextIsNumber)
            ifTrue: [Transcript show: 'Number ', (scanner nextToken); cr]
            ifFalse: [Transcript show: 'Error -- Number expected'].

18
Simulating a Scanner
  • tokens: aCollection
        tokens := aCollection
  • peek
        ^tokens isEmpty
            ifTrue: [nil]
            ifFalse: [tokens first]
  • advance
        tokens := tokens allButFirst.

19
Simulating a Scanner
  • nextIsNumber
        "True when every character of the next token is a digit"
        ^(tokens first select: [:character |
            character asciiValue < $0 asciiValue or: [
            character asciiValue > $9 asciiValue]]) isEmpty
  • nextToken
        | token |
        token := self peek.
        token isNil ifFalse: [self advance].
        ^token.

20
Trying out the toy parser
  • eqn := EquationParser new.
  • eqnscan := EquationScanner new.
  • eqn scanner: eqnscan.
  • eqnscan tokens: ('3 * 4 + 5' findTokens:
    (Character space asString)).
  • eqn expression

21
Comparing to the earlier derivation tree: 3 * 4 + 5
  • Transcript
  • Expression
  • Factor
  • Term
  • Number 3
  • *
  • Factor
  • Term
  • Number 4
  • +
  • Expression
  • Factor
  • Term
  • Number 5

22
Derivation tree for 3 + 4 * 5
  • Transcript
  • Expression
  • Factor
  • Term
  • Number 3
  • +
  • Expression
  • Factor
  • Term
  • Number 4
  • *
  • Factor
  • Term
  • Number 5

Expression
    Factor
        Term
            Number 3
    +
    Expression
        Factor
            Term
                Number 4
            *
            Factor
                Term
                    Number 5
23
Scanning and Parsing in Squeak
  • Defining Lexical and Syntactic Analysis
  • Scanning/Tokenizing
  • Parsing
  • Easy ways to do it
  • State Transition Tables
  • Recursive Descent Parsing
  • More sophisticated way
  • T-Gen: Lex and YACC for Squeak
  • SmaCC
  • XML: SAX, DOM
  • SIXX
  • Examples from Squeak
  • Smalltalk parser
  • HTML parser

24
T-Gen: A Translator Generator for Squeak
25
Using T-Gen
  • Link to changeSet on co-web (Software page)
  • In Morphic, TGenUI open
  • Enter your tokens as regular expressions in
    upper-left
  • Enter your grammar in lower-left
  • Put in sample code in lower-right
  • Transcript for parsing is upper-right
  • Processing of each occurs as soon as you accept
    (Alt/Cmd-S)
  • From the transcript pane, you can inspect result
  • Buttons let you specify kind of parser and kind
    of result
  • You can install the resultant scanner and parser
    into your system

26
SmaCC
  • Smalltalk Compiler-Compiler: freely available
    parser generator
  • replacement for the T-Gen parser generator
  • overcomes T-Gen's limitations
  • can generate parsers for ambiguous grammars and
    grammars with overlapping tokens
  • smaller runtime than T-Gen
  • faster than T-Gen
  • Available via SqueakMap
  • Tutorial at http://www.refactory.com/Software/SmaCC/Tutorial.html

27
XML Vocabulary
  • XML: Extensible Markup Language
  • Designed to describe data and focus on what the
    data is
  • Vs. HTML: displays data and focuses on how data
    looks.
  • It doesn't do anything; it describes data via
    tags and values.
  • Tutorial: http://www.w3schools.com/xml/xml_whatis.asp

28
XML
  • Must have open/close tags
  • Must be properly nested
  • Always have a root element
  • Parsed document forms a tree structure
  • Can be commented
  • <!-- This is a comment -->
  • Is case sensitive: <Name> != <name>
  • Can have attributes: <person sex="male">
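These rules can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` parser, which rejects text that is not well-formed; `is_well_formed` is a hypothetical helper name and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    '<person sex="male"><Name>Bob</Name><!-- a comment --></person>',
    "<a><b></a></b>",        # not properly nested
    "<Name>Bob</name>",      # case mismatch between open and close tags
    "<a></a><b></b>",        # more than one root element
    "<person sex=male/>",    # attribute value not quoted
]
for sample in samples:
    print(is_well_formed(sample), sample)
```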

29
Sample XML Description
<CustomerList>
  <CompanyName>Extroon Incorporated</CompanyName>
  <CompanyPhone>770-555-1212</CompanyPhone>
  <customer>
    <name>Bob Waters</name>
    <id>126423</id>
    <addr>1313 MockingBird Lane</addr>
  </customer>
  <customer>
    <name>Sally Smith</name>
    <id>559382</id>
    <addr>1212 Sunnyvale Retirement Home</addr>
  </customer>
</CustomerList>
30
Well-Formed vs. Valid XML
  • Just because it is well-formed (syntactically
    correct) doesn't mean the data is correct
  • Need to specify what the data is supposed to look
    like for the information to be valid
  • Can use either Document Type Definition (DTD) or
    schemas
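The difference can be shown with a hand-written check in Python. Python's standard XML parsers test well-formedness only and do not validate against a DTD or schema, so the rule below (every customer holds name, id, addr in that order) is coded by hand as a stand-in for what a validator would enforce. `REQUIRED_CHILDREN` and `invalid_customers` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed validity rule for <customer>, written out by hand.
REQUIRED_CHILDREN = ["name", "id", "addr"]

def invalid_customers(xml_text):
    """Return the positions of <customer> elements that break the rule."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    bad = []
    for position, customer in enumerate(root.iter("customer")):
        if [child.tag for child in customer] != REQUIRED_CHILDREN:
            bad.append(position)
    return bad

document = (
    "<CustomerList>"
    "<customer><name>A</name><id>1</id><addr>X</addr></customer>"
    "<customer><name>B</name><addr>Y</addr></customer>"  # well-formed, no id
    "</CustomerList>"
)
print(invalid_customers(document))
```

The second customer parses without error, yet it fails the rule: well-formed but not valid.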

31
Sample Document Type Definition
<!DOCTYPE CustomerList [
  <!ELEMENT CompanyName (#PCDATA)>
  <!ELEMENT CompanyPhone (#PCDATA)>
  <!ELEMENT customers (customer)>
  <!ELEMENT customer (name,id,addr)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
]>
32
Sample Schema
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.cc.gatech.edu/cs2340"
    xmlns="http://www.cc.gatech.edu/cs2340"
    elementFormDefault="qualified">
<xs:element name="CustomerList">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="CompanyName" type="xs:string"/>
      <xs:element name="CompanyPhone" type="xs:string"/>

33
Schema Continued
      <xs:element name="customer">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="id" type="xs:string"/>
            <xs:element name="addr" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>

34
Parsing XML
  • You could do it yourself...
  • SAX: Simple API for XML
  • Event-Based
  • Report parsing events and handle as they happen
  • www.saxproject.org
  • DOM: Document Object Model
  • Tree-Based
  • Parse entire doc into tree, then query
  • www.w3.org/DOM

35
SAX Example
  • <?xml version="1.0"?>
  • <doc>
  • <para>Hello, world!</para>
  • </doc>

start document
start element: doc
start element: para
characters: Hello, world!
end element: para
end element: doc
end document
36
Using SAX
  • Subclass the class SaxHandler
  • Override as necessary the messages
  • startDocument
  • endDocument
  • startElement: aName attributeList: attributes
  • endElement: aName
  • characters: aString
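The same pattern exists in Python's standard `xml.sax` module, where the handler class to subclass is `xml.sax.ContentHandler`. The sketch below is a Python analogue, not the Squeak API; `EventLogger` is a hypothetical name, and the input is the small `doc`/`para` document written without whitespace between tags so that no extra character events appear.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each parsing event as it is reported."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start element: " + name)

    def endElement(self, name):
        self.events.append("end element: " + name)

    def characters(self, content):
        self.events.append("characters: " + content)

handler = EventLogger()
xml.sax.parseString(
    b'<?xml version="1.0"?><doc><para>Hello, world!</para></doc>',
    handler,
)
print("\n".join(handler.events))
```

Only the overridden methods do any work; events the handler does not care about fall through to the empty defaults in the base class.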

37
For DOM, we get document first
f := FileStream fileNamed: 'samplexml2.xml'.
x := XMLDOMParser parseDocumentFrom: f.
x now contains an object of type XMLDocument.
Note that DOM uses SAX to build the in-memory
tree.
38
By Jonathan D'Andries
39
Getting elements out
document elements returns an OrderedCollection
of elements in the document.
(document elements) at: 1 gets us the root
XMLElement. So do:
    document topElement
    document elementAt: #rootElementName
We can then use firstTagNamed: #customer.
We can also use tagsNamed: #customer do: aBlock
to execute the same code for each matching tag.
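For comparison, here is the same kind of tree query using Python's standard `xml.dom.minidom`. It is an analogue of the Squeak calls, not the same API: `documentElement` plays the role of the top element and `getElementsByTagName` the role of the tag lookups. The sample string and the `text_of` helper are assumptions made for the sketch.

```python
from xml.dom import minidom

SAMPLE = """<CustomerList>
  <CompanyName>Extroon Incorporated</CompanyName>
  <customer><name>Bob Waters</name><id>126423</id></customer>
  <customer><name>Sally Smith</name><id>559382</id></customer>
</CustomerList>"""

document = minidom.parseString(SAMPLE)  # whole document parsed into a tree
root = document.documentElement         # the root element
customers = root.getElementsByTagName("customer")
first = customers[0]

def text_of(element, tag):
    """Text content of the first descendant element named tag."""
    node = element.getElementsByTagName(tag)[0]
    return node.firstChild.data

names = [text_of(customer, "name") for customer in customers]
print(root.tagName, names)
```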
40
Making document from scratch
  • createHeader
        | aTopElement |
        document := XMLDocument new.
        aTopElement := XMLElement named: 'CustomerList'
            attributes: Dictionary new.
        aTopElement addElement: (self makeSubElement:
            'CompanyName' content: 'FooBar Inc').
        aTopElement addElement: (self makeSubElement:
            'CompanyPhone' content: '990-555-1345').
        document addElement: aTopElement

41
Making a string subelement
makeSubElement: aTagName content: aStringContent
    | anXMLElement |
    anXMLElement := XMLElement named: aTagName
        attributes: Dictionary new.
    anXMLElement addContent: (XMLStringNode string: aStringContent).
    ^anXMLElement
42
Making a subgroup
createCustomer: aName id: anId status: aStatus
    | top aCustElement |
    top := document topElement.
    aCustElement := XMLElement named: 'Customer'
        attributes: Dictionary new.
    aCustElement attributeAt: 'status' put: aStatus.
    aCustElement addElement: (self makeSubElement: 'name' content: aName).
    aCustElement addElement: (self makeSubElement: 'id' content: anId).
    top addElement: aCustElement
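The three steps (header, string subelement, customer subgroup) can be mirrored with Python's standard `xml.dom.minidom`. This is an analogue for comparison, not the Squeak API. The function names follow the slides loosely, and the status value "active" is a made-up example.

```python
from xml.dom import minidom

def make_sub_element(document, tag_name, content):
    """Create <tag_name>content</tag_name>."""
    element = document.createElement(tag_name)
    element.appendChild(document.createTextNode(content))
    return element

def create_header():
    """Create a document whose root holds the company details."""
    document = minidom.Document()
    top = document.createElement("CustomerList")
    top.appendChild(make_sub_element(document, "CompanyName", "FooBar Inc"))
    top.appendChild(make_sub_element(document, "CompanyPhone", "990-555-1345"))
    document.appendChild(top)
    return document

def create_customer(document, name, customer_id, status):
    """Append a Customer subgroup to the root element."""
    customer = document.createElement("Customer")
    customer.setAttribute("status", status)
    customer.appendChild(make_sub_element(document, "name", name))
    customer.appendChild(make_sub_element(document, "id", customer_id))
    document.documentElement.appendChild(customer)
    return customer

document = create_header()
create_customer(document, "Bob Waters", "126423", "active")
print(document.documentElement.toxml())
```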
43
SIXX
  • Smalltalk Instance eXchange in XML
  • SIXX is an XML serializer/deserializer
  • Store and load Smalltalk objects in a portable,
    dialect-independent XML format.
  • Pointer on co-web

44
Using SIXX
  • SixxWriteStream and SixxReadStream
  • Write/read Smalltalk objects much as you would
    with a binary object stream.
  • Writing objects to an external file
  • sws := SixxWriteStream newFileNamed: 'obj.sixx'.
  • sws nextPut: <object>.
  • sws nextPutAll: <collection of objects>.
  • sws close.
  • And to read objects from an external file
  • srs := SixxReadStream readOnlyFileNamed: 'obj.sixx'.
  • objects := srs contents.
  • srs close.

45
Scanning and Parsing in Squeak
  • Defining Lexical and Syntactic Analysis
  • Scanning/Tokenizing
  • Parsing
  • Easy ways to do it
  • State Transition Tables
  • Recursive Descent Parsing
  • More sophisticated way
  • T-Gen: Lex and YACC for Squeak
  • SmaCC
  • XML: SAX, DOM
  • SIXX
  • Examples from Squeak
  • Smalltalk parser
  • HTML parser

46
Smalltalk Parser
47
Smalltalk's Parser is Recursive Descent!
  • Scanner methods are in Parser
  • Scanning method category: advance, endOfLastToken,
    match:, matchToken:, startOfNextToken
  • All the kinds of messages are defined in
    Expression Types
  • argumentName, assignment:, blockExpression,
    braceExpression, cascade, expression,
    messagePart:repeat:, method:context:,
    pattern:inContext:, primaryExpression,
    statements:innerBlock:, temporaries,
    temporaryBlockVariables, variable

48
Example: Parsing an Assignment
  • assignment: varNode
        " var ':=' expression => AssignmentNode."
        | loc |
        (loc := varNode assignmentCheck: encoder at:
            prevMark + requestorOffset) >= 0
            ifTrue: [^self notify: 'Cannot store into' at: loc].
        varNode nowHasDef.
        self advance.
        self expression ifFalse: [^self expected: 'Expression'].
        parseNode := AssignmentNode new
            variable: varNode
            value: parseNode
            from: encoder.
        ^true

49
HtmlParser
  • Used for Scamper
  • HtmlParser parse: '
    <html>
    <head>
    <title>Fred the Page</title>
    </head>
    <body>
    <h1>Fred the Body</h1>
    This is a body for Fred.
    </body>
    </html>'

50
HtmlParser returns an HtmlDocument
  • HtmlDocument has contents, which is an
    OrderedCollection
  • HtmlHead
  • HtmlBody
  • HtmlEntityHierarchy exists

51
Walk the Object Structure
  • doc := HtmlParser parse: '
    <html>
    <head>
    <title>Fred the Page</title>
    </head>
    <body>
    <h1>Fred the Body</h1>
    This is a body for Fred.
    </body>
    </html>'.
  • body := doc contents last. "This should be an
    HtmlBody"
  • body contents detect: [:entity | entity isKindOf:
    HtmlHeader]. "This should be the first heading."
  • PrintIt:
  • <'h1'>
  • Fred the Body
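For comparison, the first heading can also be found with Python's standard `html.parser`. Note the difference in technique: `HTMLParser` is event-based (closer to SAX) and does not build an object tree like Squeak's HtmlParser, so this sketch collects the heading text while the events stream by. `FirstHeadingFinder` is a hypothetical name.

```python
from html.parser import HTMLParser

class FirstHeadingFinder(HTMLParser):
    """Collect the text of the first <h1> element seen."""
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.heading = None

    def handle_starttag(self, tag, attrs):
        # Only the first h1 is recorded; later ones are ignored.
        if tag == "h1" and self.heading is None:
            self.in_heading = True
            self.heading = ""

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.heading += data

finder = FirstHeadingFinder()
finder.feed("""<html>
<head>
<title>Fred the Page</title>
</head>
<body>
<h1>Fred the Body</h1>
This is a body for Fred.
</body>
</html>""")
print(finder.heading)
```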