Title: Markup


1
Markup
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Smoothing
  • Smoothing a model is the replacement of some
    probability estimates with other estimates
    believed to be more reliable
  • Avoid estimating zero probabilities by always
    assigning a non-zero (very small) probability to
    all possible events
  • Additive smoothing: smooth by increasing observed
    counts by a constant
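
A minimal sketch of additive smoothing, assuming a small closed vocabulary. The function name `additive_smoothing`, the constant `k`, and the toy counts are illustrative and not taken from the slides.

```python
from collections import Counter

def additive_smoothing(counts, vocabulary, k=1.0):
    """Estimate P(w) as (count(w) + k) / (total + k * |V|)."""
    total = sum(counts.get(w, 0) for w in vocabulary)
    denom = total + k * len(vocabulary)
    return {w: (counts.get(w, 0) + k) / denom for w in vocabulary}

observed = Counter(["dog", "dog", "cat"])
vocab = ["dog", "cat", "bird"]  # "bird" was never observed
probs = additive_smoothing(observed, vocab, k=1.0)

for word in vocab:
    print(word, round(probs[word], 3))
```

The unseen word "bird" receives a small non-zero probability, and the estimates still sum to one over the vocabulary.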

3
Two kinds of smoothing
  • Backoff smoothing: estimate using the probability
    of a more general event that directly includes
    the event in question
  • Class-based smoothing: estimate using the
    probability of a class of events that is
    independently defined but is correlated with our
    event
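
A rough sketch of the backoff idea, assuming a bigram model that falls back to unigram frequencies. This is a simplified variant without discounting, so the scores are not guaranteed to form a proper probability distribution. The toy corpus, the function name `backoff_score`, and the weight `alpha` are assumptions made for illustration.

```python
from collections import Counter

tokens = "the dog barks the cat sleeps the dog sleeps".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def backoff_score(prev, word, alpha=0.4):
    """Use the bigram estimate when the bigram was seen; otherwise
    back off to the more general unigram estimate, scaled by alpha."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / len(tokens)

print(backoff_score("the", "dog"))    # seen bigram
print(backoff_score("cat", "barks"))  # unseen bigram, backs off
```

Here the general event (the word occurring at all) includes the specific event (the word occurring after a given predecessor), which is what makes it usable as a fallback.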

4
Performance of statistical sentence separation
  • Decision trees (1989)
  • Neural networks (1997) 98-99% accuracy using the
    class probabilities
  • Maximum entropy (1997) 99.25% accuracy

5
Markup
  • A text, once processed, is usually stored with
    additional information in the form of markup
  • Markup gives structural and possibly
    classification information about the elements of
    the text (words and sentences)
  • Need to be aware of markup to use it and not
    include it in the source text

6
Markup functions
  • Describe the document (source, name, author,
    date, etc.)
  • Separate paragraphs, sentences, and words
  • Add syntactic and/or semantic information to each
    word
  • Display markup (font, font size, positioning)
    irrelevant for our purposes

7
Ad Hoc Markup
  • Initially, designed for each corpus as the need
    arose
  • In the early Wall Street Journal corpus (ACL-DCI,
    1990):
  • Fixed strings for document start/end, date,
    source
  • Each word on a single line
  • Special markers for start/end of sentence
  • Words annotated as word/part-of-speech, e.g.,
    dog/NN, and/CC, enter/VB
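
A small sketch of reading word/part-of-speech annotations of this kind. The function name `parse_tagged` is hypothetical. Splitting on the last slash is an assumption, made so that words that themselves contain a slash stay intact.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so "1/2/CD" -> ("1/2", "CD")
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged("dog/NN\nand/CC\nenter/VB")
print(pairs)
```

Because `str.split()` treats newlines as whitespace, the same function handles one token per line as well as several tokens on a line.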

8
SGML
  • Standard Generalized Markup Language
  • ISO standard (1986)
  • Allows for the description of tree structure in
    text form
  • SGML documents consist of:
  • characters (from a defined character set)
  • entity and character references
  • elements

9
SGML references
  • Character references for specifying non-letter
    characters, e.g., &#038; stands for &
  • Form is &#number; or &#xnumber;
  • Entity references expand to predefined strings
    that can be part of the standard (such as &amp;
    for &) or dynamically vary (&docdate;)
  • Form is &name;
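
To see reference expansion in practice, the sketch below uses Python's standard `html.unescape`. It is an HTML utility, not a general SGML processor, so treat it only as an illustration of the decimal, hexadecimal, and named forms.

```python
import html

decimal_ref = html.unescape("&#038;")  # decimal character reference
hex_ref = html.unescape("&#x26;")      # hexadecimal character reference
named_ref = html.unescape("&amp;")     # predefined entity reference

print(decimal_ref, hex_ref, named_ref)
```

All three forms expand to the same ampersand character, since 38 decimal and 26 hexadecimal are the same code point.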

10
SGML elements
  • Elements are introduced by opening (<tag>) and
    closing (</tag>) tags
  • Elements can contain content between the opening
    and closing tags
  • Content can contain other elements
  • Elements can have attributes with values
  • <p><s type="declarative"> <w pos="personal
    pronoun"> I </w> like you. </s> </p>
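
The slide's example happens to be well-formed XML once the attribute values are quoted (the quoting is an assumption here), so it can be explored with Python's standard `xml.etree.ElementTree`. This is an XML parser, not an SGML one, and is used only to show the tree structure, the attributes, and how to recover the source text without the markup.

```python
import xml.etree.ElementTree as ET

sample = ('<p><s type="declarative"> <w pos="personal pronoun"> I </w>'
          ' like you. </s> </p>')

root = ET.fromstring(sample)   # the <p> element
sentence = root.find("s")      # nested <s> element
word = sentence.find("w")      # nested <w> element

print(root.tag, sentence.get("type"), word.get("pos"))

# Strip the markup: join all text nodes and normalize whitespace.
plain = " ".join("".join(root.itertext()).split())
print(plain)
```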

11
SGML specification
  • Tags and entity references are defined in a
    separate controlling document, the Document Type
    Definition (DTD)
  • The DTD also specifies allowable attributes and
    element combinations, e.g., that a <p> must
    contain only <s> elements
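
A validating parser would enforce such a rule from the DTD itself. As a rough illustration only, the hand-written check below tests the single constraint mentioned above on already-parsed XML. The function name `p_contains_only_s` and the `<doc>` wrapper element are hypothetical, and the check looks only at child elements, not at stray text inside a paragraph.

```python
import xml.etree.ElementTree as ET

def p_contains_only_s(root):
    """Return True if every <p> element has only <s> child elements."""
    return all(child.tag == "s"
               for p in root.iter("p")
               for child in p)

good = ET.fromstring("<doc><p><s>Hi.</s><s>Bye.</s></p></doc>")
bad = ET.fromstring("<doc><p><s>Hi.</s><w>stray</w></p></doc>")

print(p_contains_only_s(good), p_contains_only_s(bad))
```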

12
Benefits of SGML
  • It's human-readable
  • It's machine-readable and portable across systems
  • It's not proprietary
  • Can represent nested structure and lists
  • Can represent arbitrary characters
  • Can enforce correctness of structure
  • Tools automate analysis and display

13
SGML and the web
  • HTML (Hypertext Markup Language) is based on SGML
    but with a predefined DTD
  • HTML is a W3C standard (of sorts)
  • HTML does not conform to all SGML rules
  • (e.g., tags with no closing counterpart)

14
XML
  • Extensible Markup Language
  • Also a W3C standard
  • Streamlines SGML by
  • eliminating some of the more complex DTD
    constructs
  • introducing Unicode support
  • introducing data types and namespaces

15
Unicode
  • A robust, extensible encoding of characters from
    many languages
  • Characters take from 1 to 4 bytes; byte order is
    shown by the first character (the byte order mark)
  • First 256 characters identical to ISO-Latin-1
  • Leaves rendering to browser/word processor
  • Supports Mongolian, Tamil as well as Etruscan and
    Linear B
  • Rejected suggestions include Klingon and Elvish
    character sets
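
Two of these properties can be checked with Python's standard codecs. The sketch assumes that Python's generic "utf-16" codec writes a byte order mark in the machine's native order, which is its documented behavior.

```python
import codecs

# The first 256 Unicode code points coincide with ISO-Latin-1.
latin1_text = bytes(range(256)).decode("latin-1")
code_points = [ord(ch) for ch in latin1_text]

# The generic UTF-16 codec prefixes a byte order mark (BOM).
encoded = "A".encode("utf-16")
bom = encoded[:2]

print(code_points[:5], bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```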

16
UTF-8 and UTF-16
  • Unicode transformation formats
  • UTF-8 takes one byte and expands to two, three,
    or four if necessary
  • UTF-16 takes two bytes and expands to four if
    necessary
  • UTF-16 is native for Microsoft Windows XP, MacOS,
    Java and .NET
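
The variable widths can be observed by encoding a few characters and counting bytes. The sample characters are chosen for illustration; U+10000 is a Linear B syllable, which lies outside the 16-bit range. The "utf-16-le" codec is used so that no byte order mark is counted.

```python
samples = {
    "latin letter A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "Linear B syllable": "\U00010000",
}

# Map each label to (bytes in UTF-8, bytes in UTF-16).
widths = {name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
          for name, ch in samples.items()}

for name, (utf8_len, utf16_len) in widths.items():
    print(f"{name}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

UTF-8 uses one, two, three, and four bytes for these four characters, while UTF-16 uses two bytes for the first three and four bytes (a surrogate pair) for the last.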

17
Representing text annotations
  • Text-specific DTDs were developed for
    representing annotations for corpora
  • Most well-known effort is the TEI (Text Encoding
    Initiative) standard, first in SGML and now in
    XML
  • Sponsored by the ACL, ACH, ALLC with funding from
    NEH, EU, and others
  • See http://www.tei-c.org/ and
    http://www.tei-c.org/Guidelines/P5/

18
Grammatical tagging
  • One of the common additions to a text before
    further processing
  • A tag is added to each word
  • Allowed tags and their meanings can vary

19
Differences between tag sets
  • Size (controls level of granularity)
  • 87 (179) for Brown corpus, 132 (62) for British
    National Corpus
  • Syntactic theory
  • Features (comparative / superlative)
  • Focus (syntactic / semantic / morphology /
    distributional)
  • VBG for gerunds as adjectives
  • VB for all verbs, including auxiliaries

20
Tag set variation

21
Combined tags
  • Refinements introduced for special cases
    (examples from Brown tag set)
  • NP-TL for title words
  • NP-FW for foreign words
  • Multiple tags for graphic words that contain
    multiple lexemes
  • BEZ* for isn't
  • PPS+MD for she'll
  • NN$ for possessive singular nouns
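
A sketch of how combined tags of this kind might be taken apart, assuming the Brown conventions that "+" joins the tags of multiple lexemes, a hyphenated suffix such as -TL refines a tag, and a trailing "*" marks negation. The function name `decompose` and the dictionary layout are illustrative, not part of any tag set specification.

```python
def decompose(tag):
    """Split a combined tag into one dict per lexeme."""
    lexemes = []
    for piece in tag.split("+"):            # "+" separates lexemes
        base, _, modifier = piece.partition("-")
        lexemes.append({
            "base": base.rstrip("*"),       # tag without the negation mark
            "negated": base.endswith("*"),  # "*" marks a negated form
            "modifier": modifier or None,   # e.g. "TL" for title words
        })
    return lexemes

for tag in ["NP-TL", "BEZ*", "PPS+MD"]:
    print(tag, decompose(tag))
```

The possessive marker in NN$ is left as part of the base tag here, since it names a distinct tag and not a separate lexeme.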

22
Reading
  • Parts of speech: Sections 3.1.1-3.1.4
  • Section 4.3 on markup and grammatical tags
  • Explore the TEI web site