Title: Corpora


1
Corpora
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Corpora
  • A corpus is a set of texts used together to train
    or evaluate a system
  • Various corpora built in increasing size
  • Brown corpus (1 million words)
  • Wall Street Journal corpora (7-20 Mwords)
  • Penn Treebank (5-10 Mwords)
  • Canadian Hansard corpus (26 Mwords, bilingual)
  • British National Corpus (BNC, 100 Mwords;
    BNC-Baby, 4 Mwords)

3
The Brown corpus
  • Built in the 1960s
  • A collection of sample documents (sometimes
    excerpts) published in 1961 in the US (written
    American English, prose)
  • 15 categories (e.g., Press: Reportage,
    Press: Editorial, Fiction: Mystery, Religion)
  • 500 texts, 1 million words
  • Balance across categories a major goal

4
Similar corpora
  • Brown has served as the standard reference corpus
    for many applications
  • Version with POS information
  • Corpora with similar design criteria and size
    have been developed for other English dialects
  • Lancaster-Oslo-Bergen (LOB, British English)
  • Kolhapur (Indian English)
  • New versions of Brown and LOB with texts from
    1991 (Freiburg)

5
The WSJ and NY Times corpora
  • Based on text published in the respective
    newspapers
  • No particular concern for balancing across
    subgenres
  • WSJ corpora were built in the early 1990s, NY
    Times corpora in the late 1990s
  • Size varies from 5-30 million words for WSJ to 70
    million words for NY Times

6
The Penn Treebank
  • Based mostly on WSJ text
  • 1-10 million words, depending on version
  • Rich in annotation
  • Part of speech
  • Syntactic bracketing and annotation (1990s)
  • (I (ate (meatballs (with spaghetti))))
  • Semantic annotation: predicate-argument structure
    (2001 and later)
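
The bracketing example above can be read mechanically. The following is a minimal sketch, not the Treebank's own tooling: it parses an unlabeled bracketed string like the one on the slide into nested Python lists. The function names (parse_brackets, depth, leaves) are hypothetical, and actual Treebank files attach category labels such as S or NP to each bracket, which this sketch does not interpret.

```python
def parse_brackets(text):
    """Parse a bracketed string into nested lists of word tokens."""
    # Pad parentheses with spaces so they split off as separate tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level constituents
    for tok in tokens:
        if tok == "(":
            stack.append([])          # open a new constituent
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()        # close the current constituent
            stack[-1].append(node)    # attach it to its parent
        else:
            stack[-1].append(tok)     # plain word
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def depth(node):
    """Nesting depth of a parsed constituent (a bare word has depth 0)."""
    if isinstance(node, str):
        return 0
    return 1 + max((depth(child) for child in node), default=0)


def leaves(node):
    """Words of a parsed constituent, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node:
        words.extend(leaves(child))
    return words


# The example from the slide; it has a single top-level constituent.
tree = parse_brackets("(I (ate (meatballs (with spaghetti))))")[0]
print(tree)
print(depth(tree), leaves(tree))
```

In this bracketing, "with spaghetti" attaches inside the constituent headed by "meatballs" rather than directly under "ate", which is the kind of attachment decision the syntactic annotation records.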

7
Speech corpora
  • Usually smaller than their written text
    counterparts, often specialized
  • TI-DIGITS (1982): digit sequences
  • 77 digit sequences
  • 326 speakers (men, women, boys, girls)
  • Conversations in a specific domain
  • ATIS (1991): airline reservations, 1040
    utterances from 36 speakers

8
Speech corpora
  • Conversations, unscripted
  • Switchboard (1992): 2.4 million words of
    telephone conversations in American English
  • 120 hours of speech
  • Conversations, multilingual
  • CallHome (1996 and later): 200 conversations in
    each of American English, Japanese, Mandarin,
    Spanish, German, and Egyptian Arabic

9
The British National Corpus
  • 100 million words, 4,124 texts
  • Balanced among genres
  • 90% written, 10% speech (863 transcribed texts)
  • BNC-Baby: 4 Mword cheap version
  • If converted to books: 30 ft of shelf space
  • If read aloud (8 hours per day): 4 years
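
The size figures above can be sanity-checked with simple arithmetic. This is a rough sketch using only the numbers on the slide; reading on all 365 days of the year is an assumption, since the slide states only the hours per day.

```python
BNC_WORDS = 100_000_000   # from the slide
BNC_TEXTS = 4_124         # from the slide
SHELF_FEET = 30           # from the slide
HOURS_PER_DAY = 8         # from the slide
YEARS = 4                 # from the slide
DAYS_PER_YEAR = 365       # assumption: no days off

# Total reading time in minutes over the four years.
reading_minutes = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY * 60

# Implied reading-aloud rate, average text length, and words per shelf foot.
words_per_minute = BNC_WORDS / reading_minutes
words_per_text = BNC_WORDS / BNC_TEXTS
words_per_shelf_foot = BNC_WORDS / SHELF_FEET

print(f"reading rate:  {words_per_minute:.0f} words per minute")
print(f"average text:  {words_per_text:,.0f} words")
print(f"shelf density: {words_per_shelf_foot:,.0f} words per foot")
```

The implied rate comes out a little over 140 words per minute, a plausible pace for reading aloud, so the "4 years" figure is consistent with the stated corpus size.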

10
NYT Annotated Corpus
  • Contains almost all articles that appeared in the
    New York Times between January 1987 and June 2007
  • Prepared by the NYT R&D division, available
    through the LDC (October 2008) for $300
  • 1.8 million articles
  • XML News Industry Text Format (NITF)
  • Annotations about topics, places, persons

11
Multilingual corpora
  • Parallel corpora
  • Canadian Hansard corpus: debates from the
    Canadian parliament, early 1990s, 26 million
    words
  • Comparable multilingual corpora
  • Same genre, maybe same topic, but not parallel
  • Various corpora from international organizations
    (e.g., United Nations)

12
Getting corpora
  • Major organizations for distributing corpora
  • The Linguistic Data Consortium in the U.S.
    (http://www.ldc.upenn.edu/)
  • The European Language Resources Association
    (http://www.elra.info/)
  • Repositories of freely accessible corpora
  • Oxford Text Archive (http://ota.ahds.ac.uk/)
  • Electronic Text Archive
    (http://etext.lib.virginia.edu/)
  • Project Gutenberg (http://promo.net/pg/)

13
Reading
  • Sections 1.4.1 and 4.1.2 on corpora
  • Explore the named web sites for corpora