Nancy Ide - PowerPoint PPT Presentation

American National Corpus. Corpus Linguistics 2000, Lancaster, England.

1
  • Nancy Ide Vassar College
  • Catherine Macleod New York University

2
Why we need an ANC
  • Brown Corpus of American English
  • Too small to provide representative examples
  • Pre-1960 only
  • No spoken data
  • British National Corpus
  • Not representative of American English
  • Texts up to 1993 only

3
British vs. American English
  • Lexical Items
  • Bobby vs. cop, underground vs. subway, lorry vs.
    truck, pavement vs. sidewalk, football vs.
    soccer
  • Grammatical structures
  • She could not endure to live with him vs. She
    could not endure living with him.
  • Have you a pen? vs. Do you have a pen?
  • Modals
  • shall vs. should vs. ought vs. will vs.
    would vs. should
  • Adverbial Usage
  • Immediately I get home vs. As soon as I get
    home
  • Support Verbs
  • take a decision vs. make a decision

4
ANC Background
  • June 1998
  • ANC proposed at LREC98 by Charles Fillmore,
    Nancy Ide, Daniel Jurafsky, Catherine Macleod
  • May 1998
  • Publishers Day in Berkeley in conjunction with
    DSNA
  • November 1999
  • Organizational meeting, New York University

5
ANC Consortium
  • Pearson Education
  • Random House Publishers
  • Langenscheidt Publishing Group
  • Harper Collins Publishers
  • Cambridge University Press
  • LexiQuest
  • Microsoft Corporation
  • Shogakukan, Inc.
  • Associated Liberal Creators Press
  • Taishukan Publishers
  • Oxford University Press
  • Kenkyusha Publishers
  • IBM Corporation

6
Contributors
  • Founding consortium members
  • $21,000 over 3 years
  • Texts
  • Linguistic Data Consortium
  • Management and distribution of the ANC
  • Manpower and expertise to create initial version
  • NYU and Vassar
  • Expertise and manpower for corpus creation and
    annotation

7
ANC Makeup
  • Core static corpus
  • Texts and transcriptions of spoken data
  • 1990 onwards
  • Comparable in balance to BNC
  • Enables comparative studies
  • At least 100 million words
  • Snapshot of American English at the end of the
    millennium

8
Dynamic component
  • Not necessarily balanced
  • Dictated by availability
  • Includes email, ephemera, rap lyrics, newsgroups,
    etc. plus historically important works from
    various time periods
  • Add 10% every five years
  • Layered organization
  • Dynamic component layered chronologically as
    added

9
Eventual components
  • annotated and aligned speech data
  • dialects of American and Canadian English
  • other major languages of North America
  • Spanish, French Canadian
  • aligned to parallel translations in English.

High costs of production prevent inclusion at this stage

10
Encoding and annotation
  • Markup compliant with the XML Corpus Encoding
    Standard (XCES)
  • Annotation
  • part of speech
  • Sub-paragraph elements
  • E.g., tokens, names, dates, numbers
  • Produced in a two-stage process

11
Stage 1: Base-level corpus
  • Produced after year 1, using limited resources
  • XML markup compliant with XCES level 0
  • Markup produced by automatic transduction from
    original formats
  • Automatically tagged for part of speech
  • Only spot checking for validity
  • Minimal header
  • hand-produced
  • Includes domain information
  • Useful for concordance generation, collocation
    analysis
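
As an illustration of the concordance use mentioned above, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not ANC software: the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for this example, and the tokenizer is a simple regular expression rather than the corpus's own tokenization.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side of a hit."""
    # Crude tokenization (an assumption): lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, not drawn from the corpus.
sample = "Do you have a pen? She took the subway and I have a truck."
concordance = kwic(sample, "have", width=2)
for line in concordance:
    print(line)
```

A real concordancer would read tokens from the corpus markup instead of re-tokenizing raw text, but the windowing logic would look much the same.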

12
Stage 2: Final corpus
  • Available after year 3
  • XML markup conformant to XCES level 1
  • Full header
  • Markup for major structural divisions,
    paragraphs, sentence boundaries
  • Markup for some sub-paragraph elements, where this
    can be done automatically
  • E.g., tokens, names, dates, numbers
  • 10% of markup and annotation hand-validated
  • gold standard corpus

13
Data architecture
  • Follow XCES specifications for stand-off markup
  • Annotations in separate XML documents, linked to
    original
  • Easy to modify and/or add to
  • Enables a distributed development model
  • Different sites independently add annotation
  • Suitable for delivery over the WWW
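
The stand-off idea above can be sketched as follows: the original text is left untouched, and each annotation layer lives in its own XML document that points back into the text. The element and attribute names here (`annotations`, `span`, `start`, `end`, `tag`) and the use of character offsets are assumptions made for illustration; they are not the actual XCES linking mechanism.

```python
import xml.etree.ElementTree as ET

# The original text, which is never modified.
primary_text = "Nancy Ide spoke in Lancaster."

# Hypothetical part-of-speech layer, stored separately from the text.
pos_annotation = """
<annotations type="pos">
  <span start="0" end="5" tag="NNP"/>
  <span start="6" end="9" tag="NNP"/>
  <span start="10" end="15" tag="VBD"/>
  <span start="16" end="18" tag="IN"/>
  <span start="19" end="28" tag="NNP"/>
</annotations>
"""

# A second hypothetical layer, which another site could add independently.
name_annotation = """
<annotations type="names">
  <span start="0" end="9" tag="person"/>
  <span start="19" end="28" tag="place"/>
</annotations>
"""

def resolve(text, annotation_xml):
    """Pair each annotated span of the original text with its tag."""
    root = ET.fromstring(annotation_xml)
    return [
        (text[int(span.get("start")):int(span.get("end"))], span.get("tag"))
        for span in root.iter("span")
    ]

print(resolve(primary_text, pos_annotation))
print(resolve(primary_text, name_annotation))
```

Because neither layer touches the text or the other layer, either one can be corrected, replaced, or extended on its own, which is what makes distributed development practical.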

14
Software
  • ANC project will provide search and access
    software
  • Encoding via XML and layered architecture enables
    exploiting the evolving XML environment for
    search, access, manipulation of ANC data
  • XML Transformation Language (XSLT)
  • Resource Description Framework (RDF)
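
To give a rough sense of how generic XML tooling can search annotated text, the sketch below uses the XPath subset in Python's standard `xml.etree.ElementTree` module. It stands in for the XSLT and RDF tools named above rather than demonstrating them, and the markup (`text`, `s`, `tok`, `pos`) is a hypothetical simplification, not the XCES schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical token-level markup with part-of-speech attributes.
doc = ET.fromstring("""
<text>
  <s><tok pos="PRP">She</tok><tok pos="VBD">took</tok><tok pos="DT">the</tok><tok pos="NN">subway</tok></s>
  <s><tok pos="PRP">He</tok><tok pos="VBD">drove</tok><tok pos="DT">a</tok><tok pos="NN">truck</tok></s>
</text>
""")

# Find every token tagged as a singular noun, anywhere in the document.
nouns = [tok.text for tok in doc.findall(".//tok[@pos='NN']")]

# Rebuild the plain text of each sentence from its tokens.
sentences = [" ".join(tok.text for tok in s.findall("tok")) for s in doc.findall("s")]

print(nouns)
print(sentences)
```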

15
Availability
  • Freely available to non-profit educational and
    research organizations from the outset
  • No restrictions on obtaining the corpus based on
    geographical location
  • Consortium members have exclusive access for
    commercial exploitation for 5 years
  • Distributed by LDC

16
Licensing
  • LDC
  • obtains licenses from text providers
  • issues licenses to users
  • no redistribution without publishers' permission
  • open sub-corpus portion of the ANC
  • licensed on the model of open-source software

17
ANC Status
  • Founding memberships closed March 31 2001
  • Consortium membership now $40K
  • Text gathering, format transduction, header
    production underway
  • Base corpus due March 31 2002
  • Preparing production of level 1 corpus
  • Gathering technical input from research community
  • ANLP/NAACL workshop (Seattle, April 2000)
  • LREC workshop (Athens, June, 2000)
  • Seeking major funding
  • Final core corpus due March 31 2004

18
Information
  • ANC
  • http://AmericanNationalCorpus.org
  • Project Director
  • Catherine Macleod <macleod@cs.nyu.edu>
  • Technical Director
  • Nancy Ide <ide@cs.vassar.edu>
  • XCES
  • http://www.cs.vassar.edu/XCES