1
Writing a Corpus Cookbook
  • Martin Wynne
  • Oxford Text Archive, Oxford University
  • Martin.Wynne_at_ota.ahds.ac.uk
  • 13th December 2001
  • Martin Wynne IRCS Workshop Philadelphia December
    2001

2
Outline
  • The problem: finding out about best practice in
    developing linguistic corpora
  • The Oxford Text Archive and the Arts and
    Humanities Data Service
  • A Guide to Good Practice
  • Best practice? Enforcing, recommending,
    suggesting and opting out.

3
Corpora growth and expansion
  • More people want to develop corpora
  • Corpora of different types are being planned and
    built
  • Developers are not usually corpus linguists
  • How do they get help on best practice?

4
Sources of information
  • Introductions to Corpus Linguistics
  • Manuals and handbooks accompanying corpora
  • Papers in academic journals
  • Papers in conference proceedings
  • Practical guides to methodology in corpus
    linguistics?

5
A Guide to Good Practice
  • a practical guide to how to go about building a
    corpus
  • articles written by leading experts in the field
  • state of the art: the application of current
    best practice
  • a range of alternative theoretical and
    methodological approaches.

6
The Oxford Text Archive
  • http://ota.ahds.ac.uk/
  • Collect, catalogue, preserve, and redistribute
    digital resources of interest to those working in
    literary and linguistic studies within the UK's
    Higher Education and FE communities.
  • Develop appropriate licensing conditions and
    technical mechanisms for the effective
    distribution of such resources.
  • Promote good practice in the creation and use of
    such resources in both research and teaching.

7
Guides to Good Practice
  • Alan Morrison, Michael Popham and Karen Wikander,
    Creating and Documenting Electronic Texts, Oxbow
    Books, Oxford and freely available online at
    http://ota.ahds.ac.uk/documents/creating/
  • Developing Linguistic Corpora: Drawing on the
    experiences of the British National Corpus
    Project, this Guide will examine the creation of
    large-scale electronic corpora for linguistic
    study. It will also identify the factors which
    must be taken into account when designing,
    creating, and distributing such corpora.
    http://ahds.ac.uk/guides.htm

8
Developing Linguistic Corpora (1)
  • The text and the corpus: some basic principles
  • Design and representativeness
  • Metadata and textual encoding
  • Working with speech
  • Working with languages other than English
  • Parallel corpora
  • Linguistic annotation
  • (Archiving and distribution).

9
Developing Linguistic Corpora (2)
  • Clear, practical guidelines
  • Published online, exploiting the possibilities of
    hypertext and multimedia
  • (Print version may be generated from hypertext)
  • No constraints on space for case studies,
    references and links.

10
Standards, Recommendations, Choices
  • Do we have sufficiently mature and established
    standards and practices, such that we can
    confidently recommend them to users?
  • An example:
  • Adding annotation (what linguists do?)
  • SGML (/XML): single hierarchy
  • Bracketing paradox problem.

11
Overlapping hierarchies (1)
  • One officer said: "This is like an episode from
    Inspector Morse.
  • "The victim was single but we believe he had
    several lady friends.
  • "It is possible that it was something in the
    background of one of those relationships that
    caused his death.
  • "We don't think he was linked with any criminals
    or involved in any secret wrong doing.
  • Police have not ruled out the possibility of a
    contract killing by a hitman.

12
Overlapping hierarchies (2)
  • <p>
  • <sp cat=NRS>
  • One officer said</sp>
  • <sp cat=DS>
  • "This is like an episode from Inspector
    Morse.</p>
  • <p>
  • "The victim was single but we believe he had
    several lady friends.</p>
  • <p>
  • "It is possible that it was something in the
    background of one of those relationships that
    caused his death.</p>
  • <p>
  • "We don't think he was linked with any criminals
    or involved in any secret wrong doing."</p></sp>
  • <p>
  • <sp cat=N>
  • Police have not ruled out the possibility of a
    contract killing by a hitman.</sp></p>
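The overlap on this slide can be checked mechanically. The following is a minimal sketch, assuming an XML rendering of the slide's markup with quoted attribute values and a hypothetical wrapping `<text>` element. It shows that a speech element crossing a paragraph boundary is rejected by an XML parser, and that one common workaround is to split the speech into one fragment per paragraph. The `part` attribute used to link the fragments is illustrative only.

```python
import xml.etree.ElementTree as ET

# The direct-speech <sp> opens in the first <p> and closes after the second,
# so the two hierarchies cross (adapted from the slide, attributes quoted).
overlapping = (
    '<text>'
    '<p><sp cat="NRS">One officer said</sp> '
    '<sp cat="DS">"This is like an episode from Inspector Morse.</p>'
    '<p>"The victim was single but we believe he had several lady friends.</p></sp>'
    '</text>'
)

# Workaround: one <sp> fragment per paragraph. The "part" attribute
# (I = initial, F = final) is a hypothetical way to tie the fragments together.
fragmented = (
    '<text>'
    '<p><sp cat="NRS">One officer said</sp> '
    '<sp cat="DS" part="I">"This is like an episode from Inspector Morse.</sp></p>'
    '<p><sp cat="DS" part="F">"The victim was single but we believe he had '
    'several lady friends.</sp></p>'
    '</text>'
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(overlapping))   # crossing elements
print(is_well_formed(fragmented))    # speech split per paragraph
```

Fragmenting keeps the file parseable, but the single stretch of speech is no longer one element, which is the cost the slide is pointing at.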

13
Advantages of stand-off
  • The integrity of the text is not compromised by
    the annotation
  • The text files don't become too large or too
    cluttered
  • Collaborative working is facilitated
  • Multiple annotations (and annotations of
    annotations) are made easier
  • Many more spin-offs from the separation of text
    and annotation of which we are as yet unaware
  • In fact, that's why people have been using it
    since the first large computer corpus.
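As a rough illustration of the first and fourth points, stand-off annotation can be sketched as a list of character-offset spans held apart from an untouched text. The `Span` class, the layer names and the shortened two-paragraph sample below are assumptions made for this sketch, not part of any particular stand-off standard.

```python
from dataclasses import dataclass

# The text itself is never modified; every annotation lives outside it.
TEXT = (
    'One officer said: "This is like an episode from Inspector Morse.\n'
    '"The victim was single but we believe he had several lady friends."\n'
)


@dataclass(frozen=True)
class Span:
    start: int   # offset of the first character
    end: int     # offset one past the last character
    layer: str   # which annotation layer the span belongs to
    label: str   # category within that layer


def span_for(substring, layer, label):
    """Build a Span covering the first occurrence of substring in TEXT."""
    start = TEXT.index(substring)
    return Span(start, start + len(substring), layer, label)


def crossing(a, b):
    """True if the spans partially overlap, i.e. neither contains the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


para1 = span_for(
    'One officer said: "This is like an episode from Inspector Morse.',
    "structure", "p")
para2 = span_for(
    '"The victim was single but we believe he had several lady friends."',
    "structure", "p")
narration = span_for("One officer said", "speech", "NRS")
# Direct speech runs from the first opening quote to the final closing quote.
direct = Span(TEXT.index('"This'), TEXT.rindex('"') + 1, "speech", "DS")

print(crossing(para1, direct))   # the speech span crosses the first paragraph
print(crossing(para2, direct))   # the second paragraph sits inside the speech
```

Because both layers only point into the text, the crossing spans coexist without any change to the text file, and a further layer can be added by a different annotator without touching the existing ones.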

14
Problems with XML stand-off
  • You need tokenisation of the text of sufficient
    granularity in order to knit in the annotation
  • You need unique ids for all tags
  • The annotated text is not a single text file
    readable by humans
  • Editing the text breaks the knitting
  • The technology is not sufficiently developed
    anyway.
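The tokenisation, unique-id and "editing breaks the knitting" points can be illustrated with a small sketch. It assumes whitespace tokenisation, token ids of the hypothetical form `w1`, `w2`, ... and an invented `NP` annotation; real stand-off schemes differ in all three respects.

```python
import re


def tokenise(text):
    """Map a unique id (w1, w2, ...) to the (start, end) offsets of each token."""
    return {
        f"w{i}": (m.start(), m.end())
        for i, m in enumerate(re.finditer(r"\S+", text), start=1)
    }


def resolve(text, annotation):
    """Return the stretch of text an annotation points at, via token ids."""
    tokens = tokenise(text)
    start = tokens[annotation["from"]][0]
    end = tokens[annotation["to"]][1]
    return text[start:end]


original = ("Police have not ruled out the possibility of a "
            "contract killing by a hitman.")

# Stand-off annotation stored apart from the text, knitted in by token id.
annotation = {"label": "NP", "from": "w10", "to": "w11"}

print(resolve(original, annotation))   # the annotated phrase

# Inserting a single word shifts every later token id, so the same
# annotation now points at the wrong stretch of text.
edited = original.replace("not ruled", "not yet ruled")
print(resolve(edited, annotation))
```

The annotation itself is unchanged between the two calls; only the text moved underneath it, which is why edits to the base text have to be controlled or the annotation re-aligned.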

15
Enforcing, recommending, suggesting (and not
opting out!)
  • Should we be promoting, or even recommending,
    standards which are difficult to adhere to?
  • And whose needs are being addressed in proposing
    standards anyway?
  • On the one hand, new procedures need a push to
    mature and to gain acceptance
  • We have to tread a difficult tightrope
  • WARNING: we don't want to scare people off
  • WARNING: we don't want to stifle innovation in
    practice and technology.

16
Am I at the wrong conference?
  • Database technology brings structured data and
    the structure of the data to the fore
  • Structured linguistic data is (should be)
    extracted from texts: it is an abstraction from
    what happens in texts
  • Annotation involves a loss of meaning
  • Linguistic facts must relate to meaning
  • Meaning exists in texts
  • Integrity of the text should be respected so that
    we can always go back to it to test and revise
    our analyses.