Data in Linguistics

About This Presentation
Title:

Data in Linguistics

Description:

Linguistics -- by tradition -- is not an exact or empirical science. Modern linguists have attempted to transform linguistics step by step into a science.

Slides: 11
Provided by: HansU151

Transcript and Presenter's Notes


1
Data in Linguistics
  • Linguistics -- by tradition -- is not an exact or
    empirical science. Modern linguists have
    attempted to transform linguistics step by step
    into a science.
  • An exact science needs formalized models and
    provable methods for verifying (or more often
    falsifying) theories.
  • Empirical science needs a methodology for
    obtaining, processing, and evaluating data, and
    for using data to verify (or falsify) theories.
  • An exact empirical science needs to establish the
    correspondence between data and formal models.
    Therefore data need to be interpreted.
    Quantitative data require methods and tools for
    measurement.
  • In linguistics, the quantitative branch of the
    discipline has been disconnected from the
    theoretical core of the field for many decades,
    since quantitative linguists could not measure
    phenomena that were in the focus of discussion.
    It was language technology that finally brought
    them together.
  • Example: Astronomy
      • Photographs and spectral analyses of distant
        heavenly bodies are scientific data. However,
        without interpretation in relation to the
        formal models, they are of little use.

2
Data in Linguistics
  • A well-developed and established concept of
    linguistic data is still missing.
  • There is no good theory of the relationship
    between different types of data, e.g.,
      • example sentences,
      • online performance experiments,
      • corpora,
      • treebanks,
      • test suites.
  • However, there has been progress in several
    areas, e.g.,
      • evaluating acceptability judgements: a
        methodology for subjective rating tests,
      • annotation and interpretation of data,
      • methods for using quantitative data in
        language technology.

3
Types of Linguistic Data 1
  • Linguistic data are often classified into
    "real" and "unreal" data depending on their
    origin.
  • However, this dichotomy does not fully cover the
    range of possible sources:
      • naturally occurring data, e.g.,
          • (balanced) reference corpora
          • specialized corpora for specific subject
            domains or applications
          • incidentally discovered linguistic
            examples
      • evoked or induced data, e.g.,
          • dialogue-scenario data
          • Wizard-of-Oz data
      • invented or solicited data, e.g.,
          • sample sentences created by linguists
          • acceptability judgements solicited by
            linguists
          • test suites

4
Types of Linguistic Data 2
  • The dichotomy between "real" and "unreal" data
    does not necessarily coincide with the property
    of naturalness.
  • Linguistic examples are often considered
    unnatural. On the other hand, a large corpus
    may contain many sentences that are extremely
    unnatural.
  • Naturalness does not solely depend on the origin
    of the data.

5
Concept of Linguistic Data
  • If we view linguistics as an empirical science,
    pieces of linguistic knowledge have to be
    abstractions over linguistic data.
  • These abstractions are parts of our theories
    about contents and structure of linguistic
    competence and about the processes, constraints
    and preferences that govern linguistic
    performance.
  • Linguistic data are individual utterances, parts
    of utterances or collections of utterances in a
    certain human language (or several languages).
  • The utterances may be represented in written or
    spoken form (or signed), i.e., as textual or
    acoustic signals.

6
Annotated Data
  • Usually these collected utterances are annotated
    with additional information. If an annotation
    does not contain a partial linguistic
    interpretation of the utterances, it may be
    considered part of the data. Annotations that do
    not include linguistic interpretation are, e.g.,
      • judgements of native speakers on the
        acceptability or appropriateness of the
        utterance,
      • information on speaker(s),
      • information on hearer(s) or intended audience,
      • information on the utterance situation (time,
        place, circumstances),
      • information on the published source,
      • typographic information,
      • layout and document structure,
      • textual transcriptions of spoken utterances,
      • transcription of pauses.

7
Interpreted Data
  • Annotations involving a partial linguistic
    interpretation are, e.g.,
      • part-of-speech tags,
      • word sense information,
      • morphosyntactic features of words,
      • constituent structures for phrases or
        sentences,
      • coreference markers,
      • dependency structures,
      • predicate-argument structures,
      • reference identifications for term phrases,
      • information structures within sentences,
      • intonation contours,
      • speech acts,
      • discourse structures.
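
The distinction between annotations that count as part of the data and annotations that carry a partial linguistic interpretation can be sketched as a small record type. This is only an illustration: the class name `Utterance`, the field names `metadata` and `interpretation`, and the sample sentence and tags are assumptions made for the example, not part of the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """One piece of linguistic data plus its two kinds of annotation."""
    text: str
    # Annotations without linguistic interpretation: treated as part of the data.
    metadata: dict = field(default_factory=dict)
    # Annotations with partial linguistic interpretation, keyed by layer name.
    interpretation: dict = field(default_factory=dict)

    def raw_data(self) -> dict:
        """Return the utterance with only its non-interpretive annotations."""
        return {"text": self.text, **self.metadata}


# Hypothetical example: an invented sentence with a speaker label,
# an acceptability judgement, and one interpretive layer (part-of-speech tags).
utt = Utterance(
    text="The cat sleeps",
    metadata={"speaker": "S1", "acceptability": 5, "source": "invented example"},
    interpretation={"pos": ["DET", "NOUN", "VERB"]},
)

# Pair each whitespace-separated token with its part-of-speech tag.
tokens = utt.text.split()
tagged = list(zip(tokens, utt.interpretation["pos"]))
print(tagged)
print(utt.raw_data())
```

Keeping the two kinds of annotation in separate fields makes it possible to exchange or revise an interpretation layer without touching what is regarded as the data itself.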

8
Parameters for Classification
  • language: Spanish, English, German
  • sublanguage/register: regional dialect,
    sociolect, vernacular, professional jargon,
    toddler speech
  • text sort(s): newspaper articles, wire news,
    political speech, control commands
  • subject domain: stock rates, flight reservations
  • type of producers: professional journalist,
    student, radiologist
  • mode of production: spoken, written, signed,
    morsed
  • medium of production: pencil, PC with MS Word,
    dictaphone
  • conditions of production: spontaneous, carefully
    composed, produced under time pressure
  • transmission encoding: raw ASCII code, HTML,
    digitized phone signal, Unicode
  • medium of transmission: telephone, WWW, CB radio
  • storage encoding: raw ASCII code, HTML, AIFF
  • medium of storage: DAT tape, CD-ROM, hard disk
  • mode of presentation: spoken, written, signed
  • medium of presentation: newspaper, radio, book,
    TV show, theater performance
  • type of intended recipients: newspaper reader,
    booking agent, theater audience
  • number of intended recipients: point-to-point,
    multicast, broadcast
  • synchronicity of discourse: synchronous dialogue,
    asynchronous
  • direction: one-way, two-way
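
A few of these parameters can be sketched as a metadata record that lets a data collection be filtered by its classification. The class `DataProfile`, the function `select`, the chosen subset of fields, and the three sample entries are all hypothetical; the parameter values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProfile:
    """A subset of the classification parameters (hypothetical field names)."""
    language: str
    register: str
    text_sort: str
    mode_of_production: str  # e.g. "spoken", "written", "signed"
    storage_encoding: str
    direction: str           # "one-way" or "two-way"


def select(profiles, **criteria):
    """Return the profiles whose fields equal every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, name) == value for name, value in criteria.items())
    ]


# Invented sample collection, for illustration only.
collection = [
    DataProfile("German", "professional jargon", "wire news",
                "written", "HTML", "one-way"),
    DataProfile("English", "vernacular", "control commands",
                "spoken", "AIFF", "two-way"),
    DataProfile("English", "professional jargon", "newspaper articles",
                "written", "HTML", "one-way"),
]

written_english = select(collection, language="English",
                         mode_of_production="written")
print([p.text_sort for p in written_english])
```

A flat record like this treats every parameter as a single value; data that mix several text sorts or languages would need list-valued fields instead.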

9
Criteria for Usefulness
  • In order to be useful, data have to be
    representative:
      • representative of a certain linguistic
        phenomenon,
      • representative of a certain text sort,
      • representative of the expected input to some
        language technology application,
      • representative of the expected output of some
        language technology application,
      • representative of a certain speaker,
      • etc.
  • Can data be representative of an entire
    language?

10
Research Tasks
  • Raw data are easy to obtain today.
  • The demanding task lies in the linguistic
    interpretation.
  • Research tasks:
      • design of annotation schemes for the levels
        of description
      • design of interchange formats and translation
        tools
      • design and implementation of tools for corpus
        annotation
      • partial automation of annotation
      • design of methods and tools for quality
        assurance
      • design of tools for using the data in
        research (retrieval, evaluation, additional
        documentation)
      • design of tools for using the data in
        application development (methods and tools
        for training)
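
Quality assurance for annotation is one of the tasks named on this slide. The slide does not say which methods are meant; one widely used measure, assumed here purely for illustration, is Cohen's kappa, which corrects the observed agreement between two annotators for the agreement expected by chance. The function name `cohen_kappa` and the two label sequences are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented part-of-speech labels from two hypothetical annotators.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance. Kappa covers only the two-annotator case with categorical labels; structured annotations such as trees need other measures.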