How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese

1
How specialized are specialized corpora?
Behavioral evaluation of corpus
representativeness for Maltese
  • Jerid Francom (Wake Forest University)
  • Adam Ussishkin (University of Arizona)
  • Amy LaCross (University of Arizona)
  • 19 May 2010, O7 (Evaluation of Methodologies),
    14.45-15.05
  • LREC 2010, Mediterranean Conference Center
  • Valletta, Malta

2
Acknowledgements
  • Generous contribution of data to this project by
    Dr. Albert Gatt (Univ. of Malta)
  • Statistical expertise from Jeff Berry (Univ. of
    Arizona)
  • Funding from the United States National Science
    Foundation (BCS-0715500) to Adam Ussishkin

3
Goals
  • Issue: For many languages, the quality of
    available textual data is less than ideal for
    corpus creation in the light of standard sampling
    practices.
  • Proposal: Behavioral data can provide a valuable
    metric to evaluate corpus resources otherwise
    considered specialized.
  • Case: PsyCoL Maltese Lexical Corpus
  • Contribution: Novel, cross-discipline metric for
    evaluating the quality of language resources

4
Sparse coverage
  • Most of the world's 5,000-7,000 languages have no
    corpus resources
  • Efforts to fill the gap often exploit the
    availability of language data on the web
  • An Crúbadán project, 446 languages (Scannell,
    2007)
  • McEnery et al. (2006): survey of recent work

5
Sparse coverage
  • Low-density languages (Borin, 2009): languages in
    which resources exist but in limited
    quantity/quality
  • Limited access to print and/or electronic data
  • Available primary data may be
    less than representative
  • Weakens assurance that results from low-density
    language resources are credible

6
Corpus representativeness
  • What is a representative corpus?
  • An externally valid sample of language use
  • A sample that approximates what the language is.
  • Full range of structural types (language units)
  • What are the characteristics of such a sample?
  • Genre/register
  • Modality

7
An issue for low-density languages
  • Standard practice to achieve representativeness
  • Apply rigorous sampling methods
  • Collect large amounts of data
  • Problematic for low-density languages: a
    representativeness bottleneck
  • Lack large amounts of data
  • Available data is often limited in register,
    modality, etc.
  • Corpus resources are typically specialized

8
Assessing representativeness
  • How do we know whether we have a representative
    sample?
  • We don't, in an absolute sense.
  • Faith in survey sampling practices: casting the
    net far and wide
  • Can we be assured we don't have a representative
    sample?
  • Not exactly.
  • It is logically possible that smaller, less
    diverse samples are externally valid for
    linguistic units that appear in the collection.

9
Proposal
  • Need for an external metric.
  • Current proposal suggests findings from
    behavioral experimentation can provide a valuable
    metric to evaluate corpus resources.
  • Exploit the correlation between derived frequency
    counts and elicited behavioral reactions
  • Behavioral data and adjusted frequency (Gries
    2008, 2009)
  • Of particular importance for specialized corpora

10
Behavioral findings
  • Well-known robust effects for relative frequency
    in language processing
  • Word naming RTs (e.g., Forster & Chambers, 1973)
  • Lexical decision RTs (e.g., Carroll & White,
    1973)
  • Sentence reading RTs (e.g., MacDonald, 1994)
  • Word familiarity ratings (e.g., Gernsbacher 1984)
  • Log frequency is a good predictor of behavior.

11
Approach
  • Evaluating corpus representativeness through
    behavioral assessment
  • Derive frequency counts from a specialized
    corpus
  • Elicit behavioral response of participants from
    target population
  • Assess correlation strength: how well do
    behavioral responses correlate with corpus
    measures?

12
Case study and predictions
  • Case study
  • Calculate log frequency of subset of items in a
    Maltese lexical corpus
  • Measure subjective word familiarity ratings of
    native speakers of Maltese
  • Assess relative distribution of the measures
  • Prediction
  • Congruence between relative distributions
    indicates a representative sample of the language
  • Mismatches underscore potential sampling issues

13
The specialized corpus
  • PsyCoL Maltese Lexical Corpus (PMLC) (Francom,
    Ussishkin, and Woudstra, 2009):
    http://psycol.sbs.arizona.edu/resources/
  • Online Maltese newspapers, 1998-1999 and
    2005-2007; PsyCoL lab (59.8%) and Dr. Albert Gatt
    (40.2%)
  • 3,323,325 total tokens (53,000 unique);
    type/token ratio of 1.6%
  • Typical for low-density languages
  • Large corpus, still relatively small (cf. British
    National Corpus, 100 million; Corpus of
    Contemporary American English, 400 million)
  • Limited in register, modality
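
The ratio quoted on this slide can be checked from the two counts. The sketch below is only arithmetic on the slide's own figures; note that 53,000 is a rounded count, and that the 1.6 figure matches unique types divided by total tokens, expressed as a percentage.

```python
# Counts as quoted on the slide (the unique-type count is rounded there).
total_tokens = 3_323_325
unique_types = 53_000

# Types divided by tokens: the share of the corpus made up of distinct forms.
type_token_ratio = unique_types / total_tokens

# The inverse view: how many tokens occur, on average, per distinct form.
tokens_per_type = total_tokens / unique_types

print(f"type/token ratio: {type_token_ratio:.1%}")  # 1.6%
print(f"tokens per type:  {tokens_per_type:.1f}")   # 62.7
```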

14
Linguistic variable to quantify
  • Because there is little previous quantitative
    research on Maltese, the empirical focus of this
    investigation was narrowed to
  • Semitic-origin verbs/binyanim (also known as
    form)
  • Semitic-origin verbs in Maltese conform to the
    classical Semitic binyan system (categories based
    on morphosyntactic and phonological properties)
  • Question: How does frequency as measured in our
    corpus correlate with behavior? Can the binyan
    categories be exploited to provide correlations?

15
Maltese binyanim
Binyan  Function                                   Prosodic shape  Example
1       basic active (transitive or intransitive)  CVCVC           kiser 'he broke'
2       intensive of 1, transitive of 1            CVCCVC          kisser 'he smashed'
3       transitive of 1                            CVCVC           birek 'he blessed'
5       passive of 2, reflexive of 2               tCVCCVC         tkisser 'it got smashed'
6       passive of 2, reflexive of 3               tCVCVC          tkiteb 'he corresponded'
7       passive of 1, reflexive of 1               nCVCVC          nkiser 'it got broken'
8       passive of 1, reflexive of 1               CtVCVC          ftakar 'he remembered'
9       inchoative, acquisition of a quality       CCVC            ħmar 'he blushed'
10      originally inchoative                      stVCCVC         stenbaħ 'to wake'

16
A behavioral task: word familiarity
  • We devised three tests to measure corpus
    representativeness
  • Each test measured a different aspect of our
    corpus counts and our behavioral task.
  • The behavioral task involved native
    Maltese-speakers, who gave subjective word
    familiarity ratings for all Semitic-origin
    Maltese verbs taken from Aquilina (2000), n = 1536.
  • Scale from very unfamiliar to very familiar
  • Shown to be a reliable predictor of lexical
    processing (Connine et al. 1990)

17
Word familiarity experiment
  • Participants
  • 107 native speakers of Maltese
  • Task
  • Subjective word familiarity task, online

18
Measuring frequency in the corpus
  • We then used the PMLC to calculate word frequency
    measures for the same set of verbs.
  • Using regular expression-enabled searching, we
    counted token frequency for all verbs occurring
    in the PMLC (n = 447).
  • Frequency was then encoded as a log-based measure.
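
The counting step described above might look roughly like the following. This is a minimal sketch, not the authors' actual scripts: the function names, the toy text, and the one-pattern-per-verb layout are hypothetical, and the natural log is an assumption, since the slides do not state which log base was used.

```python
import math
import re
from collections import Counter


def count_verb_tokens(text, verb_patterns):
    """Count corpus tokens that fully match each verb's regular expression."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for lemma, pattern in verb_patterns.items():
        regex = re.compile(pattern)
        counts[lemma] = sum(1 for tok in tokens if regex.fullmatch(tok))
    return counts


def log_frequency(counts):
    """Natural log of each raw count; verbs with zero tokens are left out."""
    return {lemma: math.log(n) for lemma, n in counts.items() if n > 0}


# Toy text and hypothetical patterns, for illustration only.
toy_text = "kiser kisser kiser tkisser kiser nkiser"
patterns = {"kiser": r"kiser", "kisser": r"kisser", "birek": r"birek"}

counts = count_verb_tokens(toy_text, patterns)
log_freq = log_frequency(counts)
```

Dropping unattested verbs before taking the log mirrors the gap on the slides between the 1536 rated verbs and the 447 found in the corpus; a smoothed count (for example log(n + 1)) would be an alternative design, but nothing in the slides indicates it was used.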

19
Three tests
  • Next, we conducted three distinct statistical
    analyses to assess correlation between these
    corpus measures and the results of our word
    familiarity experiment
  • 1. Statistical regression between corpus log
    frequency and behavioral data.
  • 2. Binned groups by frequency to determine
    whether any correlation is found.
  • 3. Binned items by binyan to determine whether
    any correlation is found.

20
1. Statistical regression
  • We found a weak correlation (r = .14). These
    results show at best a trend toward correlation
    and suggest that familiarity ratings likely do
    not predict word frequency.
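
To make the size of that effect concrete, here is a small sketch of a Pearson correlation between log frequency and mean familiarity rating. It assumes the reported value is Pearson's r, which the slide does not state explicitly; the function name and toy data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy, perfectly linear data (not the study's data): r is 1 here.
toy_log_freq = [0.0, 1.0, 2.0, 3.0]
toy_familiarity = [2.0, 4.0, 6.0, 8.0]
r_toy = pearson_r(toy_log_freq, toy_familiarity)

# Under the Pearson reading, r = .14 leaves about 2% of variance shared.
shared_variance = 0.14 ** 2
```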

21
2. Binning by frequency
  • Binning into two bands shows a correlation
  • Binning into three bands also shows a correlation

22
2. Binning by frequency
  • An LMER analysis of each binning (2 groups and 3
    groups) shows significance
  • All contrasts for two-bin intervals
    (High/Low = 4.2, t = 2.0) and three-bin intervals
    (High/Mid = 7.1, t = 3.9; Mid/Low = 7.0, t = 2.2) were
    significant.
  • These results support the hypothesis that
    behavior and corpus measures are correlated.
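
The binning idea can be sketched as follows. The slides do not say how band boundaries were chosen, so equal-count bands are an assumption here, and the sketch only compares mean ratings per band; it does not reproduce the LMER analysis. All names and data are hypothetical.

```python
from statistics import mean


def bin_by_frequency(items, n_bins):
    """Split (log_freq, rating) pairs into n_bins equal-count bands, lowest first."""
    if n_bins < 1 or len(items) < n_bins:
        raise ValueError("need at least one item per band")
    ordered = sorted(items, key=lambda pair: pair[0])
    size, extra = divmod(len(ordered), n_bins)
    bands, start = [], 0
    for i in range(n_bins):
        # The first `extra` bands take one additional item each.
        end = start + size + (1 if i < extra else 0)
        bands.append(ordered[start:end])
        start = end
    return bands


def mean_rating_per_band(items, n_bins):
    """Mean familiarity rating in each frequency band, lowest band first."""
    return [mean(rating for _, rating in band)
            for band in bin_by_frequency(items, n_bins)]


# Toy (log frequency, familiarity rating) pairs, for illustration only.
toy_items = [(0.0, 3.0), (0.7, 3.5), (1.4, 4.0),
             (2.1, 5.0), (2.8, 5.5), (3.5, 6.0)]

two_band_means = mean_rating_per_band(toy_items, 2)
three_band_means = mean_rating_per_band(toy_items, 3)
```

Band means that rise with frequency, as in this toy data, are the pattern the binned analysis looks for.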

23
3. Binning by binyan
  • Earlier and ongoing work (Frost et al. 1997,
    1998, 2000; Ussishkin et al. in progress) shows
    binyan effects in Hebrew in both visual and
    auditory modalities, so Maltese could be expected
    to show similar effects.
  • Our goal here is to measure whether verbs, when
    grouped by binyan, show a correlation between
    word frequency measures and word familiarity
    ratings.

24
3. Binning by binyan
  • Only binyanim 1, 2, 5, and 7 were analyzed; binyanim
    3, 6, 8, 9, and 10 were not included in the
    analyses because they are so sparsely populated

25
3. Binning by binyan
  • Word frequency results: significant contrasts
    found between Binyanim 7 and 2 (β = .54, t = 6.0)
    and between Binyanim 7 and 5 (β = 1.15, t = -2.2).
  • Word familiarity results: no significant
    contrasts found.
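
Grouping by binyan before comparing measures can be sketched as below. This is an illustrative summary step only: it computes a per-binyan mean for whichever measure is passed in and does not reproduce the contrast statistics reported above. The record layout and names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Binyanim retained for analysis according to the slides.
ANALYZED_BINYANIM = frozenset({1, 2, 5, 7})


def mean_by_binyan(records, analyzed=ANALYZED_BINYANIM):
    """Mean of a measure per binyan, skipping binyanim outside `analyzed`."""
    grouped = defaultdict(list)
    for binyan, value in records:
        if binyan in analyzed:
            grouped[binyan].append(value)
    return {b: mean(vals) for b, vals in sorted(grouped.items())}


# Toy (binyan, measure) pairs, for illustration only.
toy_records = [(1, 2.0), (1, 4.0), (2, 1.0), (7, 3.0), (9, 5.0)]
toy_means = mean_by_binyan(toy_records)
```

The same function can be run once on log frequencies and once on familiarity ratings, which gives the two per-binyan profiles that are being compared.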

[Figures: binyan by word frequency; binyan by word familiarity]

26
General assessment
  • The results show that verb frequency
    distributions in the PMLC pattern to some degree
    with the psychological representations of native
    speakers (the representative population)
  • On the surface, this suggests the PMLC is on the
    right track, but it underscores the specialized
    nature of the corpus
  • However, a response bias in the word familiarity
    task may play a part in the mismatches
  • Ceiling effect may have contributed to lower
    correlation scores

27
General assessment
  • Reasons to be optimistic about the verb
    distributions in the PMLC
  • Distribution of verb count/frequency (Zipf,
    1949)
  • Distribution of word length/frequency (Li, 1992)
  • Both measures trend as expected for
    representative samples
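
One common way to check a Zipf-like trend is to fit a line to log frequency against log rank and look at the slope. The sketch below assumes that reading; the slides do not say how the distributions were inspected, and the function name and toy counts are hypothetical.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) on log(rank), rank 1 = most frequent."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Idealized toy counts where frequency is proportional to 1 / rank.
toy_frequencies = [1200 / rank for rank in range(1, 51)]
toy_slope = zipf_slope(toy_frequencies)
```

For an idealized Zipfian list like the toy one, the slope comes out at about -1; real corpus counts would only approximate this.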

28
Conclusion
  • Novel methodology: direct comparison between
    corpus resource and behavior.
  • Highlighting a robust effect from
    psycholinguistics (frequency of linguistic units
    predicts behavior).
  • We predicted the opposite could occur; this
    provides a way to validate low-density language
    (LDL) resources.
  • This approach encourages cross-discipline
    endeavors for resource development and
    theoretical investigation.

29
  • Thank you very much!
  • Grazzi ħafna!