How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese

Learn more at: http://www.lrec-conf.org

1
How specialized are specialized corpora?
Behavioral evaluation of corpus
representativeness for Maltese

Jerid Francom (Wake Forest University)
Adam Ussishkin (University of Arizona)
Amy LaCross (University of Arizona)
19 May 2010, O7 (Evaluation of Methodologies),
14.45-15.05
LREC 2010, Mediterranean Conference Center
Valletta, Malta

2
Acknowledgements

Generous contribution of data to this project by
Dr. Albert Gatt (Univ. of Malta)
Statistical expertise from Jeff Berry (Univ. of
Arizona)
Funding from the United States National Science
Foundation (BCS-0715500) to Adam Ussishkin

3
Goals

Issue: For many languages, the quality of
available textual data is less than ideal for
corpus creation in the light of standard sampling
practices.
Propose: Behavioral data can provide a valuable
metric to evaluate corpus resources otherwise
considered specialized.
Case: PsyCoL Maltese Lexical Corpus
Contribute: Novel, cross-discipline metric for
evaluating the quality of language resources

4
Sparse coverage

Most of the world's 5,000-7,000 languages have no
corpus resources
Efforts to fill the gap often exploit the
availability of language data on the web
An Crúbadán project, 446 languages (Scannell,
2007)
McEnery et al. (2006), survey of recent work

5
Sparse coverage

Low-density languages (Borin, 2009): Languages in
which resources exist but in limited
quantity/quality
Limited access to print and/or electronic data
Available primary data may be less than
representative
Weakens assurance that results from low-density
language resources are credible

6
Corpus representativeness

What is a representative corpus?
An externally valid sample of language use
A sample that approximates what the language is.
Full range of structural types (language units)
What are the characteristics of such a sample?
Genre/register
Modality

7
An issue for low-density languages

Standard practice to achieve representativeness
Apply rigorous sampling methods
Collect large amounts of data
Problematic for low-density languages: a
representativeness bottleneck
Lack large amounts of data
Available data is often limited in register,
modality, etc.
Corpus resources are typically specialized

8
Assessing representativeness

How do we know whether we have a representative
sample?
We don't, in an absolute sense.
Faith in survey sampling practices: casting the
net far and wide
Can we be assured we don't have a representative
sample?
Not exactly.
It is logically possible that smaller, less
diverse samples are externally valid for
linguistic units that appear in the collection.

9
Proposal

Need for an external metric.
Current proposal suggests findings from
behavioral experimentation can provide a valuable
metric to evaluate corpus resources.
Exploit the correlation between derived frequency
counts and elicited behavioral reactions
Behavioral data and adjusted frequency (Gries
2008, 2009)
Of particular importance for specialized corpora

10
Behavioral findings

Well-known robust effects for relative frequency
in language processing
Word naming RTs (e.g., Forster & Chambers, 1973)
Lexical decision RTs (e.g., Carroll & White,
1973)
Sentence reading RTs (e.g., MacDonald, 1994)
Word familiarity ratings (e.g., Gernsbacher, 1984)
Log frequency is a good predictor of behavior.

11
Approach

Evaluating corpus representativeness through
behavioral assessment
Derive frequency counts from a specialized
corpus
Elicit behavioral response of participants from
target population
Assess correlation strength: how well do
behavioral responses correlate with corpus
measures?

12
Case study and predictions

Case study
Calculate log frequency of subset of items in a
Maltese lexical corpus
Measure subjective word familiarity ratings of
native speakers of Maltese
Assess relative distribution of the measures
Prediction
Congruence between relative distributions
indicates a representative sample of the language
Mismatches underscore potential sampling issues

13
The specialized corpus

PsyCoL Maltese Lexical Corpus (PMLC) (Francom,
Ussishkin, and Woudstra, 2009)
http://psycol.sbs.arizona.edu/resources/
Online Maltese newspapers, 1998-1999 and 2005-2007
Sources: PsyCoL lab (59.8%) and Dr. Albert Gatt
(40.2%)
3,323,325 total tokens (53,000 unique)
Type/token ratio of about 1.6% (53,000 / 3,323,325)
Typical for low-density languages
Large corpus, still relatively small (cf. British
National Corpus, 100 million; Corpus of
Contemporary American English, 400 million)
Limited in register, modality

14
Linguistic variable to quantify

Because there is little previous quantitative
research on Maltese, the empirical focus of this
investigation was narrowed to
Semitic-origin verbs/binyanim (also known as
form)
Semitic-origin verbs in Maltese conform to the
classical Semitic binyan system (categories based
on morphosyntactic and phonological properties)
Question: How does frequency as measured in our
corpus correlate with behavior? Can the binyan
categories be exploited to provide correlations?

15
Maltese binyanim
Binyan | Function                                  | Prosodic shape | Example
1      | basic active (transitive or intransitive) | CVCVC          | kiser 'he broke'
2      | intensive of 1, transitive of 1           | CVCCVC         | kisser 'he smashed'
3      | transitive of 1                           | CVCVC          | birek 'he blessed'
5      | passive of 2, reflexive of 2              | tCVCCVC        | tkisser 'it got smashed'
6      | passive of 2, reflexive of 3              | tCVCVC         | tkiteb 'he corresponded'
7      | passive of 1, reflexive of 1              | nCVCVC         | nkiser 'it got broken'
8      | passive of 1, reflexive of 1              | CtVCVC         | ftakar 'he remembered'
9      | inchoative, acquisition of a quality      | CCVC           | hmar 'he blushed'
10     | originally inchoative                     | stVCCVC        | stenbah 'to wake'

16
A behavioral task: word familiarity

We devised three tests to measure corpus
representativeness
Each test measured a different aspect of our
corpus counts and our behavioral task.
The behavioral task involved native
Maltese-speakers, who gave subjective word
familiarity ratings for all Semitic-origin
Maltese verbs taken from Aquilina (2000), n = 1536.
Scale from very unfamiliar to very familiar
Shown to be a reliable predictor of lexical
processing (Connine et al. 1990)

17
Word familiarity experiment

Participants
107 native speakers of Maltese
Task
Subjective word familiarity task, online

18
Measuring frequency in the corpus

We then used the PMLC to calculate word frequency
measures for the same set of verbs.
Using regular expression-enabled searching, we
counted token frequency for all verbs occurring
in the PMLC (n = 447).
Frequency was then encoded as a log-based measure.
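
The counting step described above might look roughly like the following sketch. It is not the authors' script: the whole-word pattern, the natural-log base, the function names, and the toy text are all assumptions, since the slide does not give the actual search expressions or the log base. A real search would also need patterns that cover inflected forms of each verb.

```python
import math
import re
from collections import Counter

def count_verb_tokens(text, verbs):
    """Count whole-word, case-insensitive occurrences of each verb form."""
    counts = Counter()
    for verb in verbs:
        # Lookarounds keep 'kiser' from matching inside 'nkiser'.
        pattern = re.compile(r"(?<!\w)" + re.escape(verb) + r"(?!\w)", re.IGNORECASE)
        counts[verb] = len(pattern.findall(text))
    return counts

def log_frequency(counts):
    """Natural log of the raw count, for forms attested at least once."""
    return {verb: math.log(n) for verb, n in counts.items() if n > 0}

# Toy text (hypothetical), using verb forms from the binyan table.
sample_text = "kiser kisser. Kiser, tkisser kiser nkiser"
verbs = ["kiser", "kisser", "tkisser", "birek"]

counts = count_verb_tokens(sample_text, verbs)
log_freqs = log_frequency(counts)
print(dict(counts))
print(log_freqs)
```

Forms with a zero count are dropped before taking the log, which mirrors the slide's restriction to verbs that actually occur in the corpus.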

19
Three tests

Next, we conducted three distinct statistical
analyses to assess correlation between these
corpus measures and the results of our word
familiarity experiment:
1. Statistical regression between corpus log
frequency and behavioral data.
2. Binned groups by frequency to determine
whether any correlation is found.
3. Binned items by binyan to determine whether
any correlation is found.

20
1. Statistical regression

We found a weak correlation (r = .14). These
results show at best a trend toward correlation,
and suggest that familiarity ratings likely do
not predict word frequency.
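
As a rough illustration of this first test, the sketch below computes a Pearson correlation between log frequency and mean familiarity rating per item. The data are invented toy values and the function name is an assumption; the slide does not say which software or regression model was used. If the reported value is Pearson's r, then r = .14 corresponds to about 2% shared variance (.14 squared is roughly .02).

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values (hypothetical): one log frequency and one mean rating per verb.
toy_log_freq = [1.0, 2.0, 3.0, 4.0]
toy_familiarity = [2.0, 1.0, 4.0, 3.0]

toy_r = pearson_r(toy_log_freq, toy_familiarity)
shared_variance = 0.14 ** 2  # variance explained if the slide's value is Pearson's r

print(f"toy r = {toy_r:.2f}")
print(f"r = .14 -> r squared = {shared_variance:.3f}")
```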

21
2. Binning by frequency

Binning into two bands shows a correlation

Binning into three bands also shows a correlation

22
2. Binning by frequency

An LMER analysis of each binning (2 groups and 3
groups) shows significance
All contrasts for two-bin intervals
(High/Low = 4.2, t = 2.0) and three-bin intervals
(High/Mid = 7.1, t = 3.9; Mid/Low = 7.0, t = 2.2) were
significant.
These results support the hypothesis that
behavior and corpus measures are correlated.
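
The binning idea can be sketched as follows. This is a simplified stand-in, not the LMER analysis reported above: it only splits items into frequency bands and compares mean familiarity per band, with no significance test. The equal-sized rank bands, the function name, and the toy data are assumptions, since the slides do not say how band boundaries were chosen.

```python
from statistics import mean

def mean_rating_by_band(items, n_bands):
    """Split (log_freq, rating) pairs into rank-based bands, lowest frequency
    first, and return the mean rating of each band."""
    if n_bands < 1 or n_bands > len(items):
        raise ValueError("n_bands must be between 1 and the number of items")
    ranked = sorted(items, key=lambda item: item[0])
    size, extra = divmod(len(ranked), n_bands)
    means = []
    start = 0
    for band in range(n_bands):
        # The first `extra` bands take one additional item each.
        end = start + size + (1 if band < extra else 0)
        means.append(mean([rating for _, rating in ranked[start:end]]))
        start = end
    return means

# Toy data (hypothetical): (log frequency, mean familiarity rating) per verb.
toy_items = [(0.0, 3.0), (0.7, 4.0), (1.4, 4.0), (2.1, 5.0), (2.8, 6.0), (3.5, 6.0)]

two_bands = mean_rating_by_band(toy_items, 2)    # [low, high]
three_bands = mean_rating_by_band(toy_items, 3)  # [low, mid, high]
print(two_bands)
print(three_bands)
```

Rising means from the low band to the high band are the pattern the binned analysis looks for; a mixed-effects model would additionally account for variation across participants and items.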

23
3. Binning by binyan

Earlier and ongoing work (Frost et al. 1997,
1998, 2000; Ussishkin et al. in progress) shows
binyan effects in Hebrew in both visual and
auditory modalities, so Maltese could be expected
to show similar effects.
Our goal here is to measure whether verbs, when
grouped by binyan, show a correlation between
word frequency measures and word familiarity
ratings.

24
3. Binning by binyan

Only binyanim 1, 2, 5, and 7 were analyzed; binyanim
3, 6, 8, 9, and 10 were not included in the
analyses because they are so sparsely populated

25
3. Binning by binyan

Word frequency results: significant contrasts
found between Binyanim 7 and 2 (β = .54, t = 6.0)
and between Binyanim 7 and 5 (β = 1.15, t = -2.2).
Word familiarity results: no significant
contrasts found.

[Figures: Binyan by word frequency; Binyan by word familiarity]

26
General assessment

The results show that verb frequency
distributions in the PMLC pattern to some degree
with the psychological representations of native
speakers (the representative population)
On the surface, this suggests the PMLC is on the right
track, but underscores the specialized nature of
the corpus
However, a response bias in the word familiarity
task may play a part in the mismatches
Ceiling effect may have contributed to lower
correlation scores

27
General assessment

Reasons to be optimistic about the verb
distributions in the PMLC
Distribution of verb count/ frequency (Zipf,
1949)
Distribution of word length/ frequency (Li, 1992)
Both measures trend as expected for
representative samples
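
One common way to check the first of these distributions is to fit a line to log frequency against log rank; a slope near -1 is the classic Zipf pattern. The sketch below is an assumption about how such a check could be run, not the procedure used for the PMLC, and the frequency list is an idealized toy example.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) regressed on log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Idealized toy counts (hypothetical): frequency = 1200 / rank.
toy_frequencies = [1200, 600, 400, 300, 240, 200]
toy_slope = zipf_slope(toy_frequencies)
print(f"log-log slope: {toy_slope:.3f}")
```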

28
Conclusion

Novel methodology: direct comparison between
corpus resource and behavior.
Highlighting a robust effect from
psycholinguistics (frequency of linguistic units
predicts behavior).
We predicted the opposite could occur (behavior
can be used to assess corpus frequency counts); this
provides a way to validate resources for
low-density languages (LDLs).
This approach encourages cross-discipline
endeavors for resource development and
theoretical investigation.