1
Methodologies for Evaluating Dialog Structure Annotation
  • Ananlada Chotimongkol
  • Presented at Dialogs on Dialogs Reading Group
  • 27 January 2006

2
Dialog structure annotation evaluation
  • How good is the annotated dialog structure?
  • Evaluation methodologies
  • Qualitative evaluation (humans rate how good it
    is)
  • Compare against a gold standard (usually created
    by a human)
  • Evaluate the end product (task-based evaluation)
  • Evaluate the principles used
  • Inter-annotator agreement (comparing subjective
    judgment when there is no single correct answer)

3
Choosing evaluation methodologies
  • Depends on what kind of information is being
    annotated
  • Categorical annotation
  • e.g. dialog act
  • Boundary annotation
  • e.g. discourse segment
  • Structural annotation
  • e.g. rhetorical structure

4
Categorical annotation evaluation
  • Cochran's Q test
  • Test whether the number of coders assigning the
    same label at each position is randomly
    distributed
  • Doesn't directly indicate the degree of agreement
  • Percentage of agreement
  • Measures how often the coders agree
  • Doesn't account for agreement by chance
  • Kappa coefficient [Carletta, 1996]
  • Measures pairwise agreement among coders,
    correcting for expected chance agreement

5
Kappa statistic
  • The Kappa coefficient (K) measures pairwise agreement
    among coders on categorical judgments:
    K = (P(A) - P(E)) / (1 - P(E))
  • P(A) is the proportion of times the coders agree
  • P(E) is the proportion of times they are expected
    to agree by chance
  • K > 0.8 indicates substantial agreement
  • 0.67 < K < 0.8 indicates moderate agreement
  • Chance expected agreement is difficult to calculate
    in some cases
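As a concrete illustration (not from the slides), a minimal sketch of the two-coder kappa computation in Python, with hypothetical dialog-act labels:

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        # Pairwise kappa for two coders: K = (P(A) - P(E)) / (1 - P(E)).
        n = len(labels_a)
        # P(A): observed proportion of items on which the two coders agree.
        p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # P(E): chance agreement from each coder's label distribution.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
                  for c in set(labels_a) | set(labels_b))
        return (p_a - p_e) / (1 - p_e)

    # Hypothetical example: two coders labelling six utterances with dialog acts.
    coder1 = ["inform", "request", "inform", "confirm", "inform", "request"]
    coder2 = ["inform", "request", "confirm", "confirm", "inform", "inform"]
    print(round(cohen_kappa(coder1, coder2), 3))   # ~0.478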

6
Boundary annotation evaluation
  • Use the Kappa coefficient
  • Don't compare the segments directly; instead compare
    the decision on placing each boundary
  • At each eligible point, make a binary decision
    whether to mark it as a boundary or non-boundary
    (see the sketch after this list)
  • However, the Kappa coefficient doesn't accommodate
    near-miss boundaries
  • Redefine the matching criterion, e.g. also count a
    near-miss as a match
  • Use other metrics, e.g. probabilistic error metrics
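A small illustration (hypothetical boundary positions, reusing the cohen_kappa helper sketched above) of recasting boundary annotation as per-position binary decisions:

    def boundary_decisions(boundary_positions, n_points):
        # One binary decision per eligible point: 'B' if a boundary was placed there.
        return ["B" if i in boundary_positions else "NB" for i in range(n_points)]

    # Hypothetical: two coders segmenting a dialog with 12 eligible boundary points.
    coder1 = boundary_decisions({3, 7, 10}, 12)
    coder2 = boundary_decisions({3, 8, 10}, 12)  # near-miss at 7 vs. 8 counts as disagreement
    print(round(cohen_kappa(coder1, coder2), 3))  # ~0.556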

7
Probabilistic error metrics
  • Pk [Beeferman et al., 1999]
  • Measures how likely two time points (a fixed distance
    apart) are classified inconsistently, i.e. into the
    same segment by one annotation and into different
    segments by the other
  • A small Pk means a high degree of agreement
  • WindowDiff (WD) [Pevzner and Hearst, 2002]
  • Counts the number of intervening topic breaks
    between the two time points
  • Penalizes any difference in the number of segment
    boundaries between the two time points (a sketch
    follows this list)
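A rough sketch of one common formulation of WindowDiff (not the authors' code; the window size k is typically set to about half the average reference segment length):

    def window_diff(ref, hyp, k):
        # ref/hyp are boundary indicator sequences, e.g. [0, 0, 1, 0, ...].
        # Slide a window of size k and penalize every position where the two
        # segmentations disagree on the number of boundaries inside the window.
        n = len(ref)
        errors = sum(
            abs(sum(ref[i:i + k]) - sum(hyp[i:i + k])) > 0
            for i in range(n - k)
        )
        return errors / (n - k)

    # Hypothetical 10-unit dialog with one near-miss boundary (index 5 vs. 4).
    reference  = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
    hypothesis = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
    print(round(window_diff(reference, hypothesis, 3), 3))   # ~0.286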

8
Structural annotation evaluation
  • Cascaded approach
  • Evaluate one level at a time
  • Evaluate the annotation of the higher level only
    if the annotation of the lower level is agreed on
  • Example: nested game annotation in Map Task
    [Carletta et al., 1997]
  • Redefine matching criteria for structural
    annotation [Flammia and Zue, 1995]
  • Segment A matches segment B if A contains B
  • Segment A in annotation i matches with segments
    in annotation j if the segments in annotation j
    exclude segment A
  • The agreement criterion isn't symmetric
  • Flatten the hierarchical structure
  • Flatten the hierarchy into overlapping spans
    (see the sketch after this list)
  • Compute agreement on the spans or the spans' labels
  • Example: RST annotation [Marcu et al., 1999]
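A minimal sketch (hypothetical data structures, not from the cited papers) of flattening a hierarchical annotation into labelled spans so that span-level agreement can be computed:

    def flatten(node, start=0):
        # Turn a nested (label, children-or-leaf-length) annotation
        # into a list of (start, end, label) spans over dialog units.
        label, children = node
        if isinstance(children, int):            # leaf covering `children` units
            return children, [(start, start + children, label)]
        length, spans = 0, []
        for child in children:
            child_len, child_spans = flatten(child, start + length)
            length += child_len
            spans.extend(child_spans)
        return length, [(start, start + length, label)] + spans

    # Hypothetical task structure: one task with two sub-tasks over 7 dialog units.
    tree = ("task", [("sub-task:query", 4), ("sub-task:confirm", 3)])
    _, spans = flatten(tree)
    print(spans)   # agreement is then computed over matching (start, end) spans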

9
Form-based dialog structure
  • Describes a dialog structure using a task
    structure: a hierarchical organization of domain
    information
  • Task: a subset of a dialog that has a specific
    goal
  • Sub-task
  • A decomposition of a task
  • Corresponds to one action (the process that uses
    related pieces of information together to create
    a new piece of information or a new dialog state)
  • Concept: a word or a group of words that
    captures information necessary for performing an
    action
  • The task structure is domain-dependent

10
An example of form-based structure annotation
  <task name=...>
    <sub-task name=...>
      word1 word2 <concept name=...>word3</concept> word4 ... wordn
      word1 <concept name=...>word2</concept> word3 word4 ... wordn
    </sub-task>
    <sub-task name=...>
    </sub-task>
  </task>

11
Annotation experiment
  • Goal: to verify that the form-based dialog
    structure can be understood and applied by other
    annotators
  • The subjects were asked to identify the task
    structure of the dialogs in two domains
  • Air travel planning domain
  • Map reading domain
  • A different set of labels is needed for each domain
  • Equivalent to designing domain-specific labels from
    the definition of the dialog structure components

12
Annotation procedure
  • The subjects study an annotation guideline
  • Definition of the task structure
  • Examples from other domains (bus schedule and UAV
    flight simulation)
  • For each domain, the subject study the
    transcription of 2-3 dialogs
  • Create a set of labels for annotating the task
    structure
  • Annotate the given dialogs with the set of labels
    designed in 1)

13
Issues in task structure annotation evaluation
  • There is more than one acceptable annotation
  • Similar to MT evaluation
  • But difficult to obtain multiple references
  • The tag sets used by two annotators may not be the
    same
  • <time>two thirty</time>
  • <time><hour>two</hour> <min>thirty</min></time>
  • Difficult to define matching criteria
  • Mapping equivalent labels between two tag sets is
    subjective (and may not be possible)

14
Cross-annotator correction
  • Ask a different annotator (the 2nd annotator) to
    judge the annotation and correct the parts that
    don't conform to the guideline
  • If the 2nd annotator agrees with the 1st one, he
    will make no correction
  • The 2nd annotator's own annotation may still be
    different, because there can be more than one
    annotation that conforms to the guideline

15
Cross-annotator correction (2)
  • Pros
  • Easier to evaluate agreement, since the annotations
    are based on the same tag set
  • Allows more than one acceptable annotation
  • Cons
  • Needs another annotator and takes time
  • Another subjective judgment
  • Need to measure the amount of change made by the 2nd
    annotator

16
Cross-annotators
  • Who should the 2nd annotator be?
  • Another subject who also did the annotation
  • Biased toward his own annotation?
  • Another subject who studied the guideline but
    didn't do his/her own annotation
  • May not have thought about the structure thoroughly
  • Experts
  • Can also measure annotation accuracy using an
    expert annotation as a reference

17
How to quantify the amount of correction
  • Edit distance from the original annotation
    (a sketch follows this list)
  • For structural annotation, edit operations have to
    be redefined
  • A lower number means higher agreement, but which
    range of values is acceptable?
  • Inter-annotator agreement
  • Can apply structural annotation evaluation
  • The agreement number is meaningful and can be
    compared across different domains
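As a baseline for the edit-distance idea, a minimal sketch of plain Levenshtein distance over flattened annotation tokens (illustrative only; edit operations over the hierarchical structure would still need to be defined):

    def edit_distance(a, b):
        # Minimum number of insertions, deletions and substitutions to turn a into b.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1,              # delete x
                                curr[j - 1] + 1,          # insert y
                                prev[j - 1] + (x != y)))  # substitute x with y
            prev = curr
        return prev[-1]

    # Hypothetical: original vs. corrected annotation as flattened tag/word tokens.
    original  = ["<task>", "<sub-task>", "word1", "word2", "</sub-task>", "</task>"]
    corrected = ["<task>", "<sub-task>", "word1", "</sub-task>",
                 "<sub-task>", "word2", "</sub-task>", "</task>"]
    print(edit_distance(original, corrected))   # 2 edits (two insertions)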

18
Cross-annotation agreement
  • Uses a similar approach to [Marcu et al., 1999]
  • Flatten the hierarchy into overlapping spans
  • Compute agreement on the labels of the spans
    (task, sub-task, concept labels)
  • Issues
  • Many possible spans have no label (especially for
    concept annotation)
  • How to calculate P(E) when new concepts are added

19
Objective annotation evaluation
  • Make it more comparable to other works
  • Easier to evaluation, dont need the 2nd
    annotator
  • Label-insensitive
  • 3 labels lttaskgt, ltsub-taskgt, ltconceptgt
  • May also consider the level of sub-tasks e.g.
    ltsub-task1gt, ltsub-task2gt
  • Kappa artificially high
  • Add qualitative analysis on what they dont agree
    on

20
References
  • J. Carletta, "Assessing agreement on
    classification tasks: the kappa statistic,"
    Computational Linguistics, vol. 22, pp. 249-254,
    1996.
  • D. Beeferman, A. Berger, and J. Lafferty,
    "Statistical Models for Text Segmentation,"
    Machine Learning, vol. 34, pp. 177-210, 1999.
  • L. Pevzner and M. A. Hearst, "A critique and
    improvement of an evaluation metric for text
    segmentation," Computational Linguistics, vol.
    28, pp. 19-36, 2002.
  • J. Carletta, S. Isard, G. Doherty-Sneddon, A.
    Isard, J. C. Kowtko, and A. H. Anderson, "The
    reliability of a dialogue structure coding
    scheme," Computational Linguistics, vol. 23, pp.
    13-31, 1997.
  • G. Flammia and V. Zue, "Empirical evaluation of
    human performance and agreement in parsing
    discourse constituents in spoken dialogue," in
    Proceedings of Eurospeech, Madrid, Spain, 1995.
  • D. Marcu, E. Amorrortu, and M. Romera,
    "Experiments in constructing a corpus of
    discourse trees," in the Proceedings of the ACL
    Workshop on Standards and Tools for Discourse
    Tagging, College Park, MD, 1999.

21
Matching criteria
  • Exact match (pairwise)
  • Partial match (pairwise)
  • Agree with majority (pool of coders)
  • Agree with consensus (pool of coders)