Title: Methodologies for Evaluating Dialog Structure Annotation
1. Methodologies for Evaluating Dialog Structure Annotation
- Ananlada Chotimongkol
- Presented at Dialogs on Dialogs Reading Group
- 27 January 2006
2. Dialog structure annotation evaluation
- How good is the annotated dialog structure?
- Evaluation methodologies
  - Qualitative evaluation (humans rate how good it is)
  - Compare against a gold standard (usually created by a human)
  - Evaluate the end product (task-based evaluation)
  - Evaluate the principles used
  - Inter-annotator agreement (comparing subjective judgments when there is no single correct answer)
3. Choosing evaluation methodologies
- Depends on what kind of information is being annotated
- Categorical annotation
  - e.g. dialog act
- Boundary annotation
  - e.g. discourse segment
- Structural annotation
  - e.g. rhetorical structure
4. Categorical annotation evaluation
- Cochran's Q test
  - Tests whether the number of coders assigning the same label at each position is randomly distributed (see the sketch after this slide)
  - Doesn't directly indicate the degree of agreement
- Percentage of agreement
  - Measures how often the coders agree
  - Doesn't account for agreement by chance
- Kappa coefficient (Carletta, 1996)
  - Measures pairwise agreement among coders, correcting for expected chance agreement
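A minimal sketch of how Cochran's Q could be computed for this setting, assuming positions are treated as blocks and coders as treatments; the 0/1 table and the label of interest are hypothetical, and scipy's chi-squared survival function supplies the p-value.

    # Sketch of Cochran's Q: rows are annotated positions, columns are coders,
    # and a cell is 1 if that coder assigned the label of interest at that position.
    from scipy.stats import chi2

    def cochrans_q(table):
        """Cochran's Q statistic and p-value for a positions-by-coders 0/1 table."""
        k = len(table[0])                                          # number of coders
        col = [sum(row[j] for row in table) for j in range(k)]     # per-coder totals
        row = [sum(r) for r in table]                              # per-position totals
        n = sum(row)
        q = (k - 1) * (k * sum(c * c for c in col) - n * n) / (k * n - sum(r * r for r in row))
        return q, chi2.sf(q, df=k - 1)

    # Hypothetical decisions of three coders at five positions.
    decisions = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
    q, p = cochrans_q(decisions)
    print(f"Q = {q:.2f}, p = {p:.2f}")  # a large p is consistent with the null hypothesis
                                        # (no systematic coder differences); this still does
                                        # not quantify the degree of agreement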
5. Kappa statistic
- The Kappa coefficient (K) measures pairwise agreement among coders on a categorical judgment: K = (P(A) - P(E)) / (1 - P(E))
- P(A) is the proportion of times the coders agree
- P(E) is the proportion of times they are expected to agree by chance
- K > 0.8 indicates substantial agreement
- 0.67 < K < 0.8 indicates moderate agreement
- Chance expected agreement is difficult to calculate in some cases
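As a concrete illustration of the definitions above, here is a minimal two-coder kappa sketch; the dialog-act labels are hypothetical, and P(E) is estimated from each coder's label distribution (the multi-coder treatment in Carletta, 1996 follows the same idea).

    from collections import Counter

    def kappa(coder1, coder2):
        """K = (P(A) - P(E)) / (1 - P(E)) for two coders' label sequences."""
        n = len(coder1)
        p_a = sum(a == b for a, b in zip(coder1, coder2)) / n      # observed agreement P(A)
        c1, c2 = Counter(coder1), Counter(coder2)
        p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(coder1) | set(coder2))  # chance P(E)
        return (p_a - p_e) / (1 - p_e)

    # Hypothetical dialog-act labels assigned by two coders to the same 8 utterances.
    a = ["inform", "request", "inform", "confirm", "inform", "request", "confirm", "inform"]
    b = ["inform", "request", "inform", "inform",  "inform", "request", "confirm", "confirm"]
    print(f"K = {kappa(a, b):.2f}")   # 0.60 here: below the 0.67 threshold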
6. Boundary annotation evaluation
- Use the Kappa coefficient
  - Don't compare the segments directly; instead compare the decision on placing each boundary
  - At each eligible point, make a binary decision whether to annotate it as a boundary or a non-boundary (see the sketch after this slide)
- However, the Kappa coefficient doesn't accommodate near-miss boundaries
  - Redefine the matching criterion, e.g. also count a near-miss as a match
  - Use other metrics, e.g. probabilistic error metrics
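A small sketch of the reduction described above, assuming each segmentation is given as a set of boundary positions between utterances; it reuses the kappa() function sketched on the previous slide, and the data are hypothetical.

    def boundary_decisions(boundaries, n_gaps):
        """Turn a set of boundary positions into a binary decision at every eligible gap."""
        return ["boundary" if i in boundaries else "non-boundary" for i in range(n_gaps)]

    # Hypothetical: 12 eligible gaps; coder B places one boundary a single gap away from
    # coder A's (a near miss that plain kappa penalises as a full disagreement).
    coder_a = boundary_decisions({3, 7, 10}, 12)
    coder_b = boundary_decisions({3, 8, 10}, 12)
    print(f"K = {kappa(coder_a, coder_b):.2f}")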
7. Probabilistic error metrics
- Pk (Beeferman et al., 1999)
  - Measures how likely two time points are to be classified into different segments
  - A small Pk means a high degree of agreement
- WindowDiff (WD) (Pevzner and Hearst, 2002)
  - Measures the number of intervening topic breaks between time points
  - Penalizes differences in the number of segment boundaries between two time points
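The following is a rough sketch of both metrics under the assumption that each annotation is a list assigning a segment id to every unit; the window size and example segmentations are hypothetical, and library implementations (e.g. in NLTK) may differ in details such as boundary handling.

    def pk(ref, hyp, k):
        """Pk: how often ref and hyp disagree on whether two points k units apart
        fall in the same segment (segmentations are one segment id per unit)."""
        errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                     for i in range(len(ref) - k))
        return errors / (len(ref) - k)

    def window_diff(ref, hyp, k):
        """WindowDiff: fraction of windows whose boundary counts differ."""
        def boundaries(seg, i, j):
            return sum(seg[p] != seg[p + 1] for p in range(i, j))
        n = len(ref)
        return sum(boundaries(ref, i, i + k) != boundaries(hyp, i, i + k)
                   for i in range(n - k)) / (n - k)

    # Hypothetical topic ids for 12 utterances; the two coders differ by one boundary shift.
    # k is typically set to about half the average segment length of the reference.
    ref = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
    hyp = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]
    print(f"Pk = {pk(ref, hyp, k=2):.2f}  WD = {window_diff(ref, hyp, k=2):.2f}")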
8. Structural annotation evaluation
- Cascaded approach
  - Evaluate one level at a time
  - Evaluate the annotation of the higher level only if the annotation of the lower level is agreed on
  - Example: nested game annotation in Map Task (Carletta et al., 1997)
- Redefine matching criteria for structural annotation (Flammia and Zue, 1995)
  - Segment A matches segment B if A contains B
  - Segment A in annotation-i matches with segments in annotation-j if the segments in annotation-j exclude segment A
  - The agreement criterion isn't symmetric
- Flatten the hierarchical structure
  - Flatten the hierarchy into overlapping spans
  - Compute agreement on the spans or the spans' labels (see the sketch after this slide)
  - Example: RST annotation (Marcu et al., 1999)
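Below is a minimal sketch of the span-flattening idea (in the spirit of Marcu et al., 1999), assuming each annotation is a nested (label, start, end, children) tuple over word indices; the two trees and the simple union-based agreement count are hypothetical.

    def spans(tree, out=None):
        """Collect every (start, end) -> label pair from a nested annotation."""
        out = {} if out is None else out
        label, start, end, children = tree
        out[(start, end)] = label
        for child in children:
            spans(child, out)
        return out

    # Two hypothetical hierarchical annotations of the same 10-word stretch.
    a = ("task", 0, 10, [("sub-task", 0, 4, []), ("sub-task", 5, 10, [("concept", 6, 7, [])])])
    b = ("task", 0, 10, [("sub-task", 0, 5, []), ("sub-task", 5, 10, [("concept", 6, 7, [])])])

    sa, sb = spans(a), spans(b)
    candidates = set(sa) | set(sb)                        # every span proposed by either coder
    agree = sum(sa.get(s, "none") == sb.get(s, "none") for s in candidates)
    print(f"span agreement = {agree}/{len(candidates)}")  # kappa can then be computed over these decisions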
9. Form-based dialog structure
- Describes a dialog structure using a task structure: a hierarchical organization of domain information
- Task: a subset of dialogs that has a specific goal
- Sub-task
  - A decomposition of a task
  - Corresponds to one action (the process that uses related pieces of information together to create a new piece of information or a new dialog state)
- Concept: a word or a group of words that captures information necessary for performing an action
- The task structure is domain-dependent
10. An example of form-based structure annotation
- <task name=...>
-   <sub-task name=...>
-     word1 word2 <concept name=...>word3</concept> word4 ... wordn
-     word1 <concept name=...>word2</concept> word3 word4 ... wordn
-     ...
-   </sub-task>
-   <sub-task name=...>
-     ...
-     ...
-     ...
-   </sub-task>
- </task>
11. Annotation experiment
- Goal: to verify that the form-based dialog structure can be understood and applied by other annotators
- The subjects were asked to identify the task structure of dialogs in two domains
  - Air travel planning domain
  - Map reading domain
- A different set of labels is needed for each domain
  - Equivalent to designing domain-specific labels from the definition of the dialog structure components
12. Annotation procedure
- The subjects study an annotation guideline
  - Definition of the task structure
  - Examples from other domains (bus schedule and UAV flight simulation)
- For each domain, the subjects study the transcriptions of 2-3 dialogs
- Create a set of labels for annotating the task structure
- Annotate the given dialogs with the set of labels designed in the previous step
13. Issues in task structure annotation evaluation
- There is more than one acceptable annotation
  - Similar to MT evaluation
  - But difficult to obtain multiple references
- The tag sets used by two annotators may not be the same
  - <time>two thirty</time>
  - <time><hour>two</hour> <min>thirty</min></time>
- Difficult to define matching criteria
  - Mapping equivalent labels between two tag sets is subjective (and may not be possible)
14. Cross-annotator correction
- Ask a different annotator (the 2nd annotator) to judge the annotation and correct the parts that don't conform to the guideline
- If the 2nd annotator agrees with the 1st one, he makes no correction
- The 2nd annotator's own annotation may be different, because there can be more than one annotation that conforms to the guideline
15. Cross-annotator correction (2)
- Pros
  - Easier to evaluate the agreement, since the annotations are based on the same tag set
  - Allows more than one acceptable annotation
- Cons
  - Needs another annotator; takes time
  - Another subjective judgment
  - Need to measure the amount of change made by the 2nd annotator
16. Cross-annotators
- Who should the 2nd annotators be?
- Another subject who also did the annotation
  - Biased toward his own annotation?
- Another subject who studied the guideline but didn't do his/her own annotation
  - May not have thought about the structure thoroughly
- Experts
  - Can also measure annotation accuracy using an expert annotation as a reference
17. How to quantify the amount of correction
- Edit distance from the original annotation (see the sketch after this slide)
  - For structural annotation, the edit operations have to be redefined
  - A lower number means higher agreement, but which range of values is acceptable?
- Inter-annotator agreement
  - Can apply structural annotation evaluation
  - The agreement number is meaningful and can be compared across different domains
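As a baseline for the edit-distance idea, here is a standard Levenshtein distance over flattened label sequences; a proper structural measure would need redefined (tree) edit operations, as noted above, and the before/after sequences are hypothetical.

    def edit_distance(orig, corrected):
        """Levenshtein distance: minimum insertions, deletions, and substitutions."""
        m, n = len(orig), len(corrected)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if orig[i - 1] == corrected[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[m][n]

    # Hypothetical flattened label sequences before and after cross-annotator correction.
    before = ["task", "sub-task", "concept", "concept", "sub-task"]
    after  = ["task", "sub-task", "concept", "sub-task", "concept"]
    print(edit_distance(before, after))   # 2 edits: lower means less correction was needed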
18. Cross-annotation agreement
- Use an approach similar to Marcu et al., 1999
  - Flatten the hierarchy into overlapping spans
  - Compute agreement on the labels of the spans (task, sub-task, and concept labels)
- Issues
  - Many possible spans have no label (especially for concept annotation)
  - How to calculate P(E) when new concepts are added
19. Objective annotation evaluation
- Makes the results more comparable to other work
- Easier to evaluate; doesn't need a 2nd annotator
- Label-insensitive
  - 3 labels: <task>, <sub-task>, <concept>
  - May also consider the level of sub-tasks, e.g. <sub-task1>, <sub-task2>
  - Kappa is artificially high
- Add a qualitative analysis of what the annotators don't agree on
20. References
- J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, pp. 249-254, 1996.
- D. Beeferman, A. Berger, and J. Lafferty, "Statistical Models for Text Segmentation," Machine Learning, vol. 34, pp. 177-210, 1999.
- L. Pevzner and M. A. Hearst, "A critique and improvement of an evaluation metric for text segmentation," Computational Linguistics, vol. 28, pp. 19-36, 2002.
- J. Carletta, S. Isard, G. Doherty-Sneddon, A. Isard, J. C. Kowtko, and A. H. Anderson, "The reliability of a dialogue structure coding scheme," Computational Linguistics, vol. 23, pp. 13-31, 1997.
- G. Flammia and V. Zue, "Empirical evaluation of human performance and agreement in parsing discourse constituents in spoken dialogue," in Proceedings of Eurospeech 1995, Madrid, Spain, 1995.
- D. Marcu, E. Amorrortu, and M. Romera, "Experiments in constructing a corpus of discourse trees," in Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, MD, 1999.
21. Matching criteria (see the sketch after this list)
- Exact match (pairwise)
- Partial match (pairwise)
- Agree with majority (pool of coders)
- Agree with consensus (pool of coders)
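A rough sketch of how the four criteria could be operationalised for a single concept span; the spans, coder names, and the overlap-based notion of partial match are hypothetical choices.

    from collections import Counter

    def exact(a, b):
        return a == b                              # identical (start, end) boundaries

    def partial(a, b):
        return max(a[0], b[0]) < min(a[1], b[1])   # any overlap counts as a match

    # Hypothetical spans marked by a pool of three coders for the same concept.
    coders = {"c1": (0, 5), "c2": (0, 5), "c3": (1, 5)}

    print(exact(coders["c1"], coders["c3"]), partial(coders["c1"], coders["c3"]))  # pairwise: False, True
    majority_span, _ = Counter(coders.values()).most_common(1)[0]
    print(coders["c1"] == majority_span)   # agree with majority: True
    # "Agree with consensus" would instead compare against a span the coders settle on jointly.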