TagHelper

TagHelper & SIDE. Carolyn Penstein Rosé, Language Technologies Institute/Human-Computer Interaction Institute ...

1
TagHelper & SIDE

Carolyn Penstein Rosé
Language Technologies Institute/ Human-Computer
Interaction Institute

2
TagHelper Tools and SIDE
Define Summaries
Annotate Data
Visualize Annotated Data
TagHelper Tools uses text mining technology to
automate annotation of conversational data
SIDE facilitates rapid prototyping of
reporting interfaces for group learning
facilitators
22
Setting Up Your Data For TagHelper
23
Setting Up Your Data
24
How do you know when you have coded enough data?
What distinguishes Questions and Statements?
You need to code enough data to avoid learning
rules that won't work
25
Creating a Trained Model
26
Training and Testing

Start TagHelper Tools by double-clicking on the
portal.bat icon in your TagHelperTools2 folder
You will then see the following tool palette
The idea is that you will train a prediction
model on your coded data and then apply that
model to uncoded data
Click on Train New Models

27
Loading a File
First click on Add a File
Then select a file
28
Simplest Usage

Click GO!
TagHelper will use its default settings to train
a model on your coded examples
It will use that model to assign codes to the
uncoded examples
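
The train-then-apply workflow above can be sketched in a few lines of Python. This is a minimal illustration only, not TagHelper's own code: it swaps in a small hand-written Naive Bayes classifier (one of the classifier options named later in this deck), and the segments, the codes, and the function names `train` and `predict` are assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

def train(coded):
    """Collect the counts a Naive Bayes model needs from (text, code) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in coded:
        class_counts[code] += 1
        for word in text.lower().split():
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the code with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[code].values()) + len(vocab)  # add-one smoothing
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[code][word] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Hypothetical coded (hand-labeled) and uncoded segments.
coded = [
    ("what is the common denominator ?", "Question"),
    ("do you think the answer is 9 ?", "Question"),
    ("the answer is 9 .", "Statement"),
    ("i think we multiply .", "Statement"),
]
uncoded = ["is the answer 12 ?", "we multiply by 3 ."]

model = train(coded)
predictions = [predict(model, text) for text in uncoded]
print(predictions)  # ['Question', 'Statement']
```

With only four coded examples the model leans almost entirely on the question mark and the period, which is the risk the earlier slide on coding enough data warns about.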

29
More Advanced Usage

The second option is to modify the default
settings
You get to the options you can set by clicking on
>> Options
After you finish that, click GO!

30
Evaluating Performance
31
Performance report

The performance report tells you
What dataset was used
What the customization settings were
At the bottom of the file are reliability
statistics and a confusion matrix that tells you
which types of errors are being made
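
As a rough illustration of what such a report summarizes, the sketch below computes a confusion matrix and Cohen's kappa for a handful of made-up codes. The slide does not say which reliability statistics TagHelper reports, so kappa is an assumption here, and the data and function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each (hand code, predicted code) pair occurs."""
    return Counter(zip(gold, predicted))

def cohens_kappa(gold, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    expected = sum(gold_freq[c] * pred_freq[c] for c in gold_freq) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical hand codes and model predictions (Q = Question, S = Statement).
gold = ["Q", "Q", "Q", "S", "S", "S", "S", "Q"]
pred = ["Q", "Q", "S", "S", "S", "S", "Q", "Q"]

matrix = confusion_matrix(gold, pred)
kappa = cohens_kappa(gold, pred)
print(matrix[("Q", "S")], matrix[("S", "Q")])  # 1 1
print(kappa)  # 0.5
```

The off-diagonal cells, such as `("Q", "S")`, show which codes are being confused with which. Note that `cohens_kappa` divides by zero if chance agreement is exactly 1, for example when both codings use a single code.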

32
Output File

The output file contains
The codes for each segment
Note that the segments that were already coded
will retain their original code
The other segments will have their automatic
predictions
The prediction column indicates the confidence of
the prediction
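
The way the output file is filled in can be pictured roughly as follows. This is a sketch under stated assumptions, not TagHelper's file format: `toy_predict` is a hypothetical stand-in for a trained model, its confidence values are made up, and segments are represented as simple (text, code) pairs with `None` for uncoded ones.

```python
def toy_predict(text):
    """Stand-in for a trained model; the confidence values are invented."""
    if "?" in text:
        return "Question", 0.9
    return "Statement", 0.6

def fill_in_codes(segments, predict):
    """Keep existing codes; predict a code and confidence for uncoded segments."""
    rows = []
    for text, code in segments:
        if code is not None:
            rows.append((text, code, None))  # hand-coded: original code retained
        else:
            predicted, confidence = predict(text)
            rows.append((text, predicted, confidence))
    return rows

segments = [
    ("you think the answer is 9 ?", "Statement"),  # already coded by hand
    ("is it 12 ?", None),
    ("it is 12 .", None),
]
rows = fill_in_codes(segments, toy_predict)
for row in rows:
    print(row)
```

The first segment keeps its hand-assigned code even though the stand-in model would have labeled it differently.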

33
Overview of Basic Feature Extraction from Text
34
Customizations

To customize the settings
Select the file
Click on Options

35
Classifier Options

Rules of thumb
SMO is state-of-the-art for text classification
J48 is best with small feature sets; it also
handles contingencies between features well
Naïve Bayes works well for models where decisions
are made based on accumulating evidence rather
than hard and fast rules

36
Basic Idea: Represent text as a vector where each
position corresponds to a term. This is called
the bag of words approach

Cows make cheese
110001
Hens lay eggs
001110
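
The two vectors above can be reproduced with a short sketch. It assumes the six vocabulary terms are ordered alphabetically, which is the ordering that yields exactly these bit strings; the function names are illustrative.

```python
def build_vocabulary(texts):
    """One vector position per distinct term, in alphabetical order."""
    return sorted({word for text in texts for word in text.lower().split()})

def to_vector(text, vocabulary):
    """1 if the term occurs in the text, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

texts = ["Cows make cheese", "Hens lay eggs"]
vocabulary = build_vocabulary(texts)
vectors = ["".join(str(bit) for bit in to_vector(t, vocabulary)) for t in texts]

print(vocabulary)  # ['cheese', 'cows', 'eggs', 'hens', 'lay', 'make']
print(vectors)     # ['110001', '001110']
```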

37
What can't you conclude from bag of words
representations?

Causality: "X caused Y" versus "Y caused X"
Roles and Mood: "Which person ate the food that I
prepared this morning and drives the big car in
front of my cat?" versus "The person, which
prepared food that my cat and I ate this morning,
drives in front of the big car."
Who's driving, who's eating, and who's preparing
food?
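
A quick way to see the causality problem: two sentences that use the same words in a different order collapse to the same bag of words. The sentences below are made-up examples.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts only; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the rain caused the flood")
b = bag_of_words("the flood caused the rain")
same = a == b
print(same)  # True: the representation cannot tell which event caused which
```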

38
Basic Anatomy: Layers of Linguistic Analysis

Phonology: The sound structure of language
Basic sounds, syllables, rhythm, intonation
Morphology: The building blocks of words
Inflection: tense, number, gender
Derivation: building words from other words,
transforming part of speech
Syntax: Structural and functional relationships
between spans of text within a sentence
Phrase and clause structure
Semantics: Literal meaning, propositional content
Pragmatics: Non-literal meaning, language use,
language as action, social aspects of language
(tone, politeness)
Discourse Analysis: Language in practice,
relationships between sentences, interaction
structures, discourse markers, anaphora and
ellipsis

39
Part of Speech Tagging
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative

40
Part of Speech Tagging
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb

41
TagHelper Customizations

Feature Space Design
Think like a computer!
Machine learning algorithms look for features
that are good predictors, not features that are
necessarily meaningful
Look for approximations
If you want to find questions, you don't need to
do a complete syntactic analysis
Look for question marks
Look for wh-terms that occur immediately before
an auxiliary verb
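
The two approximations above might look like this in code. It is a sketch only: the word lists are short, hand-picked assumptions rather than anything taken from TagHelper, and a real feature would more likely use part-of-speech tags than a fixed list of auxiliaries.

```python
import re

# Hypothetical, deliberately incomplete word lists.
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should"}

def looks_like_question(text):
    """Cheap question detector: a question mark, or a wh-term right before an auxiliary."""
    if "?" in text:
        return True
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(first in WH_TERMS and second in AUXILIARIES
               for first, second in zip(tokens, tokens[1:]))

print(looks_like_question("you think the answer is 9?"))  # True
print(looks_like_question("how do we find it"))           # True
print(looks_like_question("the answer is 9."))            # False
```

Being an approximation, it also fires on relative clauses such as "the answer which is 9", so it works better as one predictive feature among many than as a rule on its own.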

42
TagHelper Customizations

Feature Space Design
Punctuation can be a stand-in for mood
"you think the answer is 9?"
"you think the answer is 9."
Bigrams capture simple lexical patterns
"common denominator" versus "common multiple"
POS bigrams capture syntactic or stylistic
information
"the answer which is" vs "which is the answer"
Line length can be a proxy for explanation depth
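
Three of these features are simple enough to sketch directly: word bigrams, a punctuation flag, and line length. The feature names and the choice to measure length in whitespace-separated tokens are assumptions for illustration. POS bigrams are left out because they need a part-of-speech tagger.

```python
def word_bigrams(text):
    """Adjacent word pairs, e.g. 'common denominator' -> 'common_denominator'."""
    tokens = text.lower().split()
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def extract_features(text):
    """A few surface features for one conversational contribution."""
    return {
        "bigrams": word_bigrams(text),
        "ends_with_question_mark": text.rstrip().endswith("?"),
        "line_length": len(text.split()),  # length in tokens
    }

question = extract_features("you think the answer is 9?")
statement = extract_features("you think the answer is 9.")
print(question["ends_with_question_mark"], statement["ends_with_question_mark"])  # True False
print(word_bigrams("common denominator"))  # ['common_denominator']
```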

43
TagHelper Customizations

Feature Space Design
"Contains non-stop word" can be a predictor of
whether a conversational contribution is
contentful
"ok sure" versus "the common denominator"
"Remove stop words" removes some distracting
features
Stemming allows some generalization
Multiple, multiply, multiplication
Removing rare features is a cheap form of feature
selection
Features that only occur once or twice in the
corpus won't generalize, so they are a waste of
time to include in the vector space
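
The sketch below illustrates these three ideas on toy data. Everything in it is an assumption made for the example: the stop list is a tiny hand-picked set, `crude_stem` is a toy suffix stripper rather than a real stemming algorithm, and the threshold of three occurrences simply mirrors "once or twice" above.

```python
from collections import Counter

# Hypothetical stop list; real lists are much longer.
STOP_WORDS = {"ok", "sure", "the", "a", "an", "is", "of", "to"}

def contains_non_stop_word(text):
    """True if at least one word is not a stop word."""
    return any(word not in STOP_WORDS for word in text.lower().split())

def crude_stem(word):
    """Toy stemmer: strip one of a few suffixes, keeping at least 3 letters."""
    for suffix in ("ication", "e", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def frequent_features(tokenized_texts, min_count=3):
    """Keep only features that occur at least min_count times in the corpus."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}

stems = {crude_stem(w) for w in ("multiple", "multiply", "multiplication")}
corpus = [["common", "denominator"], ["common", "multiple"],
          ["common", "factor"], ["denominator"]]
kept = frequent_features(corpus)

print(contains_non_stop_word("ok sure"))                 # False
print(contains_non_stop_word("the common denominator"))  # True
print(stems)                                             # {'multipl'}
print(kept)                                              # {'common'}
```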

44
Created Features
45
Why create new features by hand?

Rules
For simple rules, it might be easier and faster
to write the rules by hand instead of learning
them from examples
Features
More likely to capture meaningful generalizations
Build in knowledge so you can get by with less
training data

46
Rule Language

ANY() is used to create lists
COLOR ANY(red,yellow,green,blue,purple)
FOOD ANY(cake,pizza,hamburger,steak,bread)
ALL() is used to capture contingencies
ALL(cake,presents)
More complex rules
ALL(COLOR,FOOD)

Note that you may wish to use part-of-speech
tags in your rules!
47
What can you do with this rule language?

You may want to generalize across sets of related
words
Color: red, yellow, orange, green, blue
Food: cake, pizza, hamburger, steak, bread
You may want to detect contingencies
The text must mention both cake and presents in
order to count as a birthday party
You may want to combine these
The text must include a Color and a Food
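
One way to picture how ANY() and ALL() rules behave is to model them as small Python functions. This is an illustrative sketch, not TagHelper's rule engine: the helper names `matches` and `apply_rule` are hypothetical, matching is done on lowercased whitespace-separated words, and the BIRTHDAY name is made up for the cake-and-presents rule.

```python
def matches(item, words):
    """An item is either a plain word or another rule (a function)."""
    return item(words) if callable(item) else item in words

def ANY(*items):
    """Rule that fires when at least one item matches."""
    return lambda words: any(matches(item, words) for item in items)

def ALL(*items):
    """Rule that fires only when every item matches."""
    return lambda words: all(matches(item, words) for item in items)

def apply_rule(rule, text):
    return rule(set(text.lower().split()))

COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
BIRTHDAY = ALL("cake", "presents")
COLOR_AND_FOOD = ALL(COLOR, FOOD)

print(apply_rule(COLOR_AND_FOOD, "a green cake"))     # True
print(apply_rule(COLOR_AND_FOOD, "a green car"))      # False
print(apply_rule(BIRTHDAY, "cake and presents"))      # True
```

Because rules can take other rules as arguments, lists such as COLOR and FOOD can be reused inside larger contingency rules.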

48
Advanced Feature Editing
49
Advanced Feature Editing
50
Advanced Feature Editing
51
Advanced Feature Editing
52
Types of Basic Features

Primitive features include unigrams, bigrams,
and POS bigrams

53
Types of Basic Features

The Options change which primitive features show
up in the Unigram, Bigram, and POS bigram lists
You can choose to remove stopwords or not
You can choose whether or not to strip endings
off words with stemming
You can choose how frequently a feature must
appear in your data in order for it to show up in
your lists

54
Types of Basic Features
Now let's look at how to create new features.
55
Creating New Features
56
Creating New Features
57
Creating New Features
58
Creating New Features
59
Creating New Features
60
Creating New Features
61
Creating New Features
62
Using the Display Option
66
Viewing Created Features
67
Viewing Created Features
68
Viewing Created Features
69
Any Questions?
