TagHelper - PowerPoint PPT Presentation

Slides: 70
Provided by: cpr1
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: TagHelper


1
TagHelper & SIDE
  • Carolyn Penstein Rosé
  • Language Technologies Institute/ Human-Computer
    Interaction Institute

2
TagHelper Tools and SIDE
Define Summaries
Annotate Data
Visualize Annotated Data
TagHelper Tools uses text mining technology to
automate annotation of conversational data
SIDE facilitates rapid prototyping of
reporting interfaces for group learning
facilitators
22
Setting Up Your Data For TagHelper
23
Setting Up Your Data
24
How do you know when you have coded enough data?
What distinguishes Questions and Statements?
You need to code enough to avoid learning rules
that won't work
25
Creating a Trained Model
26
Training and Testing
  • Start TagHelper Tools by double-clicking on the
    portal.bat icon in your TagHelperTools2 folder
  • You will then see the following tool palette
  • The idea is that you will train a prediction
    model on your coded data and then apply that
    model to uncoded data
  • Click on Train New Models

27
Loading a File
First click on Add a File
Then select a file
28
Simplest Usage
  • Click GO!
  • TagHelper will use its default setting to train a
    model on your coded examples
  • It will use that model to assign codes to the
    uncoded examples
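
The train-then-apply workflow described above can be sketched in a few lines. This is a minimal illustration, not TagHelper's actual implementation or its default settings: the function names (`train_naive_bayes`, `predict`, `fill_in_codes`), the whitespace tokenizer, and the choice of a Naive Bayes classifier with add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(coded):
    """Train on (text, code) pairs. All names here are hypothetical."""
    word_counts = defaultdict(Counter)  # code -> word -> count
    code_counts = Counter()             # code -> number of examples
    vocab = set()
    for text, code in coded:
        words = text.lower().split()
        code_counts[code] += 1
        word_counts[code].update(words)
        vocab.update(words)
    return {"word_counts": word_counts, "code_counts": code_counts, "vocab": vocab}

def predict(model, text):
    """Return the most probable code, using add-one smoothing."""
    total = sum(model["code_counts"].values())
    vocab_size = len(model["vocab"])
    best_code, best_score = None, -math.inf
    for code, n in model["code_counts"].items():
        counts = model["word_counts"][code]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / total)  # prior for this code
        for word in text.lower().split():
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

def fill_in_codes(segments):
    """Segments are (text, code_or_None); coded rows keep their code."""
    coded = [(t, c) for t, c in segments if c is not None]
    model = train_naive_bayes(coded)
    return [(t, c if c is not None else predict(model, t)) for t, c in segments]

segments = [
    ("what is the answer ?", "Question"),
    ("how do we find it ?", "Question"),
    ("the answer is nine .", "Statement"),
    ("we multiply the numbers .", "Statement"),
    ("what do we multiply ?", None),  # uncoded segment
]
result = fill_in_codes(segments)
print(result[-1])
```

Only the last segment is uncoded, so it is the only one whose code comes from the model.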

29
More Advanced Usage
  • The second option is to modify the default
    settings
  • You get to the options you can set by clicking on
    >> Options
  • After you finish that, click GO!

30
Evaluating Performance
31
Performance report
  • The performance report tells you
  • What dataset was used
  • What the customization settings were
  • At the bottom of the file are reliability
    statistics and a confusion matrix that tells you
    which types of errors are being made
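
To make the last point concrete, here is a small sketch of how a confusion matrix and one reliability statistic can be computed from human codes and predicted codes. The slides do not say which statistics the report contains; Cohen's kappa is used here as an assumption because it is a common choice, and the function names and sample data are hypothetical.

```python
from collections import Counter

def confusion_matrix(human, predicted):
    """Count (human code, predicted code) pairs."""
    return Counter(zip(human, predicted))

def cohens_kappa(human, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(human)
    observed = sum(h == p for h, p in zip(human, predicted)) / n
    h_freq = Counter(human)
    p_freq = Counter(predicted)
    expected = sum(h_freq[c] * p_freq[c] for c in h_freq) / (n * n)
    if expected == 1:
        return 1.0  # both codings use a single identical code
    return (observed - expected) / (1 - expected)

# Made-up example: Q = Question, S = Statement
human = ["Q", "Q", "S", "S", "S", "Q"]
predicted = ["Q", "S", "S", "S", "S", "Q"]

matrix = confusion_matrix(human, predicted)
kappa = cohens_kappa(human, predicted)
print(matrix[("Q", "S")], round(kappa, 3))
```

The off-diagonal cells, such as `("Q", "S")`, show which code is being mistaken for which.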

32
Output File
  • The output file contains
  • The codes for each segment
  • Note that the segments that were already coded
    will retain their original code
  • The other segments will have their automatic
    predictions
  • The prediction column indicates the confidence of
    the prediction

33
Overview of Basic Feature Extraction from Text
34
Customizations
  • To customize the settings
  • Select the file
  • Click on Options

35
Classifier Options
  • Rules of thumb
  • SMO is state-of-the-art for text classification
  • J48 is best with small feature sets; it also
    handles contingencies between features well
  • Naïve Bayes works well for models where decisions
    are made based on accumulating evidence rather
    than hard and fast rules

36
Basic Idea: Represent text as a vector where each
position corresponds to a term. This is called
the bag of words approach.
  • Cows make cheese
  • 110001
  • Hens lay eggs
  • 001110
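
The two vectors above can be reproduced with a short sketch. It assumes the six terms are ordered alphabetically (cheese, cows, eggs, hens, lay, make), which is consistent with the vectors shown on the slide; the function names are hypothetical.

```python
def build_vocabulary(texts):
    """Collect every distinct lowercased word, in alphabetical order."""
    return sorted({word for text in texts for word in text.lower().split()})

def to_vector(text, vocabulary):
    """1 if the term occurs in the text, 0 otherwise (word order is lost)."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

texts = ["Cows make cheese", "Hens lay eggs"]
vocab = build_vocabulary(texts)
vectors = ["".join(str(bit) for bit in to_vector(t, vocab)) for t in texts]
print(vocab)
print(vectors)
```

Because each position only records whether a term is present, "Cows make cheese" and "Cheese make cows" would get the same vector.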

37
What can't you conclude from bag of words
representations?
  • Causality: "X caused Y" versus "Y caused X"
  • Roles and Mood: "Which person ate the food that I
    prepared this morning and drives the big car in
    front of my cat" versus "The person, which
    prepared food that my cat and I ate this morning,
    drives in front of the big car."
  • Who's driving, who's eating, and who's preparing
    food?

38
Basic Anatomy: Layers of Linguistic Analysis
  • Phonology: The sound structure of language
  • Basic sounds, syllables, rhythm, intonation
  • Morphology: The building blocks of words
  • Inflection: tense, number, gender
  • Derivation: building words from other words,
    transforming part of speech
  • Syntax: Structural and functional relationships
    between spans of text within a sentence
  • Phrase and clause structure
  • Semantics: Literal meaning, propositional content
  • Pragmatics: Non-literal meaning, language use,
    language as action, social aspects of language
    (tone, politeness)
  • Discourse Analysis: Language in practice,
    relationships between sentences, interaction
    structures, discourse markers, anaphora and
    ellipsis

39
Part of Speech Tagging
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
  • 1. CC Coordinating conjunction
  • 2. CD Cardinal number
  • 3. DT Determiner
  • 4. EX Existential there
  • 5. FW Foreign word
  • 6. IN Preposition/subordinating conjunction
  • 7. JJ Adjective
  • 8. JJR Adjective, comparative
  • 9. JJS Adjective, superlative
  • 10. LS List item marker
  • 11. MD Modal
  • 12. NN Noun, singular or mass
  • 13. NNS Noun, plural
  • 14. NNP Proper noun, singular
  • 15. NNPS Proper noun, plural
  • 16. PDT Predeterminer
  • 17. POS Possessive ending
  • 18. PRP Personal pronoun
  • 19. PRP$ Possessive pronoun
  • 20. RB Adverb
  • 21. RBR Adverb, comparative
  • 22. RBS Adverb, superlative

40
Part of Speech Tagging
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
  • 23. RP Particle
  • 24. SYM Symbol
  • 25. TO to
  • 26. UH Interjection
  • 27. VB Verb, base form
  • 28. VBD Verb, past tense
  • 29. VBG Verb, gerund/present participle
  • 30. VBN Verb, past participle
  • 31. VBP Verb, non-3rd ps. sing. present
  • 32. VBZ Verb, 3rd ps. sing. present
  • 33. WDT wh-determiner
  • 34. WP wh-pronoun
  • 35. WP$ Possessive wh-pronoun
  • 36. WRB wh-adverb

41
TagHelper Customizations
  • Feature Space Design
  • Think like a computer!
  • Machine learning algorithms look for features
    that are good predictors, not features that are
    necessarily meaningful
  • Look for approximations
  • If you want to find questions, you don't need to
    do a complete syntactic analysis
  • Look for question marks
  • Look for wh-terms that occur immediately before
    an auxiliary verb
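
The two approximations for finding questions can be sketched as a single check. The word lists `WH_TERMS` and `AUXILIARIES` and the function name `looks_like_question` are assumptions made for this example, and the lists are deliberately incomplete.

```python
import re

# Hypothetical, incomplete word lists for illustration only
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should"}

def looks_like_question(text):
    """Approximate question detection without any syntactic analysis."""
    if "?" in text:
        return True
    words = re.findall(r"[a-z']+", text.lower())
    # A wh-term immediately followed by an auxiliary verb
    return any(a in WH_TERMS and b in AUXILIARIES
               for a, b in zip(words, words[1:]))

print(looks_like_question("you think the answer is 9?"))
print(looks_like_question("what is the common denominator"))
print(looks_like_question("you think the answer is 9."))
```

Being an approximation, the wh-term check will also fire on some statements, such as relative clauses; a learning algorithm only needs the feature to be a good predictor, not a perfect one.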

42
TagHelper Customizations
  • Feature Space Design
  • Punctuation can be a stand-in for mood
  • "you think the answer is 9?"
  • "you think the answer is 9."
  • Bigrams capture simple lexical patterns
  • "common denominator" versus "common multiple"
  • POS bigrams capture syntactic or stylistic
    information
  • "the answer which is" vs. "which is the answer"
  • Line length can be a proxy for explanation depth
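
As a rough illustration of these feature types, the sketch below extracts word bigrams, a punctuation feature, and line length from one line of text. The feature names and the whitespace tokenizer are assumptions; POS bigrams are left out because they would require a part-of-speech tagger.

```python
def bigrams(tokens):
    """Adjacent word pairs, joined with an underscore."""
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]

def extract_features(text):
    """Map one line of text to a dict of hypothetical feature names."""
    tokens = text.lower().split()
    features = {"bigram=" + b: 1 for b in bigrams(tokens)}
    features["has_question_mark"] = int("?" in text)  # stand-in for mood
    features["line_length"] = len(tokens)             # proxy for depth
    return features

question = extract_features("you think the answer is 9?")
statement = extract_features("you think the answer is 9.")
print(question["has_question_mark"], statement["has_question_mark"])
print(extract_features("common denominator"))
```

The two example lines differ only in their final punctuation mark, and the punctuation feature is what separates them.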

43
TagHelper Customizations
  • Feature Space Design
  • "Contains non-stop word" can be a predictor of
    whether a conversational contribution is
    contentful
  • "ok sure" versus "the common denominator"
  • "Remove stop words" removes some distracting
    features
  • Stemming allows some generalization
  • Multiple, multiply, multiplication
  • Removing rare features is a cheap form of feature
    selection
  • Features that only occur once or twice in the
    corpus won't generalize, so they are a waste of
    time to include in the vector space
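
These options can be sketched as small functions. Everything here is an assumption made for illustration: the stop word list is a tiny hypothetical one, `crude_stem` is a toy suffix stripper (not the stemmer TagHelper uses), and the threshold of three occurrences simply reflects the "once or twice" wording above.

```python
from collections import Counter

# Hypothetical, tiny stop word list for illustration only
STOP_WORDS = {"ok", "sure", "the", "a", "is", "of"}

def contains_non_stop_word(text):
    """True if at least one word is not a stop word."""
    return any(w not in STOP_WORDS for w in text.lower().split())

def crude_stem(word):
    """Toy suffix stripper; a real stemmer handles far more cases."""
    for suffix in ("ication", "e", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def remove_rare(docs, min_count=3):
    """Drop words that occur fewer than min_count times in the corpus."""
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]

stems = {crude_stem(w) for w in ("multiple", "multiply", "multiplication")}
print(contains_non_stop_word("ok sure"))
print(contains_non_stop_word("the common denominator"))
print(stems)
print(remove_rare([["a", "b"], ["a"], ["a", "c"]]))
```

All three word forms collapse to one stem, which is the generalization that stemming buys.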

44
Created Features
45
Why create new features by hand?
  • Rules
  • For simple rules, it might be easier and faster
    to write the rules by hand instead of learning
    them from examples
  • Features
  • More likely to capture meaningful generalizations
  • Build in knowledge so you can get by with less
    training data

46
Rule Language
  • ANY() is used to create lists
  • COLOR ANY(red,yellow,green,blue,purple)
  • FOOD ANY(cake,pizza,hamburger,steak,bread)
  • ALL() is used to capture contingencies
  • ALL(cake,presents)
  • More complex rules
  • ALL(COLOR,FOOD)

Note that you may wish to use part-of-speech
tags in your rules!
47
What can you do with this rule language?
  • You may want to generalize across sets of related
    words
  • Color: red, yellow, orange, green, blue
  • Food: cake, pizza, hamburger, steak, bread
  • You may want to detect contingencies
  • The text must mention both cake and presents in
    order to count as a birthday party
  • You may want to combine these
  • The text must include a Color and a Food
48
Advanced Feature Editing
49
Advanced Feature Editing
50
Advanced Feature Editing
51
Advanced Feature Editing
52
Types of Basic Features
  • Primitive features include unigrams, bigrams,
    and POS bigrams

53
Types of Basic Features
  • The Options change which primitive features show
    up in the Unigram, Bigram, and POS bigram lists
  • You can choose to remove stopwords or not
  • You can choose whether or not to strip endings
    off words with stemming
  • You can choose how frequently a feature must
    appear in your data in order for it to show up in
    your lists

54
Types of Basic Features
Now let's look at how to create new features.
55
Creating New Features
56
Creating New Features
57
Creating New Features
58
Creating New Features
59
Creating New Features
60
Creating New Features
61
Creating New Features
62
Using the Display Option
66
Viewing Created Features
67
Viewing Created Features
68
Viewing Created Features
69
Any Questions?