I256: Applied Natural Language Processing

1
I256 Applied Natural Language Processing
Marti Hearst Sept 25, 2006    
2
Shallow Parsing
  • Break text up into non-overlapping contiguous
    subsets of tokens.
  • Also called chunking, partial parsing, light
    parsing.
  • What is it useful for?
      • Entity recognition
          • people, locations, organizations
      • Studying linguistic patterns
          • gave NP
          • gave up NP in NP
          • gave NP NP
          • gave NP to NP
      • Can ignore complex structure when not relevant

3
A Relationship between Segmenting and Labeling
  • Tokenization segments the text
  • Tagging labels the text
  • Shallow parsing does both simultaneously.

4
Chunking vs. Full Syntactic Parsing
  • G.K. Chesterton, author of The Man Who Was
    Thursday

5
Representations for Chunks
  • IOB tags
  • Inside, outside, and begin
  • Why do we need a begin tag?
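
One way to see why a begin tag is needed is the minimal Python sketch below. It uses a hypothetical (label, tokens) representation that is not from the slides, and encodes two chunkings of "gave the dog a bone" (the "gave NP NP" pattern). With B- tags the two encodings differ; once every B- is collapsed into I-, they become identical, so adjacent chunks can no longer be told apart.

```python
def chunks_to_iob(chunks):
    """Flatten a list of (label, tokens) segments into (token, IOB-tag) pairs.

    A label of None marks tokens that sit outside any chunk.
    """
    tagged = []
    for label, tokens in chunks:
        for i, token in enumerate(tokens):
            if label is None:
                tagged.append((token, "O"))
            elif i == 0:
                tagged.append((token, "B-" + label))
            else:
                tagged.append((token, "I-" + label))
    return tagged


def drop_begin_tags(tagged):
    """Rewrite every B- tag as I-, simulating a scheme with no begin tag."""
    return [(tok, "I" + tag[1:] if tag.startswith("B-") else tag)
            for tok, tag in tagged]


# Two noun phrases that touch each other: gave [the dog] [a bone].
two_chunks = [(None, ["gave"]), ("NP", ["the", "dog"]), ("NP", ["a", "bone"])]
# One long (incorrect) noun phrase covering the same words.
one_chunk = [(None, ["gave"]), ("NP", ["the", "dog", "a", "bone"])]

iob_two = chunks_to_iob(two_chunks)
iob_one = chunks_to_iob(one_chunk)

print(iob_two)
print(iob_one)
print(drop_begin_tags(iob_two) == drop_begin_tags(iob_one))
```

The only difference between the two encodings is the tag on "a": B-NP in the first, I-NP in the second.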

6
Representations for Chunks
  • Trees
  • Chunk structure is a two-level tree that spans
    the entire text, containing both chunks and
    non-chunks
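
The two-level tree can be sketched in plain Python as follows. The nested-tuple layout and the function names are assumptions made for illustration; this is not the Tree class from nltk_lite. The sketch builds the tree from IOB-tagged tokens and checks that its leaves still span the whole sentence.

```python
def iob_to_tree(tagged, top_label="S"):
    """Build a two-level tree from (word, pos, iob) triples.

    The tree is a (label, children) tuple. Children are either
    (word, pos) leaves (non-chunks) or (chunk_label, [leaves]) subtrees.
    """
    children = []
    current = None  # the chunk subtree being filled, if any
    for word, pos, iob in tagged:
        if iob == "O":
            current = None
            children.append((word, pos))
        elif iob.startswith("B-") or current is None or current[0] != iob[2:]:
            current = (iob[2:], [(word, pos)])  # open a new chunk
            children.append(current)
        else:  # an I- tag that continues the open chunk
            current[1].append((word, pos))
    return (top_label, children)


def leaves(tree):
    """Return the words of the tree in order, chunked or not."""
    words = []
    for child in tree[1]:
        if isinstance(child[1], list):  # chunk subtree
            words.extend(word for word, _pos in child[1])
        else:  # bare (word, pos) leaf
            words.append(child[0])
    return words


sentence = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"),
            ("sat", "VBD", "O"), ("on", "IN", "O"),
            ("the", "DT", "B-NP"), ("mat", "NN", "I-NP")]
tree = iob_to_tree(sentence)
print(tree)
print(leaves(tree))
```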

7
CoNLL Collection
  • From the shared-task competition at the Conference
    on Computational Natural Language Learning (CoNLL), 2000
  • Goal: create machine learning methods to improve
    on the chunking task

8
CoNLL Collection
  • Data in IOB format from WSJ
      • Each line: word, POS tag, IOB tag
  • Training set: 8936 sentences
  • Test set: 2012 sentences
  • POS tags from the Brill tagger
      • Penn Treebank tags
  • Evaluation measure: F-score
      • F = 2 * precision * recall / (precision + recall)
  • Baseline: select the chunk tag that is most
    frequently associated with the POS tag, F = 77.07
  • Best score in the contest was F = 94.13
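
The baseline and the evaluation measure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official evaluation script: the function names are hypothetical, the training and held-out rows are toy data invented for the example, and precision and recall are counted over whole chunks (a chunk is correct only if its boundaries and label both match).

```python
from collections import Counter, defaultdict


def train_baseline(rows):
    """Map each POS tag to the chunk tag it most often carries in `rows`.

    `rows` is a list of (word, pos, iob) triples.
    """
    counts = defaultdict(Counter)
    for _word, pos, iob in rows:
        counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def extract_chunks(iob_tags):
    """Return the set of (start, end, label) chunks encoded by an IOB sequence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(iob_tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            chunks.add((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return chunks


def f_score(gold_tags, predicted_tags):
    """F = 2 * precision * recall / (precision + recall), over whole chunks."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(predicted_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for this example (not taken from the CoNLL collection).
train = [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "B-VP"),
         ("a", "DT", "B-NP"), ("big", "JJ", "I-NP"), ("cat", "NN", "I-NP"),
         ("ran", "VBD", "B-VP")]
held_out = [("gave", "VBD", "B-VP"), ("the", "DT", "B-NP"),
            ("dog", "NN", "I-NP"), ("food", "NN", "B-NP")]

model = train_baseline(train)
gold = [iob for _w, _p, iob in held_out]
predicted = [model.get(pos, "O") for _w, pos, _i in held_out]  # unseen POS -> "O"
score = f_score(gold, predicted)
print(predicted, score)
```

On this toy input the baseline tags "food" as I-NP, merging two noun phrases into one, so only one of its two predicted chunks matches one of the three gold chunks and the F-score comes out at about 0.4.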

9
nltk_lite and CONLL2000
Note that raw hides the IOB format
10
nltk_lite and CONLL2000
11
nltk_lite and CONLL2000
pp() stands for pretty print and applies to the
Tree data structure.
12
nltk_lite chunks in treebank
13
nltk_lite parses in treebank
14
Chunking with Regular Expressions
  • This time we write regexes over TAGS rather than
    words
      • <DT><JJ>?<NN>
      • <NN.*>
      • <JJ|NN>+
  • Compile them with parse.ChunkRule()
      • rule = parse.ChunkRule('<DT|NN>+')
      • chunkparser = parse.RegexpChunk([rule],
        chunk_node='NP')
  • Resulting object is of type Tree
      • Top-level node called S
      • Can change this label if you want, in third
        argument to RegexpChunk
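
To make the idea of a regular expression over tags concrete, here is a small self-contained sketch built on Python's re module. It is not the nltk_lite implementation: the function names are hypothetical, the parameter names chunk_node and top_node are borrowed from the bullets above purely for readability, and the result is a plain nested tuple rather than a Tree object. The sketch writes the tag sequence as a string such as "<DT><JJ><NN>", translates the tag pattern into an ordinary regex, and maps each match back to token positions.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT><JJ>?<NN>' into an ordinary regex.

    Each <...> becomes one group so that ?, * and + apply to whole tags,
    and '.' is narrowed so it cannot run across a tag boundary.
    """
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return re.compile(pattern.replace(".", "[^<>]"))


def regexp_chunk(tagged, tag_pattern, chunk_node="NP", top_node="S"):
    """Group tokens whose tags match `tag_pattern` into chunk subtrees."""
    regex = tag_pattern_to_regex(tag_pattern)
    tag_string = "".join("<%s>" % tag for _word, tag in tagged)
    children, pos = [], 0  # pos = index of the next token not yet emitted
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        # Every tag starts with '<', so counting '<' gives a token index.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        children.extend(tagged[pos:start])  # tokens left unchunked
        children.append((chunk_node, tagged[start:end]))
        pos = end
    children.extend(tagged[pos:])
    return (top_node, children)


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
tree = regexp_chunk(sentence, "<DT><JJ>?<NN>")
print(tree)
```

Because each tag is wrapped in angle brackets, a pattern like <NN> does not accidentally match NNS, while <NN.*> does.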

15
Chunking with Regular Expressions
16
Chunking with Regular Expressions
  • Rule application is sensitive to order
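
A small sketch of why order matters, assuming (as a simplification) that each rule may only chunk tokens that no earlier rule has already claimed. The helper names are hypothetical and the code is an illustration, not the nltk_lite rule engine. The same two tag patterns are applied to "the little cat" in both orders.

```python
import re


def compile_tag_pattern(tag_pattern):
    """Convert a tag pattern like '<DT><NN>' into a regex over '<DT><NN>' strings."""
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    return re.compile(pattern.replace(".", "[^<>]"))


def apply_rules(tags, tag_patterns):
    """Apply chunk rules in order; a token claimed by one rule is closed to later rules.

    Returns a sorted list of (start, end) token spans, end exclusive.
    """
    claimed = [False] * len(tags)
    spans = []
    for tag_pattern in tag_patterns:
        regex = compile_tag_pattern(tag_pattern)
        # Claimed tokens are written as '{}' so no tag pattern can match them.
        pieces = ["{}" if c else "<%s>" % t for t, c in zip(tags, claimed)]
        offsets = [0]  # character offset at which each token starts
        for piece in pieces:
            offsets.append(offsets[-1] + len(piece))
        for match in regex.finditer("".join(pieces)):
            if match.start() == match.end():
                continue  # ignore empty matches
            start = offsets.index(match.start())
            end = offsets.index(match.end())
            spans.append((start, end))
            claimed[start:end] = [True] * (end - start)
    return sorted(spans)


tags = ["DT", "JJ", "NN"]  # the little cat
narrow_first = apply_rules(tags, ["<JJ|NN>+", "<DT><JJ>?<NN>"])
wide_first = apply_rules(tags, ["<DT><JJ>?<NN>", "<JJ|NN>+"])
print(narrow_first)
print(wide_first)
```

When the adjective-and-noun rule runs first it claims "little cat", leaving the determiner stranded outside the chunk; when the determiner rule runs first the whole phrase becomes a single chunk.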

17
Chinking
  • Specify what does not go into a chunk.
  • Kind of like defining punctuation as whatever is
    not alphanumeric or whitespace.
  • Can be more difficult to think about.
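
Chinking can be sketched as follows: start by treating the whole sentence as one chunk, then cut out the tokens that should not be in any chunk. This is a simplified illustration with hypothetical names; it tests each POS tag on its own against an ordinary regex rather than matching a tag pattern over a sequence of tags.

```python
import re


def chink(tagged, chink_tags, chunk_node="NP"):
    """Chunk everything, then cut out ("chink") tokens whose tag matches `chink_tags`.

    `chink_tags` is an ordinary regex tested against each whole POS tag.
    Returns a list mixing (word, tag) leaves and (chunk_node, [leaves]) chunks.
    """
    is_chink = re.compile(chink_tags)
    result, current = [], []
    for word, tag in tagged:
        if is_chink.fullmatch(tag):
            if current:  # close the chunk that this chink interrupts
                result.append((chunk_node, current))
                current = []
            result.append((word, tag))
        else:
            current.append((word, tag))
    if current:
        result.append((chunk_node, current))
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
chunked = chink(sentence, r"VB.*|IN")  # verbs and prepositions stay outside chunks
print(chunked)
```

Here the rule never says what a noun phrase looks like; it only names the tags that break one up, and the chunks are whatever remains.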

18
Practice Regexp Chunking
  • Write rules to be able to produce this kind of
    chunking

19
Next Time
  • Evaluating Shallow Parsing
  • Begin Text Summarization