I256: Applied Natural Language Processing

Transcript and Presenter's Notes

1
I256: Applied Natural Language Processing
Marti Hearst, August 28, 2006
2
Today
  • Motivation: SIMS student projects
  • Course Goals
  • Why NLP is difficult
  • How to solve it? Corpus-based statistical
    approaches
  • What we'll do in this course

3
ANLP Motivation: SIMS Masters Projects
  • HomeSkim (2005)
  • Chan, Lib, Mittal, Poon
  • Apartment search mashup
  • Extracted fields from Craigslist listings
  • http://www.ischool.berkeley.edu/programs/masters/projects/2006/homeskim
  • Orpheus (2004)
  • Maury, Viswanathan, Yang
  • Tool for discovering new and independent
    recording artists
  • Extracted artists, links, reviews from music
    websites
  • http://groups.sims.berkeley.edu/orpheus/demo/orpheus_demo.swf
  • Breaking Story (2002)
  • Reffell, Fitzpatrick, Aydelott
  • Summarize trends in news feeds
  • Categories and entities assigned to all news
    articles
  • http://dream.sims.berkeley.edu/newshound/

4-5
(No transcript)
6
HomeSkim: Craigslist Analysis
7-10
(No transcript)
11
Goals of this Course
  • Learn about the problems and possibilities of
    natural language analysis
  • What are the major issues?
  • What are the major solutions?
  • How well do they work?
  • How do they work? (In less depth than in CS
    295-4)
  • At the end you should:
  • Agree that language is subtle and interesting!
  • Feel some ownership over the algorithms
  • Be able to assess NLP problems
  • Know which solutions to apply when, and how
  • Be able to read papers in the field

12
Today
  • Motivation: SIMS student projects
  • Course Goals
  • Why NLP is difficult.
  • How to solve it? Corpus-based statistical
    approaches.
  • What we'll do in this course.

13
We've passed the year 2001, but we are not close
to realizing the dream (or nightmare)
14
  • Dave Bowman: Open the pod bay doors, HAL.

HAL 9000: I'm sorry, Dave. I'm afraid I can't do
that.
15
Why is NLP difficult?
  • Computers are not brains
  • There is evidence that much of language
    understanding is built-in to the human brain
  • Computers do not socialize
  • Much of language is about communicating with
    people
  • Key problems:
  • Representation of meaning
  • Language presupposes knowledge about the world
  • Language only reflects the surface of meaning
  • Language presupposes communication between people

16
Hidden Structure
  • English plural pronunciation
  • Toy + s → toyz (add z)
  • Book + s → books (add s)
  • Church + s → churchiz (add iz)
  • Box + s → boxiz (add iz)
  • Sheep + s → sheep (add nothing)
  • What about new words?
  • Bach + s → Bachs; why not Bachiz? (A rule sketch
    follows below.)

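This regularity can be stated as a rule plus a memorized exception list. Here is a minimal Python sketch of that idea (my addition, not from the slides); it uses spelling endings as a rough stand-in for the final sound, and the word lists and the plural_pronunciation helper are illustrative assumptions.

# Approximate the English plural-pronunciation rule: the ending (/z/, /s/,
# or /iz/) depends on the final sound of the word; memorized exceptions win.
SIBILANT_ENDINGS = ('s', 'z', 'x', 'ch', 'sh')   # church, box -> add /iz/
VOICELESS_ENDINGS = ('p', 't', 'k', 'f')          # book -> add /s/
EXCEPTIONS = {'sheep': 'sheep'}                   # memorized: add nothing

def plural_pronunciation(word):
    """Return a rough rendering of how the plural of `word` is pronounced."""
    if word in EXCEPTIONS:                  # memorized form takes precedence
        return EXCEPTIONS[word]
    if word.endswith(SIBILANT_ENDINGS):     # add /iz/
        return word + 'iz'
    if word.endswith(VOICELESS_ENDINGS):    # add /s/
        return word + 's'
    return word + 'z'                       # default: add /z/

for w in ('toy', 'book', 'church', 'box', 'sheep'):
    print(w, '+ s ->', plural_pronunciation(w))
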
17
Language subtleties
  • Adjective order and placement
  • A big black dog
  • A big black scary dog
  • A big scary dog
  • A scary big dog
  • A black big dog
  • Antonyms
  • Which sizes go together?
  • Big and little
  • Big and small
  • Large and small
  • Large and little

18
World Knowledge is subtle
  • He arrived at the lecture.
  • He chuckled at the lecture.
  • He arrived drunk.
  • He chuckled drunk.
  • He chuckled his way through the lecture.
  • He arrived his way through the lecture.

19
Words are ambiguous (have multiple meanings)
  • I know that.
  • I know that block.
  • I know that blocks the sun.
  • I know that block blocks the sun.

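To make "have multiple meanings" concrete, here is a small sketch (my addition; the 2006 course used NLTK-lite, whose API differs from the current NLTK used here) that lists the WordNet senses of "block", covering both the noun and verb readings the sentences above play on. It assumes the WordNet data has been installed via nltk.download('wordnet').

# Print every WordNet sense of "block": noun senses (the toy block) and
# verb senses (to block the sun) come back as separate synsets.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('block'):
    print(synset.name(), synset.pos(), '-', synset.definition())
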
20
How can a machine understand these differences?
  • Get the cat with the gloves.

21
How can a machine understand these differences?
  • Get the sock from the cat with the gloves.
  • Get the glove from the cat with the socks.

22
How can a machine understand these differences?
  • Decorate the cake with the frosting.
  • Decorate the cake with the kids.
  • Throw out the cake with the frosting.
  • Throw out the cake with the kids.

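One way to see why these sentences are hard for a machine: even a tiny grammar licenses more than one parse of the same string. The sketch below is my addition, written against the current NLTK API rather than the course's NLTK-lite, and the toy grammar is an illustrative assumption.

import nltk

# A toy grammar in which "with the gloves" can attach either to the verb
# phrase (use the gloves to get the cat) or to the noun phrase (the cat
# that has the gloves).
grammar = nltk.CFG.fromstring("""
S -> VP
VP -> V NP | VP PP
NP -> Det N | NP PP
PP -> P NP
V -> 'get'
Det -> 'the'
N -> 'cat' | 'gloves'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('get the cat with the gloves'.split()):
    print(tree)   # two trees: one per attachment
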
23
Headline Ambiguity
  • Iraqi Head Seeks Arms
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • Kids Make Nutritious Snacks
  • British Left Waffles on Falkland Islands
  • Red Tape Holds Up New Bridges
  • Bush Wins on Budget, but More Lies Ahead
  • Hospitals are Sued by 7 Foot Doctors

24
The Role of Memorization
  • Children learn words quickly
  • Around age two they learn about 1 word every 2
    hours.
  • (Or 9 words/day)
  • Often only need one exposure to associate meaning
    with word
  • Can make mistakes, e.g., overgeneralization
  • I goed to the store.
  • Exactly how they do this is still under study
  • Adult vocabulary
  • Typical adult about 60,000 words
  • Literate adults about twice that.

25
The Role of Memorization
  • Dogs can do word association too!
  • Rico, a border collie in Germany
  • Knows the names of each of 100 toys
  • Can retrieve items called out to him with over
    90% accuracy.
  • Can also learn and remember the names of
    unfamiliar toys after just one encounter, putting
    him on a par with a three-year-old child.

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
26
But there is too much to memorize!
  • establish
  • establishment
  • the Church of England as the official state
    church.
  • disestablishment
  • antidisestablishment
  • antidisestablishmentarian
  • antidisestablishmentarianism
  • is a political philosophy that is opposed to the
    separation of church and state.

27
Rules and Memorization
  • Current thinking in psycholinguistics is that we
    use a combination of rules and memorization
  • However, this is very controversial
  • Mechanism (sketched in code below):
  • If there is an applicable rule, apply it
  • However, if there is a memorized version, that
    takes precedence. (Important for irregular
    words.)
  • Artists paint still lifes
  • Not still lives
  • Past tense of:
  • think → thought
  • blink → blinked
  • This is a simplification; for more on this, see
    Pinker's Words and Rules and The Language
    Instinct.

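A minimal sketch of the lookup-then-rule mechanism described above (my addition, a toy illustration rather than the psycholinguistic model itself); the irregular-verb table is an illustrative assumption.

# Past-tense formation: a memorized irregular form takes precedence;
# otherwise the regular "+ed" rule applies.
IRREGULAR_PAST = {'think': 'thought', 'go': 'went', 'sing': 'sang'}

def past_tense(verb):
    if verb in IRREGULAR_PAST:     # memorized version takes precedence
        return IRREGULAR_PAST[verb]
    if verb.endswith('e'):         # regular rule: bake -> baked
        return verb + 'd'
    return verb + 'ed'             # regular rule: blink -> blinked

for v in ('think', 'blink', 'go', 'walk'):
    print(v, '->', past_tense(v))
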
28
Representation of Meaning
  • I know that block blocks the sun.
  • How do we represent the meanings of block?
  • How do we represent I know?
  • How does that differ from I know that.?
  • Who is I?
  • How do we indicate that we are talking about
    earths sun vs. some other planets sun?
  • When did this take place? What if I move the
    block? What if I move my viewpoint? How do we
    represent this?

29
How to tackle these problems?
  • The field was stuck for quite some time.
  • A new approach started around 1990
  • Well, not really new, but the first time around,
    in the '50s, they didn't have the text, disk
    space, or GHz
  • Main idea: combine memorizing and rules
  • How to do it:
  • Get large text collections (corpora)
  • Compute statistics over the words in those
    collections (see the sketch below)
  • Surprisingly effective
  • Even better now with the Web

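As a minimal sketch of "compute statistics over the words" (my addition; the corpus file name is a placeholder assumption):

# Count word frequencies over a text collection.
from collections import Counter
import re

def word_counts(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

corpus = open('my_corpus.txt', encoding='utf-8').read()    # placeholder corpus
counts = word_counts(corpus)
print(counts.most_common(10))                    # most frequent words
print(counts['principal'], counts['principle'])  # statistics used on the next slides
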
30
Example Problem
  • Grammar checker example
  • Which word to use?
  • <principal> or <principle>
  • Solution: look at which words surround each use
  • I am in my third year as the principal of Anamosa
    High School.
  • School-principal transfers caused some upset.
  • This is a simple formulation of the quantum
    mechanical uncertainty principle.
  • Power without principle is barren, but principle
    without power is futile. (Tony Blair)

31
Using Very, Very Large Corpora
  • Keep track of which words are the neighbors of
    each spelling in well-edited text, e.g.
  • Principal: high school
  • Principle: rule
  • At grammar-check time, choose the spelling best
    predicted by the surrounding words (a sketch
    follows below).
  • Surprising results
  • Log-linear improvement even to a billion words!
  • Getting more data is better than fine-tuning
    algorithms!

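Here is a minimal sketch of the neighbor-word idea (my addition, not the exact method behind the slide); the training sentences and window size are illustrative assumptions.

# Choose between confusable spellings by counting which words appear near
# each spelling in well-edited text, then picking the spelling whose stored
# neighbors best match the context at grammar-check time.
from collections import Counter

CONFUSED = ('principal', 'principle')

def train(sentences, window=2):
    neighbors = {w: Counter() for w in CONFUSED}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w in neighbors:
                context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                neighbors[w].update(context)
    return neighbors

def choose(context_words, neighbors):
    return max(CONFUSED, key=lambda w: sum(neighbors[w][c] for c in context_words))

model = train(['the high school principal retired',
               'the uncertainty principle is a basic rule'])
print(choose(['high', 'school'], model))   # should print: principal
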
32
The Effects of LARGE Datasets
  • From Banko & Brill '01

33
Real-World Applications of NLP
  • Spelling Suggestions/Corrections
  • Grammar Checking
  • Synonym Generation
  • Information Extraction
  • Text Categorization
  • Automated Customer Service
  • Speech Recognition (limited)
  • Machine Translation
  • In the (near?) future
  • Question Answering
  • Improving Web Search Engine results
  • Automated Metadata Assignment
  • Online Dialogs

34
Automatic Help Desk Translation at Microsoft
35
Synonym Generation
36
Synonym Generation
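The screenshots for these two slides are not in the transcript. As a hedged sketch of one simple way to generate synonym candidates (my addition; not necessarily how the system shown in the slides works), WordNet can be queried via NLTK. It assumes the WordNet data is installed (nltk.download('wordnet')).

# Collect synonym candidates for a word from all of its WordNet synsets.
from nltk.corpus import wordnet as wn

def synonym_candidates(word):
    candidates = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            candidates.add(lemma.name().replace('_', ' '))
    candidates.discard(word)
    return sorted(candidates)

print(synonym_candidates('big'))
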
37
What We'll Do in this Course
  • Read research papers and tutorials
  • Use NLTK-lite (Natural Language ToolKit) to try
    out various algorithms
  • Some homework assignments will be NLTK exercises
  • We'll do some of this in class
  • Adopt a large text collection
  • Use a wide range of NLP techniques to process it
  • Work together to build a useful resource.
  • Final project
  • Either extend work on the collection we've been
    using, or choose from some suggestions I provide.
  • Your own idea only if I think it is very likely
    to work well.

38
Assignment for Thursday
  • Load Python 2.4.3 and NLTK-lite onto your
    computers
  • Read Chapter 1 of Jurafsky & Martin
  • Read NLTK-lite tutorial sections 2.1-2.4

39
Python
  • A terrific programming language
  • Interpreted
  • Object-oriented
  • Easy to interface to other things (web, DBMS, TK)
  • Good stuff from Java, Lisp, Tcl, Perl
  • Easy to learn
  • I learned it this summer by reading Learning
    Python
  • FUN!
  • Assignment for Thursday
  • Load Python 2.4.3 and NLTK-lite onto your
    computers
  • Read Chapter 1 of Jurafsky & Martin
  • Read NLTK-lite tutorial Chapter 2, sections
    2.1-2.4

40
Questions?