SIMS 290-2: Applied Natural Language Processing

Transcript and Presenter's Notes

Title: SIMS 290-2: Applied Natural Language Processing


1
SIMS 290-2: Applied Natural Language Processing
Marti Hearst, August 30, 2004
2
Today
  • Motivation: SIMS student projects
  • Course Goals
  • Why NLP is difficult
  • How to solve it? Corpus-based statistical
    approaches
  • What we'll do in this course

3
ANLP Motivation: SIMS Masters Projects
  • Breaking Story (2002)
  • Summarize trends in news feeds
  • Needs categories and entities assigned to all
    news articles
  • http://dream.sims.berkeley.edu/newshound/
  • BriefBank (2002)
  • System for entering legal briefs
  • Needs a topic category system for browsing
  • http://briefbank.samuelsonclinic.org/
  • Chronkite (2003)
  • Personalized RSS feeds
  • Needs categories and entities assigned to all web
    pages
  • Paparrazi (2004)
  • Analysis of blog activity
  • Needs categories assigned to blog content

4-9
(No Transcript)
10
Goals of this Course
  • Learn about the problems and possibilities of
    natural language analysis
  • What are the major issues?
  • What are the major solutions?
  • How well do they work?
  • How do they work? (but to a lesser extent than
    CS 295-4)
  • At the end you should:
  • Agree that language is subtle and interesting!
  • Feel some ownership over the algorithms
  • Be able to assess NLP problems
  • Know which solutions to apply when, and how
  • Be able to read papers in the field

11
Today
  • Motivation: SIMS student projects
  • Course Goals
  • Why NLP is difficult
  • How to solve it? Corpus-based statistical
    approaches
  • What we'll do in this course

12
We've passed the year 2001, but we are not close to
realizing the dream (or nightmare)
13
  • Dave Bowman: Open the pod bay doors, HAL.

HAL 9000: I'm sorry, Dave. I'm afraid I can't do
that.
14
Why is NLP difficult?
  • Computers are not brains
  • There is evidence that much of language
    understanding is built-in to the human brain
  • Computers do not socialize
  • Much of language is about communicating with
    people
  • Key problems:
  • Representation of meaning
  • Language presupposes knowledge about the world
  • Language only reflects the surface of meaning
  • Language presupposes communication between people

15
Hidden Structure
  • English plural pronunciation (sketched in code
    below)
  • Toy + s → toyz (add z)
  • Book + s → books (add s)
  • Church + s → churchiz (add iz)
  • Box + s → boxiz (add iz)
  • Sheep + s → sheep (add nothing)
  • What about new words?
  • Bach + s → Bachs (why not Bachiz?)
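The rule on this slide operates on sounds, not letters. As a rough illustration only, here is a minimal Python sketch that uses spelling as a crude stand-in for the final sound; the ending lists and the irregular lexicon are illustrative assumptions, not a real phonological model.

```python
# Rough sketch of the plural-pronunciation pattern above, keyed on spelling
# as a crude stand-in for the word's final sound. The ending lists and the
# irregular lexicon are illustrative only.
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")  # -> add "iz"
VOICELESS_ENDINGS = ("p", "t", "k", "f")        # -> add "s"
IRREGULARS = {"sheep": "sheep"}                 # memorized forms win

def plural(word):
    if word in IRREGULARS:
        return IRREGULARS[word]
    if word.endswith(SIBILANT_ENDINGS):
        return word + "iz"
    if word.endswith(VOICELESS_ENDINGS):
        return word + "s"
    return word + "z"                           # default: voiced ending

for w in ["toy", "book", "church", "box", "sheep"]:
    print(w, "->", plural(w))   # toyz, books, churchiz, boxiz, sheep
```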

16
Language subtleties
  • Adjective order and placement
  • A big black dog
  • A big black scary dog
  • A big scary dog
  • A scary big dog
  • A black big dog
  • Antonyms
  • Which sizes go together?
  • Big and little
  • Big and small
  • Large and small
  • Large and little

17
World Knowledge is subtle
  • He arrived at the lecture.
  • He chuckled at the lecture.
  • He arrived drunk.
  • He chuckled drunk.
  • He chuckled his way through the lecture.
  • He arrived his way through the lecture.

18
Words are ambiguous (have multiple meanings)
  • I know that.
  • I know that block.
  • I know that blocks the sun.
  • I know that block blocks the sun.

19
Headline Ambiguity
  • Iraqi Head Seeks Arms
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • Kids Make Nutritious Snacks
  • British Left Waffles on Falkland Islands
  • Red Tape Holds Up New Bridges
  • Bush Wins on Budget, but More Lies Ahead
  • Hospitals are Sued by 7 Foot Doctors

20
The Role of Memorization
  • Children learn words quickly
  • As many as 9 words/day
  • Often only need one exposure to associate meaning
    with word
  • Can make mistakes, e.g., overgeneralization
  • I goed to the store.
  • Exactly how they do this is still under study

21
The Role of Memorization
  • Dogs can do word association too!
  • Rico, a border collie in Germany
  • Knows the names of each of 100 toys
  • Can retrieve items called out to him with over
    90% accuracy.
  • Can also learn and remember the names of
    unfamiliar toys after just one encounter, putting
    him on a par with a three-year-old child.

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
22
But there is too much to memorize!
  • establish
  • establishment (the Church of England as the
    official state church)
  • disestablishment
  • antidisestablishment
  • antidisestablishmentarian
  • antidisestablishmentarianism (a political
    philosophy that is opposed to the separation of
    church and state)

23
Rules and Memorization
  • Current thinking in psycholinguistics is that we
    use a combination of rules and memorization
  • However, this is very controversial
  • Mechanism (sketched in code below):
  • If there is an applicable rule, apply it
  • However, if there is a memorized version, that
    takes precedence. (Important for irregular
    words.)
  • Artists paint still lifes
  • Not still lives
  • Past tense of:
  • think → thought
  • blink → blinked
  • This is a simplification; for more on this, see
    Pinker's Words and Rules and The Language
    Instinct.
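As a minimal sketch of the rule-plus-memorization mechanism described above (the tiny irregular lexicon here is just an example):

```python
# Minimal sketch of the mechanism above: a memorized (irregular) form takes
# precedence; otherwise the regular rule applies. The lexicon is illustrative.
IRREGULAR_PAST = {"think": "thought", "go": "went", "sing": "sang"}

def past_tense(verb):
    if verb in IRREGULAR_PAST:          # memorized version wins
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):              # regular rule: add -ed
        return verb + "d"
    return verb + "ed"

print(past_tense("think"))   # thought (memorized)
print(past_tense("blink"))   # blinked (rule)
```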

24
Representation of Meaning
  • I know that block blocks the sun.
  • How do we represent the meanings of block?
  • How do we represent I know?
  • How does that differ from "I know that."?
  • Who is "I"?
  • How do we indicate that we are talking about
    Earth's sun vs. some other planet's sun?
  • When did this take place? What if I move the
    block? What if I move my viewpoint? How do we
    represent this?

25
How to tackle these problems?
  • The field was stuck for quite some time.
  • A new approach started around 1990
  • Well, not really new, but the first time around,
    in the '50s, they didn't have the text, disk
    space, or GHz
  • Main idea: combine memorizing and rules
  • How to do it (see the sketch below):
  • Get large text collections (corpora)
  • Compute statistics over the words in those
    collections
  • Surprisingly effective
  • Even better now with the Web
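A minimal sketch of "get a corpus, compute statistics over its words", assuming NLTK is installed and the Brown corpus has been downloaded; any large text collection would work the same way.

```python
# Minimal sketch of "compute statistics over the words in a collection",
# assuming NLTK is installed and the Brown corpus has been fetched with
# nltk.download('brown').
from collections import Counter
from nltk.corpus import brown

counts = Counter(w.lower() for w in brown.words())
print(counts.most_common(10))      # the most frequent word types
print(counts["church"])            # frequency of one particular word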

26
Corpus-based Example: Pre-Nominal Adjective Ordering
  • Important for translation and generation
  • Examples
  • big fat Greek wedding
  • fat Greek big wedding
  • Some approaches try to characterize this with
    semantic rules, e.g.
  • age < color, value < dimension
  • Data-intensive approaches
  • Assume adjective ordering is independent of the
    noun being modified
  • Compare how often you see <a, b> vs. <b, a>

Keller & Lapata, The Web as Baseline, HLT-NAACL '04
27
Corpus-based Example: Pre-Nominal Adjective Ordering
  • Data-intensive approaches
  • Compare how often you see <a, b> vs. <b, a>
    (toy sketch below)
  • What happens when you encounter an unseen pair?
  • Shaw and Hatzivassiloglou 99 use transitive
    closures
  • Malouf 00 uses a back-off bigram model
  • P(<a,b> | a,b) vs. P(<b,a> | a,b)
  • He also uses morphological analysis, semantic
    similarity calculations and positional
    probabilities
  • Keller and Lapata 04 use just the very simple
    algorithm
  • But they use the web as their training set
  • Gets 90% accuracy on 1000 sequences
  • As good as or better than the complex algorithms

Keller & Lapata, The Web as Baseline, HLT-NAACL '04
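A toy sketch of the count-comparison idea from these two slides. The miniature "corpus" of observed adjective pairs is invented for illustration; the cited work uses web counts, transitive closure, or a back-off bigram model rather than this exact code.

```python
# Toy sketch of the count-comparison idea: order two adjectives by whichever
# sequence is observed more often. The observed pairs below are made up;
# Keller & Lapata use web counts instead of a toy list.
from collections import Counter

observed = [("big", "fat"), ("big", "fat"), ("big", "black"),
            ("black", "scary"), ("fat", "greek")]
pair_counts = Counter(observed)

def order(a, b):
    # Unseen pairs (both counts zero) just keep the given order here; the
    # cited papers back off to bigrams or transitive closure instead.
    if pair_counts[(a, b)] >= pair_counts[(b, a)]:
        return (a, b)
    return (b, a)

print(order("fat", "big"))     # ('big', 'fat')
print(order("black", "big"))   # ('big', 'black')
```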
28
Real-World Applications of NLP
  • Spelling Suggestions/Corrections
  • Grammar Checking
  • Synonym Generation
  • Information Extraction
  • Text Categorization
  • Automated Customer Service
  • Speech Recognition (limited)
  • Machine Translation
  • In the (near?) future
  • Question Answering
  • Improving Web Search Engine results
  • Automated Metadata Assignment
  • Online Dialogs

29
NLP in the Real World
  • Synonym generation (toy sketch below) for:
  • Suggesting advertising keywords
  • Suggesting search result refinement and expansion
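As an illustration only: one cheap way to produce candidate synonyms is a WordNet lookup through NLTK (this assumes a current NLTK with the WordNet data downloaded). This is not the method behind the commercial systems shown on the following slides.

```python
# Illustration only: candidate synonyms via a WordNet lookup in NLTK
# (assumes nltk.download('wordnet') has been run). The systems pictured on
# the following slides presumably rely on corpus statistics instead.
from nltk.corpus import wordnet as wn

def candidate_synonyms(word):
    names = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " "))
    names.discard(word)
    return sorted(names)

print(candidate_synonyms("car"))   # e.g. ['auto', 'automobile', ...]
```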

30-33
Synonym Generation
34
What We'll Do in this Course
  • Read research papers and tutorials
  • Use NLTK (the Natural Language Toolkit) to try
    out various algorithms (a brief example follows
    below)
  • Some homework assignments will be NLTK exercises
  • Three mini-projects
  • Two involve a selected collection
  • The third is your choice; it can also be on the
    selected collection
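A minimal taste of the kind of thing NLTK makes easy, assuming a current install with the tokenizer and tagger models downloaded (note that today's NLTK API differs from the 2004-era version used in the course):

```python
# Minimal NLTK example, assuming a current install with the needed models
# (nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')).
import nltk

tokens = nltk.word_tokenize("Iraqi Head Seeks Arms")
print(nltk.pos_tag(tokens))   # part-of-speech tags for each token
```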

35
What We'll Do in this Course
  • Adopt a large text collection
  • Use a wide range of NLP techniques to process it
  • Release the results for others to use

36
Which Text Collection?
37
How to analyze a big collection?
  • Your ideas go here

38
Python
  • A terrific language
  • Interpreted
  • Object-oriented
  • Easy to interface to other things (web, DBMS, Tk)
  • Good stuff from Java, Lisp, Tcl, Perl
  • Easy to learn
  • I learned it this summer by reading Learning
    Python
  • FUN!

39
Questions?