Title: SIMS 290-2: Applied Natural Language Processing
1. SIMS 290-2: Applied Natural Language Processing
Marti Hearst, August 30, 2004
2. Today
- Motivation: SIMS student projects
- Course Goals
- Why NLP is difficult
- How to solve it? Corpus-based statistical approaches
- What we'll do in this course
3. ANLP Motivation: SIMS Masters Projects
- Breaking Story (2002)
  - Summarize trends in news feeds
  - Needs categories and entities assigned to all news articles
  - http://dream.sims.berkeley.edu/newshound/
- BriefBank (2002)
  - System for entering legal briefs
  - Needs a topic category system for browsing
  - http://briefbank.samuelsonclinic.org/
- Chronkite (2003)
  - Personalized RSS feeds
  - Needs categories and entities assigned to all web pages
- Paparrazi (2004)
  - Analysis of blog activity
  - Needs categories assigned to blog content
10. Goals of this Course
- Learn about the problems and possibilities of natural language analysis
  - What are the major issues?
  - What are the major solutions?
  - How well do they work?
  - How do they work? (but to a lesser extent than CS 295-4)
- At the end you should:
  - Agree that language is subtle and interesting!
  - Feel some ownership over the algorithms
  - Be able to assess NLP problems
  - Know which solutions to apply when, and how
  - Be able to read papers in the field
11. Today
- Motivation: SIMS student projects
- Course Goals
- Why NLP is difficult
- How to solve it? Corpus-based statistical approaches
- What we'll do in this course
12. We've passed the year 2001, but we are not close to realizing the dream (or nightmare)
13.
- Dave Bowman: "Open the pod bay doors, HAL."
- HAL 9000: "I'm sorry, Dave. I'm afraid I can't do that."
14. Why is NLP difficult?
- Computers are not brains
  - There is evidence that much of language understanding is built in to the human brain
- Computers do not socialize
  - Much of language is about communicating with people
- Key problems:
  - Representation of meaning
  - Language presupposes knowledge about the world
  - Language only reflects the surface of meaning
  - Language presupposes communication between people
15. Hidden Structure
- English plural pronunciation
  - Toy + s → toyz (add z)
  - Book + s → books (add s)
  - Church + s → churchiz (add iz)
  - Box + s → boxiz (add iz)
  - Sheep + s → sheep (add nothing)
- What about new words?
  - Bach + s → Bachs; why not Bachiz?
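The plural rules above can be sketched as a small function. This is a rough spelling-based stand-in for real phonology; the irregular list and the suffix tests are illustrative, not a complete model:

```python
def plural_suffix(word):
    """Guess the pronunciation of the English plural suffix from spelling."""
    if word in {"sheep", "deer", "fish"}:            # memorized irregulars: add nothing
        return ""
    if word.endswith(("s", "sh", "ch", "x", "z")):   # sibilant-like endings: add iz
        return "iz"
    if word.endswith(("p", "t", "k", "f")):          # voiceless consonants: add s
        return "s"
    return "z"                                       # default: add z

print(plural_suffix("toy"))     # z
print(plural_suffix("book"))    # s
print(plural_suffix("church"))  # iz
print(plural_suffix("sheep"))   # (nothing)
```

Note that a spelling rule like this predicts "iz" for a new word like "Bach" (it ends in "ch"), while speakers actually say "Bachs": the rules operate on sound, not spelling, and memorized pronunciations matter.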
16. Language subtleties
- Adjective order and placement
- A big black dog
- A big black scary dog
- A big scary dog
- A scary big dog
- A black big dog
- Antonyms
- Which sizes go together?
- Big and little
- Big and small
- Large and small
- Large and little
17. World Knowledge is subtle
- He arrived at the lecture.
- He chuckled at the lecture.
- He arrived drunk.
- He chuckled drunk.
- He chuckled his way through the lecture.
- He arrived his way through the lecture.
18. Words are ambiguous (have multiple meanings)
- I know that.
- I know that block.
- I know that blocks the sun.
- I know that block blocks the sun.
19. Headline Ambiguity
- Iraqi Head Seeks Arms
- Juvenile Court to Try Shooting Defendant
- Teacher Strikes Idle Kids
- Kids Make Nutritious Snacks
- British Left Waffles on Falkland Islands
- Red Tape Holds Up New Bridges
- Bush Wins on Budget, but More Lies Ahead
- Hospitals are Sued by 7 Foot Doctors
20. The Role of Memorization
- Children learn words quickly
  - As many as 9 words/day
  - Often only need one exposure to associate meaning with word
- Can make mistakes, e.g., overgeneralization
  - "I goed to the store."
- Exactly how they do this is still under study
21. The Role of Memorization
- Dogs can do word association too!
  - Rico, a border collie in Germany
  - Knows the names of each of 100 toys
  - Can retrieve items called out to him with over 90% accuracy
  - Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child.
http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
22. But there is too much to memorize!
- establish
- establishment
  - the Church of England as the official state church
- disestablishment
- antidisestablishment
- antidisestablishmentarian
- antidisestablishmentarianism
  - is a political philosophy that is opposed to the separation of church and state
23. Rules and Memorization
- Current thinking in psycholinguistics is that we use a combination of rules and memorization
  - However, this is very controversial
- Mechanism:
  - If there is an applicable rule, apply it
  - However, if there is a memorized version, that takes precedence. (Important for irregular words.)
    - Artists paint "still lifes", not "still lives"
  - Past tense of:
    - think → thought
    - blink → blinked
- This is a simplification; for more on this, see Pinker's "Words and Rules" and "The Language Instinct".
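The mechanism on this slide can be sketched directly: look up a memorized form first, and fall back to the rule only if none exists. The irregular table below is a tiny illustrative sample:

```python
# Memorized irregular past-tense forms take precedence over the rule.
IRREGULAR_PAST = {"think": "thought", "go": "went", "sing": "sang"}

def past_tense(verb):
    # Step 1: if a memorized version exists, it wins.
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    # Step 2: otherwise apply the default rule: add -ed (or -d after e).
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"

print(past_tense("think"))  # thought (memorized)
print(past_tense("blink"))  # blinked (rule)
```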
24. Representation of Meaning
- I know that block blocks the sun.
  - How do we represent the meanings of "block"?
  - How do we represent "I know"?
  - How does that differ from "I know that."?
  - Who is "I"?
  - How do we indicate that we are talking about Earth's sun vs. some other planet's sun?
  - When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?
25. How to tackle these problems?
- The field was stuck for quite some time.
- A new approach started around 1990
  - Well, not really new, but the first time around, in the 50s, they didn't have the text, disk space, or GHz
- Main idea: combine memorizing and rules
- How to do it:
  - Get large text collections (corpora)
  - Compute statistics over the words in those collections
- Surprisingly effective
  - Even better now with the Web
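"Compute statistics over the words" can be as simple as counting word and word-pair frequencies. A minimal sketch on a toy corpus (a real application would use a large collection, or the Web):

```python
from collections import Counter

# Toy stand-in for a large text collection.
corpus = "the big black dog saw the big scary dog".split()

# Unigram counts: how often each word occurs.
unigrams = Counter(corpus)
# Bigram counts: how often each adjacent word pair occurs.
bigrams = Counter(zip(corpus, corpus[1:]))

print(unigrams["dog"])            # 2
print(bigrams[("big", "black")])  # 1
```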
26. Corpus-based Example: Pre-Nominal Adjective Ordering
- Important for translation and generation
- Examples:
  - big fat Greek wedding
  - fat Greek big wedding
- Some approaches try to characterize this as semantic rules, e.g.:
  - Age < color, value < dimension
- Data-intensive approaches:
  - Assume adjective ordering is independent of the noun they modify
  - Compare how often you see <a, b> vs. <b, a>

Keller & Lapata, "The Web as a Baseline", HLT-NAACL '04
27. Corpus-based Example: Pre-Nominal Adjective Ordering
- Data-intensive approaches:
  - Compare how often you see <a, b> vs. <b, a>
  - What happens when you encounter an unseen pair?
- Shaw and Hatzivassiloglou '99 use transitive closures
- Malouf '00 uses a back-off bigram model
  - P(<a,b> | {a,b}) vs. P(<b,a> | {a,b})
  - He also uses morphological analysis, semantic similarity calculations, and positional probabilities
- Keller and Lapata '04 use just the very simple algorithm
  - But they use the Web as their training set
  - Gets 90% accuracy on 1000 sequences
  - As good as or better than the complex algorithms

Keller & Lapata, "The Web as a Baseline", HLT-NAACL '04
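The simple count-comparison approach above can be sketched as follows. The counts here are hypothetical stand-ins; Keller and Lapata obtained theirs from web search hit counts, and unseen pairs would need a back-off model as on this slide:

```python
# Hypothetical corpus counts for adjective sequences <a, b>.
PAIR_COUNTS = {
    ("big", "black"): 120,
    ("black", "big"): 3,
    ("big", "fat"): 90,
    ("fat", "big"): 5,
}

def order_adjectives(a, b):
    """Order two adjectives as whichever sequence is more frequent."""
    # Prefer the order seen more often; default to (a, b) on a tie
    # or an unseen pair (where a back-off model would take over).
    if PAIR_COUNTS.get((b, a), 0) > PAIR_COUNTS.get((a, b), 0):
        return (b, a)
    return (a, b)

print(order_adjectives("black", "big"))  # ('big', 'black')
```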
28. Real-World Applications of NLP
- Spelling Suggestions/Corrections
- Grammar Checking
- Synonym Generation
- Information Extraction
- Text Categorization
- Automated Customer Service
- Speech Recognition (limited)
- Machine Translation
- In the (near?) future
- Question Answering
- Improving Web Search Engine results
- Automated Metadata Assignment
- Online Dialogs
29. NLP in the Real World
- Synonym generation for
- Suggesting advertising keywords
- Suggesting search result refinement and expansion
30. Synonym Generation
34. What We'll Do in this Course
- Read research papers and tutorials
- Use NLTK (Natural Language ToolKit) to try out various algorithms
  - Some homeworks will be to do some NLTK exercises
- Three mini-projects
  - Two involve a selected collection
  - The third is your choice; it can also be on the selected collection
35. What We'll Do in this Course
- Adopt a large text collection
- Use a wide range of NLP techniques to process it
- Release the results for others to use
36. Which Text Collection?
37. How to analyze a big collection?
38. Python
- A terrific language
  - Interpreted
  - Object-oriented
  - Easy to interface to other things (web, DBMS, Tk)
  - Good stuff from Java, Lisp, Tcl, Perl
- Easy to learn
  - I learned it this summer by reading "Learning Python"
- FUN!
39. Questions?