Introduction to the course Day 1 - PowerPoint PPT Presentation

1
Introduction to the course: Day 1
  • LING 681.02
  • Computational Linguistics
  • Harry Howard
  • Tulane University

2
Objectives
  • This course shows you how to make a computer
    perform various useful tasks with natural
    language.
  • Through it you'll learn
  • some linguistics,
  • some algorithms,
  • some statistics,
  • and some computer programming in Python.
  • I do not require that you know anything in
    particular about these areas beforehand.

3
Objectives, cont.
  • Hopefully you'll finish the semester with an
    appreciation for the intricacies of modeling
    human languages,
  • plus some practical knowledge about solving
    linguistic problems, such as techniques for
  • filtering junk email,
  • automatically discovering different meanings of
    the word "run",
  • efficiently encoding spelling rules,
  • tagging words according to their part of speech,
  • parsing English sentences,
  • and automatically translating from one language
    to another,
  • among other things.

4
Objectives, cont.
  • Our work will be a combination of learning new
    algorithms, discussing linguistics, and
    programming useful systems that operate on real
    data.
  • It is great training if you are interested in
    doing natural language processing work in
    industry, either in a research lab or in a
    startup.

5
Why should you care?
  • Trends
  • An enormous amount of information is now
    available in machine readable form as natural
    language text.
  • Conversational agents (automated voices that
    answer the phone) are becoming an important form
    of human-computer communication.
  • Much of human-human communication is now mediated
    by computers.

6
Commercial world
Lots of exciting stuff is going on
Powerset
7
Google Translate
8
Google Translate
9
Web Q/A
10
Weblog analytics
  • Data-mining of Weblogs, discussion forums,
    message boards, user groups, and other forms of
    user generated media
  • Product marketing information
  • Political opinion tracking
  • Social network analysis
  • Buzz analysis (what's hot, what topics people are
    talking about right now).

11
Web analytics
12
Intended audience
  • Students of
  • linguistics,
  • cognitive science,
  • psychology,
  • neuroscience,
  • mathematics,
  • and any other discipline with an interest in how
    to process natural language by computer.

13
Outcomes
  • For you to demonstrate how well you have attained
    the objectives, you will perform the following
    tasks
  • Take a quiz or turn in a project almost every
    week, usually on Monday: (11 - 1) × 7.5% = 75%
  • No quiz/project can be accepted late.
  • Even though these look like a lot of small
    grades, missing just one lowers your final grade
    almost an entire letter, as an unfortunate few of
    my students have found out the hard way.
  • If you know ahead of time that you will miss a
    quiz/project, send me an e-mail beforehand, and I
    will excuse you with no penalty.
  • Present a final project to the class on the final
    exam day (Dec 14) and turn in a report of your
    project within two days: 25%
  • This may be a group effort, but the entire group
    will receive the same grade.

14
Participation
  • Note that there is no credit for class
    participation,
  • but I will change any high Y- into a low X if I
    notice you participating in class.

15
Prerequisites
  • There aren't any.
  • I do not take anything for granted and so will
    explain all background information, or at least
    suggest sources where you can find it on your own.

16
Code of academic integrity
  • The integrity of Newcomb-Tulane College is based
    on the absolute honesty of the entire community
    in all academic endeavors. As part of the Tulane
    University community, students have certain
    responsibilities regarding work that forms the
    basis for the evaluation of their academic
    achievement. Students are expected to be familiar
    with these responsibilities at all times. No
    member of the university community should
    tolerate any form of academic dishonesty, because
    the scholarly community of the university depends
    on the willingness of both instructors and
    students to uphold the Code of Academic Conduct.
    When a violation of the Code of Academic Conduct
    is observed it is the duty of every member of the
    academic community who has evidence of the
    violation to take action. Students should take
    steps to uphold the code by reporting any
    suspected offense to the instructor or the
    associate dean of the college. Students should
    under no circumstances tolerate any form of
    academic dishonesty.
  • For further information, point your browser at
    http://college.tulane.edu/code.htm.

17
Students with disabilities
  • Students with disabilities who need academic
    accommodation should
  • Contact and register with the Office of
    Disability Services (ODS). For more information,
    visit the ODS website at
    http://www.erc.tulane.edu/studentindex.html.
  • Bring official notice to me from the ODS
    indicating that you need academic accommodation.
    This should be done within the first week of
    class.

18
Electronic communications
  • http://www.tulane.edu/ling/NLP/
  • I will send you e-mail on a regular basis; you
    must check your e-mail on a regular basis!
  • If you want to use a non-Tulane address, e-mail
    me a message to that effect from the address.
  • I will record and podcast every class.
  • The mp3 files will be available on the course
    website above.
  • I will post my PowerPoint presentation to the
    course website after every class.

19
The textbooks
  • Speech and Language Processing, 2e (2008), by
    Daniel Jurafsky and James H. Martin (SLP)
  • Natural Language Processing with Python, 1e
    (2009), by Steven Bird, Ewan Klein, and Edward
    Loper (NLPP)
  • Free at http://www.nltk.org/book

20
Natural Language Toolkit
  • The choice of Python as programming language for
    the course was motivated by the availability of
    the excellent tools in the Natural Language
    Toolkit (http://www.nltk.org/), which are
    programmed in Python,
  • as well as by the just-published textbook that
    goes with it.
  • The authors of the NLTK chose Python, in turn,
    for the ease with which it lets you create NLP
    applications.

21
Python for beginners 1
  • Python for Software Design: How to Think Like a
    Computer Scientist. $39
  • Think Python: How to Think Like a Computer
    Scientist. Free from
    http://www.greenteapress.com/thinkpython/thinkpython.html

22
Python for beginners, cont.
  • 3e, $35.
  • 4e released Oct 2

23
Python for those who know another language
  • http://software-carpentry.org/
  • Everyone should read "Introduction".

24
If you really want to use Perl (and Prolog)
  • An Introduction to Language Processing with Perl
    and Prolog: An Outline of Theories,
    Implementation, and Application with Special
    Consideration of English, French, and German
    (2006) by Pierre Nugues
  • Free from the library through SpringerLink:
    http://www.springerlink.com.libproxy.tulane.edu:2048/content/m34655/?p0440574d897b4a709b8cb0af2352f96api0
  • But see NLPP Preface.

25
Schedule of readings and class preparation
  • You should come to class having completed the
    assignment for that day listed in the schedule.
  • We will spend the class going over the exercises
    in the assignment, answering questions that may
    have come up in the readings, and perhaps doing
    new exercises.
  • We will cover about 15 pages a day in SLP, plus a
    varying number of exercises.
  • This could take a considerable amount of time.

26
Computers
  • If you have a laptop, you will probably want to
    bring it to class.

27
Final exam day (Mon, Dec 14)
  • There is no final exam, but you must present your
    final project to the class on the final exam day.
  • You CANNOT leave town before then!
  • Tell your parents NOW!
  • You are hereby warned.
  • Do not tell me at the end of the semester that
    your parents bought you a ticket home without
    knowing.

28
Aesthetics
  • I know my slides are ugly and boring, but that is
    so that they print accurately.
  • They only use two fonts, Arial Bold and Times New
    Roman.

29
Contact
  • Prof. Harry Howard
  • howard at tulane dot edu
  • 862-3417 (voice mail 24 hours a day)
  • Newcomb Hall 322-D
  • Office hours: T 3-5, W 4-5, and by appointment
    (the link goes to my home page, which displays my
    Google calendar)

30
Questions?
  • ?

31
What is the name of this course?
32
Speech and Language Processing
  • 1. Introduction

33
Psychology
  • It should be noted that much recent research uses
    psychologically and even neurologically plausible
    algorithms for learning patterns from natural
    language texts,
  • so that we will emphasize the acquisition of
    linguistic knowledge, temporal processing, and
    the relation between perception and
    grammar/vocabulary.

34
Natural Language Processing
  • We're going to study what goes into getting
    computers to perform useful and interesting tasks
    involving human languages.
  • We are also concerned with the insights that such
    computational work gives us into human processing
    of language.

35
Major topics of book
  1. Words (chapters 2-6)
  2. Speech (chapters 7-11)
  3. Syntax (chapters 12-16)
  4. Meaning (chapters 17-20)
  5. Discourse (chapter 21)
  6. Applications exploiting each (chapters 22-25)

36
Applications
  • First, what makes an application a language
    processing application (as opposed to any other
    piece of software)?
  • An application that requires the use of knowledge
    about human language
  • Example: Is Unix wc (word count) an example of a
    language processing application?

37
Applications
  • Word count?
  • When it counts words: Yes
  • To count words you need to know what a word is.
    That's knowledge of language.
  • When it counts lines and bytes: No
  • Lines and bytes are computer artifacts, not
    linguistic entities.
38
Big applications
  • Question answering
  • Conversational agents
  • Summarization
  • Machine translation

39
Big applications
  • These kinds of applications require a tremendous
    amount of knowledge of language.
  • Consider the interaction with HAL the computer
    from 2001: A Space Odyssey, on the next slide.

40
HAL from 2001
  • Dave: Open the pod bay doors, HAL.
  • HAL: I'm sorry, Dave, I'm afraid I can't do that.

41
What's needed?
  • Speech recognition and synthesis
  • Knowledge of the English words spoken
  • What they mean
  • How groups of words clump
  • What the clumps mean
  • Dialog
  • It is polite to respond, even if you're planning
    to kill someone.
  • It is polite to pretend to want to be cooperative
    ("I'm afraid", "I can't")

42
Caveat
  • NLP has an AI aspect to it.
  • We often deal with ill-defined problems.
  • We don't often come up with exact
    solutions/algorithms.
  • We can't let either of those facts get in the way
    of making progress.

43
Course material
  • We'll be intermingling discussions of
  • Linguistic topics
  • E.g. Morphology, syntax, discourse structure
  • Formal systems
  • E.g. Regular languages, context-free grammars
  • Applications
  • E.g. Machine translation, information extraction

44
Topics: Linguistics
  • Word-level processing
  • Syntactic processing
  • Lexical and compositional semantics
  • Discourse processing
  • Dialogue structure

45
Topics: Techniques
  • Finite-state methods
  • Context-free methods
  • Augmented grammars
  • Unification
  • Lambda calculus
  • First order logic
  • Probability models
  • Supervised machine learning methods

46
Topics: Applications
  • Stand-alone
  • Enabling applications
  • Funding/Business plans
  • Small
  • Spelling correction
  • Hyphenation
  • Medium
  • Word-sense disambiguation
  • Named entity recognition
  • Information retrieval
  • Large
  • Question answering
  • Conversational agents
  • Machine translation

47
Categories of knowledge
  • Phonology
  • Morphology
  • Syntax
  • Semantics
  • Pragmatics
  • Discourse
  • Each kind of knowledge has associated with it
    an encapsulated set of processes that make use of
    it.
  • Interfaces are defined that allow the various
    levels to communicate.
  • This usually leads to a pipeline architecture.
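
A pipeline of this kind can be sketched as a chain of functions, each consuming the output of the previous one. Everything named here (tokenize, tag, count_readings, TOY_LEXICON) is a hypothetical stand-in invented for this sketch, not part of NLTK or of the textbook.

```python
def tokenize(text):
    """Word-level stage: split raw text into lowercase tokens."""
    return text.lower().rstrip(".").split()

# Toy lexicon (an assumption): the possible categories of each word.
TOY_LEXICON = {
    "i": ["PRON"],
    "made": ["VERB"],
    "her": ["PRON", "DET"],
    "duck": ["NOUN", "VERB"],
}

def tag(tokens):
    """Lexical stage: pair each token with its possible categories."""
    return [(tok, TOY_LEXICON.get(tok, ["UNK"])) for tok in tokens]

def count_readings(tagged):
    """Toy downstream stage: count the category sequences still possible."""
    total = 1
    for _, tags in tagged:
        total *= len(tags)
    return total

def pipeline(text, stages):
    """Run the stages in order; each stage's output is the next one's input."""
    data = text
    for stage in stages:
        data = stage(data)
    return data

print(pipeline("I made her duck.", [tokenize, tag, count_readings]))
```

Each stage only sees the interface of its neighbor (a list of tokens, a list of token/category pairs), which is what makes the levels encapsulated.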

48
Ambiguity
  • Computational linguists are obsessed with
    ambiguity
  • Ambiguity is a fundamental problem of
    computational linguistics
  • Resolving ambiguity is a crucial goal

49
Ambiguity
  • Find at least 5 meanings of this sentence:
  • I made her duck

50
Ambiguity
  • Find at least 5 meanings of this sentence:
  • I made her duck
  • I cooked waterfowl for her benefit (to eat)
  • I cooked waterfowl belonging to her
  • I created the (plaster?) duck she owns
  • I caused her to quickly lower her head or body
  • I waved my magic wand and turned her into
    undifferentiated waterfowl

51
Ambiguity is pervasive
  • I caused her to quickly lower her head or body
  • Lexical category: "duck" can be a N or V
  • I cooked waterfowl belonging to her.
  • Lexical category: "her" can be a possessive ("of
    her") or dative ("for her") pronoun
  • I made the (plaster) duck statue she owns
  • Lexical semantics: "make" can mean "create" or
    "cook"

52
Ambiguity is pervasive
  • Grammar: "make" can be
  • Transitive (verb has a noun direct object)
  • I cooked waterfowl belonging to her
  • Ditransitive (verb has 2 noun objects)
  • I made her (into) undifferentiated waterfowl
  • Action-transitive (verb has a direct object and
    another verb)
  • I caused her to move her body
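
To see the three verb frames fall out mechanically, here is a minimal sketch that counts the parses of "I made her duck" with a CKY-style chart. The grammar, the category names (V2, SC, VB), and the lexicon are toy assumptions written for this example; they separate only the three syntactic frames above, not the "cook" versus "create" senses of "make".

```python
from collections import defaultdict

# Toy lexicon (an assumption): each word with its possible categories.
LEXICON = {
    "i": ["NP"],
    "made": ["V"],
    "her": ["NP", "Det"],        # object pronoun or possessive
    "duck": ["N", "NP", "VB"],   # noun, bare noun phrase, or bare verb
}

# Toy binary rules (an assumption), keyed by the pair of child categories.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP", "V2"],   # transitive VP; V2 = verb plus first object
    ("V2", "NP"): ["VP"],        # ditransitive: [made her] [duck]
    ("V", "SC"): ["VP"],         # verb plus clause: made [her duck]
    ("NP", "VB"): ["SC"],        # small clause: her + duck (verb)
    ("Det", "N"): ["NP"],        # her + duck (noun)
}

def count_parses(words, start="S"):
    """Count the distinct parse trees of words rooted in start."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[i, i + 1][cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, n_left in list(chart[i, k].items()):
                    for right, n_right in list(chart[k, j].items()):
                        for parent in BINARY.get((left, right), []):
                            chart[i, j][parent] += n_left * n_right
    return chart[0, n][start]

print(count_parses("i made her duck".split()))
```

With this toy grammar the chart holds one tree per frame: transitive, ditransitive, and verb plus clause.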

53
Ambiguity is pervasive
  • Phonetics!
  • I mate or duck
  • I'm eight or duck
  • Eye maid her duck
  • Aye mate, her duck
  • I maid her duck
  • I'm aid her duck
  • I mate her duck
  • I'm ate her duck
  • I'm ate or duck
  • I mate or duck

54
Dealing with ambiguity
  • Four possible approaches
  • Tightly coupled interaction among processing
    levels: knowledge from other levels can help
    decide among choices at ambiguous levels.
  • Pipeline processing that ignores ambiguity as it
    occurs and hopes that other levels can eliminate
    incorrect structures.

55
Dealing with ambiguity
  • Probabilistic approaches based on making the most
    likely choices
  • Don't do anything; maybe it won't matter
  • We'll leave when the duck is ready to eat.
  • The duck is ready to eat now.
  • Does the duck ambiguity matter with respect to
    whether we can leave?
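
The probabilistic approach can be illustrated with a very small sketch: when a word is ambiguous, pick the category seen most often. The counts below are invented for the example (they do not come from any corpus), and most_likely_tag is a hypothetical helper.

```python
from collections import Counter

# Hypothetical counts of how often each word occurred with each category.
TOY_COUNTS = {
    "duck": Counter({"NOUN": 8, "VERB": 2}),
    "her": Counter({"DET": 6, "PRON": 4}),
}

def most_likely_tag(word):
    """Return the most frequent category of word and its relative frequency."""
    counts = TOY_COUNTS[word]
    tag, n = counts.most_common(1)[0]
    return tag, n / sum(counts.values())

print(most_likely_tag("duck"))
print(most_likely_tag("her"))
```

A real system would estimate such counts from tagged text and would usually condition on the neighboring words as well.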

56
Models and algorithms
  • By models we mean the formalisms that are used to
    capture the various kinds of linguistic knowledge
    we need.
  • Algorithms are then used to manipulate the
    knowledge representations needed to tackle the
    task at hand.

57
Models
  • State machines
  • Rule-based approaches
  • Logical formalisms
  • Probabilistic models

58
Algorithms
  • Many of the algorithms that we'll study will turn
    out to be transducers: algorithms that take one
    kind of structure as input and output another.
  • Unfortunately, ambiguity makes this process
    difficult. This leads us to employ algorithms
    that are designed to handle ambiguity of various
    kinds
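
As a loose illustration of "one kind of structure in, another out", here is a sketch of a single English spelling rule. It is a plain Python function, not a finite-state transducer, and the "^" boundary marker and the function name are assumptions made for this example.

```python
def apply_e_insertion(lexical_form):
    """Map a stem^suffix form such as 'fox^s' to a surface form such as 'foxes'.

    Assumed rule: insert 'e' before the suffix 's' when the stem ends in
    x, s, z, ch, or sh; otherwise just join stem and suffix.
    """
    stem, _, suffix = lexical_form.partition("^")
    if suffix == "s" and stem.endswith(("x", "s", "z", "ch", "sh")):
        return stem + "es"
    return stem + suffix

for form in ["fox^s", "cat^s", "church^s"]:
    print(form, "->", apply_e_insertion(form))
```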

59
Paradigms
  • In particular:
  • State-space search
  • To manage the problem of making choices during
    processing when we lack the information needed to
    make the right choice
  • Dynamic programming
  • To avoid having to redo work during the course of
    a state-space search
  • CKY, Earley, Minimum Edit Distance, Viterbi,
    Baum-Welch
  • Classifiers
  • Machine learning based classifiers that are
    trained to make decisions based on features
    extracted from the local context

60
State space search
  • States represent pairings of partially processed
    inputs with partially constructed
    representations.
  • Goals are inputs paired with completed
    representations that satisfy some criteria.
  • As with most interesting problems the spaces are
    normally too large to exhaustively explore.
  • We need heuristics to guide the search
  • Criteria to trim the space
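
These ideas can be sketched with a toy task: splitting an unspaced string into words. A state pairs the position reached in the input with the words built so far, the goal is a state that has consumed the whole input, the heuristic prefers states with less input left, and positions already expanded are skipped to trim the space. The lexicon and the function name are assumptions made for this example.

```python
import heapq

# Toy lexicon (an assumption for illustration).
LEXICON = {"the", "them", "man", "an", "ran", "a"}

def segment(text):
    """Best-first search for one segmentation of text into lexicon words."""
    # Each frontier entry: (characters left, position, words built so far).
    frontier = [(len(text), 0, [])]
    expanded = set()
    while frontier:
        _, pos, words = heapq.heappop(frontier)
        if pos == len(text):          # goal: all input consumed
            return words
        if pos in expanded:           # trim: this position was already explored
            continue
        expanded.add(pos)
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in LEXICON:
                heapq.heappush(frontier, (len(text) - end, end, words + [piece]))
    return None                       # no segmentation exists

print(segment("themanran"))
```

The search returns the first complete segmentation it reaches, which need not be the one a human reader would choose; the heuristic guides the search but does not resolve the ambiguity.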

61
Dynamic programming
  • Don't do the same work over and over.
  • Avoid this by building and making use of
    solutions to sub-problems that must be invariant
    across all parts of the space.
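
Minimum edit distance, one of the algorithms named on the Paradigms slide, is a compact example. The sketch below assumes unit costs for insertion, deletion, and substitution (other cost schemes are possible); each table cell stores the solution to one sub-problem so that it is computed only once.

```python
def min_edit_distance(source, target):
    """Edit distance with unit costs (an assumption of this sketch)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i                # delete all i characters
    for j in range(cols):
        dist[0][j] = j                # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dist[-1][-1]

print(min_edit_distance("intention", "execution"))
```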

62
Next time
  • Download and install Python and NLTK
  • See info at http://www.nltk.org/ > Download
  • SLP 2.1 Ex 2.1-.2
  • NLPP 1.1-.2 Ex 1.8.1-.5
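
Once both are installed, a quick check along the following lines may help confirm that Python can find NLTK. This is only a sketch using the standard library; nltk_is_installed is a hypothetical helper, and the script runs whether or not NLTK is present.

```python
import importlib.util
import sys

def nltk_is_installed():
    """Return True if Python can locate a package named nltk."""
    return importlib.util.find_spec("nltk") is not None

print("Python version:", sys.version.split()[0])
print("NLTK installed:", nltk_is_installed())
```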