Statistical Natural Language Processing - PowerPoint PPT Presentation

About This Presentation
Title:

Statistical Natural Language Processing

Description:

Special Topics in Computer Science: Statistical Natural Language Processing ... How does Babelfish translate documents from German to English? ... – PowerPoint PPT presentation

Slides: 25
Provided by: VasileiosH9

Transcript and Presenter's Notes


1
Statistical Natural Language Processing
  • Introduction
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Course information
  • CS 6V81 003
  • Special Topics in Computer Science: Statistical
    Natural Language Processing
  • Meets Monday and Wednesday,
    4:00-5:15pm at ECSS 2.201

3
Instructor
  • Vasileios Hatzivassiloglou
  • Associate Professor, Computer Science
  • Founding Professor, Bioengineering
  • Research focus: Discover knowledge from massive
    amounts of raw data
  • data not the same as information
  • information overload

4
Research Interests
  • Text analysis, machine learning, intelligent
    information retrieval, text mining,
    summarization, question answering,
    bioinformatics, medical informatics

5
Contact information
  • Office hours: Monday-Wednesday 5:30-6:30pm and by
    appointment
  • Office location: ECSS 3.406
  • vh_at_hlt.utdallas.edu
  • (972) 883-4342
  • Web page: To be announced
  • Teaching Assistant: TBA

6
Course goals
  • Introduce the field of statistical natural
    language processing (statistical NLP).
  • Describe the main directions, problems, and
    algorithms in the field.
  • Discuss the theoretical foundations for modern
    text analysis, information retrieval, and speech
    understanding.
  • Involve students in hands-on experiments with
    real problems.

7
Teaser
  • How does Google determine which page to rank
    higher among ones with similar content?
  • How does Babelfish translate documents from
    German to English?
  • How does ETS score SAT essays without the
    computer really understanding what is written?

8
Sample problems
  • Extract information from political speeches or
    blogs to find differences between Obama and Bush
    or Clinton
  • Generate text using statistical dependencies that
    looks like it was written by a human
  • Automatically process movie commentary and
    reviews and assign a numerical score
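A minimal sketch of the text-generation problem above, assuming the simplest kind of statistical dependency: each word depends only on the word before it (a bigram model). The toy corpus and the names train_bigrams and generate are illustrative and not part of the course material.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=0):
    """Build a word sequence by repeatedly sampling a follower of the last word."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        candidates = followers.get(output[-1])
        if not candidates:  # dead end: this word was never followed by anything
            break
        output.append(rng.choice(candidates))
    return output

# Toy corpus (an assumption for illustration); real experiments use large collections.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigrams(corpus)
sample = generate(model, "the", 8)
print(" ".join(sample))
```

Because followers are stored with repetition, frequent continuations are sampled more often, so the output mirrors the bigram statistics of the corpus. With a larger corpus and longer contexts the generated text tends to look more fluent.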

9
Intended audience
  • For students with or without prior knowledge of
    natural language processing
  • Focus on large text collections and applications;
    emphasis on text mining
  • Coverage of machine learning background
  • Limited algorithmic analysis / computational
    complexity
  • Medium level of programming

10
Prerequisites
  • None officially
  • You should know:
  • Basic data structures (multidimensional arrays,
    hash tables, binary trees) (CS 5343)
  • Basic concepts of probability theory and random
    variables (CS 3341)
  • One high-level programming language

11
You need not know
  • Machine learning
  • Data mining (in general) or Text mining
  • Text analysis / natural language processing
  • Information retrieval
  • Artificial intelligence

12
Course level
  • Introductory graduate course (MS or
    first/second-year PhD)
  • Maturity in programming and data structures at the
    level of a CS senior
  • Ability to access (and interest in accessing) the
    primary literature in a guided fashion

13
Course structure
  • Lectures by the instructor
  • Overview (4 sessions)
  • Data preparation (2 sessions)
  • Modeling Dependencies (5 sessions)
  • Applications (2 sessions)
  • Lexical Semantics (4 sessions)
  • Hypothesis Testing (3 sessions)
  • Similarity and Clustering (3 sessions)
  • Multi-level Dependencies (2 sessions)

14
Course structure
  • Student presentations of research projects or
    papers (2 sessions)

15
Expected work load
  • 2-3 homework sets given in February and March
  • Two weeks to turn in each homework set
  • Mid-term exam in mid/late March
  • Student project / review of literature starts in
    late March
  • Student presentations of research papers /
    projects in the last week of April
  • Final exam on May 8

16
Student project
  • Each student chooses to either:
  • Partner with another student and work together on
    a short project of their choice (background
    review, implementation, data collection,
    reporting of methods and results)
  • Review a set of related papers from the primary
    literature and summarize them for the class (2-3
    papers chosen by the student with the advice and
    consent of the instructor)

17
Grading
  • Class participation: 20%
  • Homework: 30% (total)
  • Midterm: 10%
  • Project / literature review: 20%
  • Final: 20%

18
Programming
  • Each student selects their own programming
    language (must be accessible to TA and run in the
    UTD environment)
  • Examples: C, C++, Java, Perl, Python

19
Primary textbook
  • "Foundations of Statistical Natural Language
    Processing" by Christopher D. Manning and Hinrich
    Schütze, The MIT Press, Cambridge, Massachusetts,
    May 1999.
  • ISBN 0-262-13360-1
  • 620 pages
  • Available on Amazon.com for $66, Barnes and Noble
    for $82

20
Background reading
  • Statistics: "The Elements of Statistical
    Learning: Data Mining, Inference, and Prediction"
    by Trevor Hastie, Robert Tibshirani, and Jerome
    Friedman, 2001.
  • Data structures and algorithms: "Introduction to
    Algorithms" by Thomas H. Cormen, Charles E.
    Leiserson, Ronald L. Rivest, and Clifford Stein,
    2nd edition, 2001.

21
So what is it all about?
  • Natural language processing is the study and
    implementation of computer methods for
    understanding and generating human language
    (English, French, Urdu, etc.)
  • The discipline of linguistics studies human
    language mechanisms (as opposed to products) in
    general
  • Thus, NLP is often referred to as computational
    linguistics

22
Human language
  • An extremely complex communication system
  • Many symbols (words), many constraints on what
    words can go together, many complications on the
    meaning of combinations of words

23
A comparison with a programming language
  • A programming language has a small inventory of
    words with pre-assigned meaning
  • A programming language has strict rules for
    allowed sequences of words (syntax)
  • The meaning of each word is unique
  • The meaning of a sequence of words is unique
    (compositionality)
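As a small illustration of the last two points, the sketch below uses Python's standard ast module to show that an expression has exactly one parse and one value. The expression itself is an arbitrary example. By contrast, an English sentence such as "I saw the man with the telescope" allows two readings.

```python
import ast

expression = "2 + 3 * 4"

# The grammar fixes a single parse tree: the root is an addition whose
# right operand is the multiplication, i.e. 2 + (3 * 4).
tree = ast.parse(expression, mode="eval")
root = tree.body
print(type(root.op).__name__)        # Add
print(type(root.right.op).__name__)  # Mult

# The meaning of the whole is computed from the meanings of its parts.
value = eval(compile(tree, "<expr>", "eval"))
print(value)  # 14
```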

24
Levels of NL study
  • Word formation from letters (morphology)
  • Construction of longer units obeying class
    constraints (syntax)
  • Interpreting the meaning of words (lexical
    semantics) and longer units (semantics)
  • Complications from the environment and context
    (pragmatics)
  • Acoustic properties (phonetics)