Introduction to Natural Language Processing and Text Mining and The basic building blocks - PowerPoint PPT Presentation

Slides: 28
Provided by: DanJur1

1
Introduction to Natural Language Processing and
Text Mining and The basic building blocks
  • Sudeshna Sarkar
  • Professor
  • Computer Science & Engineering Department
  • Indian Institute of Technology Kharagpur

2
Ambiguity
  • At last, a computer that understands you like
    your mother.
  • -- 1985 McDonnell-Douglas Ad
  • Different interpretations
  • The computer understands you as well as your
    mother understands you.
  • The computer understands that you like your
    mother.
  • The computer understands you as well as it
    understands your mother.
  • Speech: "... a computer that understands your lie
    cured mother"

3
Why is NLP difficult?
  • Natural Language is highly ambiguous.
  • Syntactic ambiguity
  • The president spoke to the nation about the
    problem of drug use in the schools from one coast
    to the other.
  • has 720 parses.
  • Ex.
  • "to the other" can attach to any of the previous
    NPs (e.g. "the problem"), or the head verb → 6
    places
  • "from one coast" has 5 places to attach
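The figure of 720 is consistent with a simple counting argument, sketched below. The assumption (not stated on the slide) is that the k-th prepositional phrase can attach either to the head verb or to the noun phrase inside any of the k-1 earlier prepositional phrases, and that the choices are independent.

```python
import math

# The six prepositional phrases that follow "spoke", in sentence order.
pps = [
    "to the nation", "about the problem", "of drug use",
    "in the schools", "from one coast", "to the other",
]

# Assumption: PP number k has k attachment sites
# (the head verb plus the k-1 noun phrases introduced before it).
sites = {pp: k for k, pp in enumerate(pps, start=1)}
total_parses = math.prod(sites.values())

print(sites["to the other"], sites["from one coast"], total_parses)
```

Under this reading "to the other" has 6 sites and "from one coast" has 5, matching the slide, and the product of all six counts is 720.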

4
Why is NLP difficult?
  • Word category ambiguity
  • book --> verb? or noun?
  • Word sense ambiguity
  • bank --> financial institution? building? or
    river side?
  • Words can mean more than their sum of parts
  • make up a story
  • Fictitious worlds
  • People on Mars can fly.
  • Defining scope
  • People like ice-cream.
  • Does this mean that all (or some?) people like
    ice cream?
  • Language is changing and evolving
  • I'll email you my answer.
  • This new S.U.V. has a compartment for your mobile
    phone.
  • Googling,

5
Why is NLP hard?
  • Natural language is
  • Highly ambiguous at all levels
  • Complex
  • Probabilistic, fuzzy
  • Involves reasoning about the world
  • Deals with complex social interactions
  • Why is text tough?
  • Abstract concepts are difficult to represent
  • Countless combinations of subtle, abstract
    relationships among concepts
  • Many ways to represent similar concepts
  • Concepts are difficult to visualize
  • High dimensionality - Tens or hundreds of
    thousands of features

6
How is NLP doable?
  • But in some senses NLP is quite easy
  • Rough text features good enough for many useful
    tasks
  • Why is text easy?
  • Highly redundant data
  • Just about any simple algorithm can get good
    results for simple tasks
  • Pull out important phrases
  • Find meaningfully related words
  • Create some sort of summary from documents

7
Levels of Text Processing
  • Word Level
  • Word Properties
  • Stop-Words
  • Stemming
  • Frequent N-Grams
  • Thesaurus (WordNet)
  • Sentence Level
  • Document Level
  • Document-Collection Level
  • Linked-Document-Collection Level
  • Application Level
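A minimal sketch of two of the word-level steps listed above, stop-word removal and frequent n-grams. The stop-word list and the example sentence are illustrative assumptions, not taken from the slides.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def frequent_ngrams(tokens, n=2, top=3):
    """Count contiguous n-grams and return the most frequent ones."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(top)

tokens = tokenize("The bank of the river is near the bank of the city.")
content_words = remove_stop_words(tokens)
print(content_words)
print(frequent_ngrams(tokens, n=2, top=3))
```

Rough features like these are what slide 6 means by "good enough for many useful tasks".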

8
Models and Algorithms
  • Models: formalisms used to capture the various
    kinds of linguistic structure.
  • State machines (FSAs, transducers, Markov models)
  • Formal rule systems (context-free grammars,
    feature systems)
  • Logic (predicate calculus, inference)
  • Probabilistic versions of all of these, plus others
    (Gaussian mixture models, probabilistic
    relational models, etc.)
  • Algorithms: used to manipulate representations to
    create structure.
  • Search (A*, dynamic programming)
  • EM
  • Supervised learning, etc.
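As an illustration of the simplest model in the list, a state machine, here is a sketch of a deterministic finite-state acceptor. The pattern it recognises ("b", then two or more "a", then "!") is a toy chosen for this example and is not specified in the slides.

```python
# Transition table of a deterministic FSA: (state, symbol) -> next state.
# The toy language "baa+!" is an illustrative assumption.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(string):
    """Return True if the FSA ends in an accepting state after reading string."""
    state = 0
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts("baaa!"), accepts("ba!"))
```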

9
Language Processing Pipeline
  • speech
  • text
  • POS tagging
  • WSD
  • Shallow parsing
  • Deep Parsing
  • Anaphora resolution
  • Integration

10
The Big Picture
  • Source Language Speech Signal
  • Target Language Speech Signal
  • Speech recognition
  • Speech Synthesis
  • Target text Generation
  • Source text Analysis

11
Some Building Blocks
  • Source Language Analysis | Target Language Generation
  • Text Normalization | Text Rendering
  • Morphological Analysis | Morphological Synthesis
  • POS Tagging | Phrase Generation
  • Parsing | Role Ordering
  • Semantic Analysis | Lexical Choice
  • Discourse Analysis | Discourse Planning

12
Two Approaches
  • Symbolic
  • Encode all the necessary knowledge
  • Good when annotated data is not available
  • Allows steady development
  • The development can be monitored
  • Fits well with logic and reasoning in AI
  • Statistical
  • Learn language from its usage
  • Supervised learning requires large collections
    manually annotated with meta-tags
  • Development is almost blind
  • Few ways to check the correctness
  • Debugging is very frustrating

13
Resolve Ambiguities
  • We will introduce models and algorithms to
    resolve ambiguities at different levels.
  • part-of-speech tagging -- Deciding whether "duck"
    is a verb or a noun.
  • word-sense disambiguation -- Deciding whether
    "make" means "create" or "cook".
  • lexical disambiguation -- Part-of-speech tagging
    and word-sense disambiguation are two important
    kinds of lexical disambiguation.
  • syntactic ambiguity -- "her duck" is an example of
    syntactic ambiguity, and can be addressed by
    probabilistic parsing.
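A toy sketch of part-of-speech disambiguation for "duck", guessing the tag from the preceding word. The cue lists and tag names are hypothetical, and this is a hand-written rule, not the probabilistic models the slides go on to introduce.

```python
# Hypothetical context cues: the word before "duck" suggests its tag.
NOUN_CUES = {"the", "a", "this", "that"}
VERB_CUES = {"to", "will", "must", "they", "i", "we"}

def guess_tag(previous_word):
    """Guess NOUN or VERB from the previous word; AMBIGUOUS if no cue fires."""
    prev = previous_word.lower()
    if prev in VERB_CUES:
        return "VERB"
    if prev in NOUN_CUES:
        return "NOUN"
    return "AMBIGUOUS"

def tag_duck(sentence):
    """Return one guessed tag per occurrence of 'duck' in the sentence."""
    words = sentence.lower().split()
    return [
        guess_tag(words[i - 1] if i > 0 else "")
        for i, word in enumerate(words)
        if word == "duck"
    ]

print(tag_duck("They duck when the duck flies"))
print(tag_duck("I saw her duck"))
```

The second call returns AMBIGUOUS: "her" fits both readings, which is why "her duck" needs syntactic context rather than a local cue.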

14
Languages
  • Languages: 39,000 languages and dialects (22,000
    dialects in India alone)
  • Top languages
  • Chinese/Mandarin (885M),
  • Spanish (332M),
  • English (322M),
  • Bengali (189M),
  • Hindi (182M),
  • Portuguese (170M), Russian (170M), Japanese
    (125M)
  • Source: www.sil.org/ethnologue, www.nytimes.com
  • Internet: English (128M), Japanese (19.7M),
    German (14M), Spanish (9.4M), French (9.3M),
    Chinese (7.0M)
  • Usage: English (1999-54, 2001-51, 2003-46,
    2005-43)
  • Source: www.computereconomics.com

15
  • Tokenization
  • Segmentation
  • Stemming/ lemmatization
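A rough sketch of the three steps named on this slide, using only regular expressions and suffix stripping. The rules are deliberately naive assumptions for illustration; they are not a real segmenter or stemmer.

```python
import re

def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Split into word tokens (keeping internal apostrophes) and punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

sentences = split_sentences("I'll email you. People like ice-cream!")
print(sentences)
print(tokenize(sentences[0]))
print([naive_stem(w) for w in ["dogs", "boxes", "parsing", "bus"]])
```

The segmentation rule would wrongly split after an abbreviation such as "S.U.V.", and the stemmer turns "parsing" into "pars". Both show why these steps need more than surface patterns.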

16
Morphology
  • Morphology is the field of linguistics that
    studies the internal structure of words
  • How words are built up from smaller meaningful
    units called morphemes (morph = shape, logos =
    word)
  • We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to
    change their meanings and grammatical functions
  • Prefixes: un-, anti-, etc. (a-, ati-, pra-, etc.)
  • Suffixes: -ity, -ation, etc. (-taa, -ke, -ka, etc.)
  • Infixes are inserted inside the stem
  • Tagalog: um + hingi → humingi
  • Circumfixes precede and follow the stem
  • Turkish can have words with a lot of suffixes
    (agglutinative language)
  • Many Indian languages also have agglutinative
    suffixes

17
Examples (English)
  • unladylike
  • 3 morphemes, 4 syllables
  • un-: not
  • lady: (well-behaved) female adult human
  • -like: having the characteristics of
  • Can't break any of these down further without
    distorting the meaning of the units
  • dogs
  • 2 morphemes, 1 syllable
  • -s, a plural marker on nouns
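The two segmentations above can be reproduced with a small lookup-based sketch. The prefix, suffix and stem lists are hypothetical and cover only these examples.

```python
# Hypothetical mini-inventory covering only the slide's examples.
PREFIXES = ("un",)
SUFFIXES = ("like", "s")
STEMS = {"lady", "dog"}

def segment(word):
    """Split a word into prefix, stem and suffix; return None if no stem fits."""
    parts = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            parts.append(prefix + "-")
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in STEMS:
            return parts + [word[: -len(suffix)], "-" + suffix]
    return parts + [word] if word in STEMS else None

print(segment("unladylike"))
print(segment("dogs"))
```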

18
Examples (Bengali)
  • chhelederTaakei
  • 5 morphemes
  • chhele: boy
  • -der: plural genitive
  • -Taa: classifier
  • -ke: dative
  • -i: emphasizer
  • Can't break any of these down further without
    distorting the meaning of the units
  • atipraakrritake
  • ati-
  • praakrrita
  • -ke

19
Inflectional & Derivational Morphology
  • We can also divide morphology up into two broad
    classes
  • Inflectional
  • Derivational
  • Inflectional morphology is grammatical
  • number, tense, case, gender
  • Derivational morphology concerns word building
  • part-of-speech derivation
  • words with related meaning

20
Inflectional Morphology
  • Inflection
  • Variation in the form of a word, typically by
    means of an affix, that expresses a grammatical
    contrast.
  • Doesn't change the word class
  • Usually produces a predictable, nonidiosyncratic
    change of meaning, e.g. may add tense, number,
    person, mood, aspect
  • Serves a grammatical/semantic purpose different
    from the original
  • Highly systematic, though there may be
    irregularities and exceptions
  • Simplifies lexicon, only exceptions need to be
    listed
  • Unknown words may be guessable
  • After a combination with an inflectional
    morpheme,
  • the meaning and class of the actual stem usually
    do not change.
  • eat / eats, pencil / pencils
  • khelaa / khele / khelchhila, bai / baiTAke /
    baiyera

21
Derivational Morphology
  • Derivation
  • The formation of a new word or inflectable stem
    from another word or stem.
  • After a combination with a derivational
    morpheme, the meaning and the class of the actual
    stem usually change.
  • compute / computer, do / undo, friend /
    friendly
  • uygar / uygarlas, kapi / kapici
  • udaara (J) / udaarataa (N)
  • bhadra / abhadra
  • baayu / baayabiiya
  • Irregular changes may happen with derivational
    affixes.
  • Fairly systematic, and predictable up to a point
  • Simplifies description of lexicon: regularly
    derived words need not be listed
  • Unknown words may be guessable
  • But
  • Apparent derivations have specialised meaning
  • Some derivations missing

22
Morphological processes
  • Affixes: prefix, suffix, infix, circumfix
  • Vowel change (umlaut, ablaut)
  • Gemination, (partial) reduplication
  • Root and pattern
  • Stress (or tone) change
  • Sandhi

23
Concatenative Morphology
  • Morpheme+Morpheme+Morpheme
  • Stems: also called lemma, base form, root, lexeme
  • hope+ing → hoping, hop+ing → hopping
  • Affixes
  • Prefixes: Antidisestablishmentarianism (anti-, dis-)
  • Suffixes: Antidisestablishmentarianism (-ment,
    -arian, -ism)
  • Infixes: hingi (borrow) → humingi (borrower) in
    Tagalog
  • Circumfixes: sagen (say) → gesagt (said) in
    German
  • Agglutinative Languages
  • uygarlastiramadiklarimizdanmissinizcasina
  • uygar+las+tir+ama+dik+lar+imiz+dan+mis+siniz+casina
  • Behaving as if you are among those whom we could
    not cause to become civilized

24
Morphophonemics
  • Morphemes and allomorphs
  • e.g. {plur}: (e)s, vowel change, y→ies, f→ves,
    um→a, ∅, ...
  • Morphophonemic variation
  • Affixes and stems may have variants which are
    conditioned by context
  • e.g. -ing in lifting, swimming, boxing, raining,
    hoping, hopping
  • Rules may be generalisable across morphemes
  • e.g. -(e)s in cats, boxes, tomatoes, matches,
    dishes, buses
  • Applies to both {plur} (nouns) and {3rd sing
    pres} (verbs)
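A sketch of the -(e)s rule as a single function that serves both noun plurals and third-person-singular verbs, as the last bullet suggests. The list of stems in -o that take -es is an illustrative assumption and is incomplete. The f→ves alternation is left out because it is lexically conditioned.

```python
# Endings after which the suffix surfaces as -es.
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")
# Hypothetical, incomplete list of stems in -o that take -es.
O_STEMS_TAKING_ES = {"tomato", "potato", "hero"}

def add_s(stem):
    """Attach -(e)s to a noun (plural) or verb (3rd singular present) stem."""
    if stem.endswith(SIBILANT_ENDINGS) or stem in O_STEMS_TAKING_ES:
        return stem + "es"
    # consonant + y -> ies (try -> tries), but vowel + y keeps y (play -> plays)
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"
    return stem + "s"

for stem in ["cat", "box", "tomato", "match", "dish", "bus", "try", "play"]:
    print(stem, "->", add_s(stem))
```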

25
Templatic Morphology
  • Roots and Patterns
  • Example Hebrew verbs
  • Root
  • Consists of 3 consonants CCC
  • Carries basic meaning
  • Template
  • Gives the ordering of consonants and vowels
  • Specifies semantic information about the verb
  • Active, passive, middle voice
  • Example
  • lmd (to learn or study)
  • CaCaC -> lamad (he studied)
  • CiCeC -> limed (he taught)
  • CuCaC -> lumad (he was taught)
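Root-and-pattern combination can be sketched as filling each C slot of the template with the next root consonant. The function name is hypothetical; the root and templates are the ones on the slide.

```python
def apply_template(root, template):
    """Interleave root consonants into the 'C' slots of a template."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Every 'C' takes the next root consonant; vowels are copied as they are.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
```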

26
Syntax and Morphology
  • Phrase-level agreement
  • Subject-Verb
  • John studies hard (STUDY+3SG)
  • Noun-Adjective
  • Achchhi Ladki
  • In some languages like Sanskrit, morphology
    contains a lot of information about structure

27
Morphology in NLP
  • Analysis vs synthesis
  • "what does dogs mean?" vs "what is the plural of
    dog?"
  • Analysis
  • Need to identify lexeme
  • Tokenization
  • To access lexical information
  • Inflections (etc.) carry information that will be
    needed by other processes (e.g. agreement is useful
    in parsing; inflections can carry meaning, e.g.
    tense, number)
  • Morphology can be ambiguous
  • May need other processes to disambiguate (e.g.
    German -en)
  • Synthesis
  • Need to generate appropriate inflections from
    underlying representation
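A toy sketch of the two directions: analysis maps a surface form to lexeme plus features, synthesis maps them back. The lexicon, class labels and feature names are hypothetical, and only the regular -s suffix is modelled.

```python
# Hypothetical mini-lexicon: lexeme -> word classes it can belong to.
LEXICON = {"dog": {"N", "V"}, "cat": {"N"}, "eat": {"V"}}

# The suffix -s marks plural on nouns and 3rd person singular on verbs.
SUFFIX_FEATURES = {"N": "PL", "V": "3SG"}

def analyze(word):
    """Return every (lexeme, class, feature) analysis of a surface form."""
    analyses = [(word, cls, "BASE") for cls in sorted(LEXICON.get(word, ()))]
    if word.endswith("s") and word[:-1] in LEXICON:
        lexeme = word[:-1]
        analyses += [
            (lexeme, cls, SUFFIX_FEATURES[cls]) for cls in sorted(LEXICON[lexeme])
        ]
    return analyses

def generate(lexeme, cls, feature):
    """Return the surface form for a lexeme, word class and feature."""
    if cls not in LEXICON.get(lexeme, ()):
        raise ValueError(f"unknown lexeme/class: {lexeme}/{cls}")
    if feature == "BASE":
        return lexeme
    if feature == SUFFIX_FEATURES[cls]:
        return lexeme + "s"
    raise ValueError(f"feature {feature} does not apply to class {cls}")

print(analyze("dogs"))
print(generate("dog", "N", "PL"))
```

"dogs" gets two analyses here (noun plural and verb third singular), a small instance of the ambiguity that another process has to resolve.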