Text Processing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Text Processing


1
Text Processing
  • Purpose: prepare text for indexing and retrieval
  • Standard text preparation processes
  • Stopping
  • Stemming
  • Collocations
  • Advanced text preparation processes
  • Tagging
  • Parsing: identifying structure
  • HM (head-modifier) pairs
  • Concept extraction
  • Co-references and cross-references

2
Text Preparation
[Diagram: text is run through text processing to build an INDEX; search then queries that index, e.g., "What recent disasters occurred in tunnels?"]
3
Typical Text Processing Steps
[Pipeline diagram: text passes through stopping, stemming, collocation detection, tagging, parsing, name and concept extraction, and HM-pair extraction]
4
Stopping
  • Elimination of stopwords
  • Not used in indexing
  • Not considered content words
  • Standard list: http://www.uspto.gov/patft/stopword.htm
  • Elimination of common words and symbols
  • Domain dependent, e.g., 'today' in news
  • Elimination of annotations, symbols
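A minimal sketch of stopping in Python; the stopword list here is a tiny illustrative sample, not the full USPTO list linked above:

# Minimal stopping sketch: drop stopwords and bare symbols before indexing.
# The stopword list is a small illustrative sample.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is", "are"}

def stop(tokens):
    """Keep only content-bearing tokens: not stopwords, not pure punctuation."""
    return [t for t in tokens if t.lower() not in STOPWORDS and t.isalnum()]

print(stop("What recent disasters occurred in tunnels ?".split()))
# ['What', 'recent', 'disasters', 'occurred', 'tunnels']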

5
Stemming
  • Reducing words to root forms
  • Eliminate morphological variations
  • retrieval, retrieved, retrieving → retriev
  • Break/unbreak multi-word compounds
  • stop-words, stop words, stopwords
  • Detect negations
  • relevant, non-relevant, irrelevant, not relevant
  • in order to increase retrieval probability
    (recall)
  • variants considered synonymous (?)
  • statistics more accurate

6
Stemming Approaches
  • Standard word cutters (e.g., Porter's)
  • Use a list of standard word endings
  • -ing, -s, -es, -ed, -ally,
  • Usually cuts off the longest matching suffix
  • But makes sure the remaining stem is not too short
  • Morphological
  • Performs morphological analysis of each word
  • Requires part-of-speech information (why?)
  • Dictionary-based
  • Uses a lexicon to reduce words to root form
    (rather than cut suffix)
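A toy "standard word cutter" sketch, illustrating longest-suffix stripping with a minimum stem length; this is not the actual Porter algorithm, and the suffix list and threshold are invented:

# Toy suffix-stripping stemmer: cut the longest matching ending,
# but only if the remaining stem is not too short.
SUFFIXES = sorted(["ing", "ally", "ed", "es", "s"], key=len, reverse=True)
MIN_STEM = 3

def cut(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= MIN_STEM:
            return word[:-len(suf)]
    return word

for w in ["retrieving", "stresses", "walked", "is"]:
    print(w, "->", cut(w))
# retrieving -> retriev, stresses -> stress, walked -> walk, is -> is (stem too short)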

7
Common Problems
  • Insufficient normalization
  • stress → stres, stresses → stresse
  • stresses → stress; but not forbes → forb
  • Excessive normalization
  • wander → wand, but sander → sand
  • probate → prob, probe → prob: not both!

8
Dictionary-based Stemmer
  • Described in (Strzalkowski, 1994)
  • Uses an on-line machine-readable dictionary (MRD)
  • For each word:
  1. Determine part of speech: N, V, ADJ, ADV, ...
  2. Determine inflection patterns → legal suffixes
  3. Cut off the longest matching suffix
  4. Add on (verb) root-form ending if required
  5. Verify root form against the dictionary
  6. If step 5 fails, repeat 3 through 5 for other
     suffixes
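A rough sketch of the procedure above; the mini-lexicon, suffix table, and restorable endings are invented stand-ins for a real machine-readable dictionary:

# Sketch of dictionary-based stemming: strip a legal suffix for the word's
# part of speech, optionally restore a root ending, and verify the result
# against a lexicon.
LEXICON = {"retrieve": "V", "stress": "N", "probe": "V"}        # toy dictionary
SUFFIXES = {"N": ["es", "s", "al"], "V": ["ed", "en", "ing", "es", "s"]}
RESTORE = ["", "e"]                                             # possible root endings

def dict_stem(word, pos):
    for suf in sorted(SUFFIXES[pos], key=len, reverse=True):
        if not word.endswith(suf):
            continue
        stem = word[:-len(suf)]
        for end in RESTORE:                                     # e.g. retriev + e
            if stem + end in LEXICON:
                return stem + end                               # verified root form
    return word                                                 # fall back: unchanged

print(dict_stem("retrieval", "N"))   # retrieve
print(dict_stem("stresses", "N"))    # stress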

9
Stemming Example
  • retrieval → retrieve
  • retrieval (N)
  • -es, -s, -al, ...
  • retrieval → retriev
  • + e
  • retriev + e
  • OK: retrieve

10
Stemming Example
  • retrieved → retrieve
  • retrieved (VBN, VBD)
  • -ed, -en
  • retrieved → retriev
  • + e
  • retriev + e
  • OK: retrieve

11
Collocations
  • Identifying words that frequently come together,
    because they may
  • Denote concepts: White House, senior citizen,
    joint venture
  • Predict the presence of the other word in text
  • Can be used as units in indexing
  • Should the component words be used as well?
  • The Mutual Information formula is useful (a sketch follows this list)
  • Collocations may be specific to domains/text
    genres
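One common form of the test is pointwise mutual information over co-occurrence counts; a sketch with invented counts:

import math

# Pointwise mutual information for a candidate collocation (x, y):
# PMI = log2( P(x, y) / (P(x) * P(y)) ). High values mean the two words
# co-occur far more often than chance, as in "White House".
def pmi(count_xy, count_x, count_y, total_pairs):
    p_xy = count_xy / total_pairs
    p_x = count_x / total_pairs
    p_y = count_y / total_pairs
    return math.log2(p_xy / (p_x * p_y))

# Illustrative counts only (not from a real corpus).
print(pmi(count_xy=80, count_x=200, count_y=150, total_pairs=1_000_000))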

12
Part-of-Speech Tagging
  • Goal: tag all words in text with POS info
  • Part-of-speech classes in English
  • Nouns: cat, dog, retrieval, ...
  • Verbs: buy, walk, argued, processing, ...
  • Adjectives: red, white, happy, ...
  • Adverbs: fast, slowly, carefully, ...
  • Conjunctions: and, or, but, ...
  • Determiners: the, this, some, ...
  • Automated systems usually use a more detailed tagset

13
Why POS tagging?
  • POS tag depends upon word use in context
  • They drive (V) very fast.
  • My disk drive (N) crashed.
  • Tagging removes some ambiguity that arises from
    treating words separately
  • Tagged text can be analysed for phrases and other
    compounds

14
Example of POS tagged text
  • For McCaw, it would have hurt the company's
    strategy of building a seamless national cellular
    network.
  • For/in McCaw/pn, it/pp would/md have/vb hurt/vbn
    the/dt company/nn 's/pos strategy/nn of/in
    building/vbg a/dt seamless/jj national/jj
    cellular/jj network/nn

15
Phrase Identification
  • POS tags can be used to identify phrases
  • NP = (dt) ((rb) (jj)) (nn pos) (nn | nns | pn)
  • For/in McCaw/pn , it/pp would/md have/vb
    hurt/vbn the/dt company/nn 's/pos strategy/nn
    of/in building/vbg a/dt seamless/jj national/jj
    cellular/jj network/nn
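A sketch of pattern-based phrase identification over a fragment of the tagged sentence above, using a simplified version of the NP pattern (possessives and adverbs omitted):

import re

# Pattern-based NP identification over POS tags: optional determiner,
# any adjectives, then one or more nouns/proper nouns.
tagged = [("the", "dt"), ("company", "nn"), ("'s", "pos"), ("strategy", "nn"),
          ("of", "in"), ("building", "vbg"), ("a", "dt"), ("seamless", "jj"),
          ("national", "jj"), ("cellular", "jj"), ("network", "nn")]

tag_string = " ".join(tag for _, tag in tagged)
np_pattern = re.compile(r"(?:dt )?(?:jj )*(?:nn|pn)(?: (?:nn|pn))*")

for m in np_pattern.finditer(tag_string):
    start = len(tag_string[:m.start()].split())   # map char offset to token index
    length = len(m.group().split())
    print(" ".join(word for word, _ in tagged[start:start + length]))
# prints: "the company", "strategy", "a seamless national cellular network"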

16
How does POS tagging work?
  • Stochastic approaches (e.g., Kupiec)
  • Use HMMs over word trigrams
  • Requires training; accuracy up to 98%
  • Rule-based approaches (e.g., Brill)
  • Initial tags from a lexicon (may be ambiguous)
  • Some empirical rules, e.g., no vb after dt
  • Supervised error-driven learning: accuracy up to
    98%
  • Learn tag preferences for words (tag ranking)
  • Learn tagging rules (e.g., nn preferred after jj), as in the sketch below
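A toy sketch of Brill-style error-driven rule learning; the two-sentence training corpus and the rule format "change tag A to B when the previous tag is C" are invented for illustration:

from collections import Counter

# Toy Brill-style learning: start from each word's most frequent tag, then
# score candidate rules "change tag A to B when the previous tag is C" by
# how many initial-tagging errors they fix minus how many they introduce.
corpus = [[("the", "dt"), ("fast", "jj"), ("drive", "nn"), ("crashed", "vbd")],
          [("they", "pp"), ("drive", "vb"), ("fast", "rb")]]

# 1) Initial tagger: most frequent tag per word in the training data.
freq = Counter((w, t) for sent in corpus for w, t in sent)
best = {}
for (w, t), c in freq.items():
    if w not in best or c > freq[(w, best[w])]:
        best[w] = t

def initial_tags(sent):
    return [best[w] for w, _ in sent]          # toy: training words only

# 2) Score a candidate rule (a, b, prev) against the gold tags.
def score(rule):
    a, b, prev = rule
    gain = 0
    for sent in corpus:
        tags = initial_tags(sent)
        for i in range(1, len(tags)):
            if tags[i] == a and tags[i - 1] == prev:
                gold = sent[i][1]
                gain += (gold == b) - (gold == a)   # +1 if the rule fixes, -1 if it breaks
    return gain

# "Retag nn as vb after a pronoun" fixes "they drive(vb)" without touching
# "the fast drive(nn)", so it scores positively and would be kept.
print(score(("nn", "vb", "pp")))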

17
Error-Driven Learning
  • A powerful paradigm for supervised machine
    learning.
  • Proposed by Eric Brill in his PhD work (1992)
  • Applications in routing and classification
  • General idea
  • Assume initial classification could be a guess
  • Manually correct errors
  • Have the system correct its behavior to
    accommodate the corrections
  • Use unbiased training data

18
Parsing
  • Use English (or other language) grammar to derive
    full (or approximate) syntactic structure of
    text.
  • Hand constructed grammars (generic)
  • Stochastic grammars (derived from training texts)
  • Identify phrases, word-dependencies
  • Normalize structural variants

19
Parsing Example
assert   will_aux   perf have
  verb      hurt
  subject   np  n it
  object    np  n strategy   t_pos the
                n_pos  poss  n company
                of  verb build
                      subject   anyone
                      object    np  n network   t_pos a
                                    adj seamless
                                    adj national
                                    adj cellular
20
Graphical Parsing
[Graph of the parse: the assert node branches into aux (will), perf (have), predicate verb (hurt), subject (it), and an object NP with tpos, npos, and head (strategy)]
21
Head-Modifier Dependencies
[Same parse as on the previous slide, with head and modifier roles marked on the dependency links (e.g., hurt is the head and strategy its modifier)]
22
Head-Modifier Structures
  • HM Dependencies extracted from this parse
    (head+modifier format)
  • hurt+strategy, strategy+company,
  • build+network,
  • network+cellular, network+national,
    network+seamless
  • Can be used as indexing terms
  • More refined than simple phrases
  • Normalization across all syntactic forms
    (problems?)
  • Can be nested or un-nested pairs
  • Order information important
  • venetian+blind ≠ blind+venetian
  • college+freshman ≠ freshman+college
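A minimal sketch of extracting such pairs from a nested head-modifier structure; the nested tuples below hand-encode the example parse and are not a parser's actual output format:

# Extract head+modifier index terms from nested (head, modifiers) structures.
parse_roots = [
    ("hurt", [("strategy", [("company", [])])]),
    ("build", [("network", [("cellular", []), ("national", []), ("seamless", [])])]),
]

def hm_pairs(node):
    head, modifiers = node
    pairs = []
    for mod in modifiers:
        pairs.append(f"{head}+{mod[0]}")   # order matters: head first, modifier second
        pairs.extend(hm_pairs(mod))        # nested pairs from deeper modifiers
    return pairs

terms = [pair for root in parse_roots for pair in hm_pairs(root)]
print(terms)
# ['hurt+strategy', 'strategy+company', 'build+network',
#  'network+cellular', 'network+national', 'network+seamless']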

23
Stream Model
[Diagram: documents and search topics are each reduced to parallel streams of phrases, names, and HM pairs; per-stream results are combined into a merged, ranked output]
24
Stream Model IR
[Diagram: NLP processes text into stream indexes (words, phrases, people, locations, weapons); the query is run by search engines over each index, the results are fused, then summarized and presented, with feedback flowing back to the query]
25
Stream Model Evaluation
  • Compare performance of different indexing
    approaches
  • Determine the contribution of each stream to the
    overall result
  • Uncertainty factors:
  • Ranking fusion (a sketch follows this list)
  • Cross-stream dependencies
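A sketch of one simple fusion strategy, a reciprocal-rank-style merge of per-stream rankings; the slides do not specify which fusion method was actually used, and the ranked lists here are invented:

# Merge per-stream rankings by summing reciprocal ranks of each document
# (a reciprocal-rank-fusion-style merge).
streams = {
    "phrases":  ["d3", "d1", "d7"],
    "HM pairs": ["d1", "d3", "d9"],
    "names":    ["d7", "d1"],
}

def fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists.values():
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(fuse(streams))   # merged ranked output: ['d1', 'd3', 'd7', 'd9']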

26
Stream Model Evaluation (TREC-5)
27
Concept Extraction
  • Explicit identification of references to
  • Named entities: people, organizations, locations
  • Events and relationships
  • Detection of small text regions that
  • Have features indicating presence of concepts
  • No explicit extraction: can't use in index
    (why?)
  • Used for query expansion and relevance feedback

28
Concept-based Indexing
  • Index documents using concepts such as
  • entities, events, relations
  • E.g., 'disasters in tunnels'
  • But how to represent 'disaster', 'tunnel', etc.?
  • But how to recognize 'disaster', 'tunnel' in text?
  • How to represent documents with concepts?
  • Weighted keys?
  • Semantic maps?

29
Using concepts to enrich BOW
  • Add compound terms and concepts to documents
  • Treat as tokens in the bag-of-words
  • Weigh them just like other tokens
  • tf·idf based on distribution
  • A function of weights of component words
  • Ad-hoc
  • Neither approach satisfactory (why not?)
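A sketch of the first option above: detected collocations are appended as extra tokens and weighted by tf·idf like any other token; documents and collocations are invented:

import math
from collections import Counter

# Enrich a bag-of-words with compound terms: detected collocations are
# simply added as extra tokens and weighted by tf-idf like everything else.
docs = [["joint", "venture", "announced"],
        ["senior", "citizen", "benefits"],
        ["venture", "capital", "fund"]]
collocations = {("joint", "venture"), ("senior", "citizen")}

def enrich(tokens):
    extra = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:]) if (a, b) in collocations]
    return tokens + extra

enriched = [enrich(d) for d in docs]
df = Counter(t for d in enriched for t in set(d))   # document frequencies
N = len(enriched)

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(enriched[0]))   # 'joint_venture' gets its own weight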

30
Detecting concept presence
  • Supervised Machine Learning
  • Use human annotated text for training
  • Extract context cues that indicate concepts of a
    given kind
  • Construct first-cut recognizers
  • Unsupervised fitting
  • Apply recognizer to new training data
    (un-annotated)
  • Learn more context cues from where the concepts
    occur
  • Revise recognizer rules, iterate until stable

31
Self-Learning Concept Spotter
  • Proposed by Strzalkowski & Wang, 1996
  • Start from seed descriptions/naïve rules
  • Bootstrap from examples found in text
  • Very effective, converges quickly
  • Accuracy rivals human-made grammars
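A toy bootstrapping sketch in the spirit of the S-LCS examples on the next slides; the tokenization, seed rule, and learned context rule are heavily simplified:

# Toy bootstrapping sketch: a seed rule tags a capitalized run ending in
# "Co." or "Inc." as a COMPANY; the contexts those names appear in
# ("president of", "chairman of") are then promoted to a new rule that
# spots further names, such as "Skandinaviska Enskilda Banken".
sentences = [
    "Henry Kaufman is president of Henry Kaufman Co.".split(),
    "Gabelli , chairman of Gabelli Funds Inc.".split(),
    "Claude N. Rosenberg is named president of Skandinaviska Enskilda Banken".split(),
]

def cap_run_ending_at(tokens, i):
    """Longest run of capitalized tokens ending at position i."""
    j = i
    while j >= 0 and tokens[j][0].isupper():
        j -= 1
    return tokens[j + 1:i + 1]

def cap_run_starting_at(tokens, i):
    """Longest run of capitalized tokens starting at position i."""
    j = i
    while j < len(tokens) and tokens[j][0].isupper():
        j += 1
    return tokens[i:j]

def seed_spot(tokens):                      # seed rules: PN + Co. | PN + Inc.
    return [" ".join(cap_run_ending_at(tokens, i))
            for i, tok in enumerate(tokens) if tok in ("Co.", "Inc.")]

def context_spot(tokens):                   # learned rule: 'president/chairman of' + PN
    return [" ".join(cap_run_starting_at(tokens, i + 2))
            for i in range(len(tokens) - 2)
            if tokens[i] in ("president", "chairman") and tokens[i + 1] == "of"]

print([c for s in sentences for c in seed_spot(s)])
# ['Henry Kaufman Co.', 'Gabelli Funds Inc.']
print([c for s in sentences for c in context_spot(s)])
# ['Henry Kaufman Co.', 'Gabelli Funds Inc.', 'Skandinaviska Enskilda Banken']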

32
S-LCS Example
Seed rules: COMPANY → NP + Co. | NP + Inc.
Henry Kaufman is president of Henry Kaufman
Co., a Gabelli, chairman of Gabelli Funds,
Inc. Claude N. Rosenberg is named president of
Skandinaviska Enskilda Banken become vice
chairman of the state-owned electronics giant
Thompson S.A. banking group, said the former
merger of Skanska Banken into water maker
Source Perrier S.A., according to French stock
33
S-LCS Example
Seed rules: COMPANY → PN + Co. | PN + Inc.
Add: COMPANY → [president | chairman] of + np(PN)
Henry Kaufman is president of Henry Kaufman
Co., a Gabelli, chairman of Gabelli Funds,
Inc. Claude N. Rosenberg is named president of
Skandinaviska Enskilda Banken become vice
chairman of the state-owned electronics giant
Thompson S.A. banking group, said the former
merger of Skanska Banken into water maker
Source Perrier S.A., according to French stock
34
S-LCS Example
Seed rules: COMPANY → PN + Co. | PN + Inc.
Add: COMPANY → [president | chairman] of + np(PN)
Add: COMPANY → PN + S.A. | PN + Banken
Henry Kaufman is president of Henry Kaufman
Co., a Gabelli, chairman of Gabelli Funds,
Inc. Claude N. Rosenberg is named president of
Skandinaviska Enskilda Banken become vice
chairman of the state-owned electronics giant
Thompson S.A. banking group, said the former
merger of Skanska Banken into water maker
Source Perrier S.A., according to French stock
35
Co- and Cross-References
  • Co-references: usually pronouns, definite
    descriptions
  • Tracking to get counts right
  • Not an easy problem generally
  • Cross-references are across documents
  • Is this the same person, place, event?
  • Generalizes to topic detection

36
XDC Approach
  • Proposed by Amit Bagga and Breck Baldwin
  • Disambiguate entities/events across documents
  • Done by looking at the context around each entity/event
  • Expected to differ for Michael Jordan (NBA) and
    Prof. Michael I. Jordan (UC Berkeley)
  • Context is extracted in the form of a summary for
    each entity/event
  • Sentence selection
  • Contexts are compared, currently using the Vector
    Space Model (a sketch follows)
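A sketch of that comparison step, assuming each entity's context summary is reduced to a term-frequency vector and compared by cosine similarity; the summaries below are invented:

import math
from collections import Counter

# Compare entity contexts in a vector space: each context summary becomes a
# term-frequency vector, and cosine similarity suggests whether two mentions
# of "Michael Jordan" refer to the same person.
def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values())) *
            math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

ctx_nba_1 = "Jordan scored 40 points as the Bulls won the NBA title"
ctx_nba_2 = "the Bulls star Jordan led the NBA in scoring"
ctx_prof  = "Jordan published work on graphical models at UC Berkeley"

print(cosine(ctx_nba_1, ctx_nba_2))   # relatively high: likely the same entity
print(cosine(ctx_nba_1, ctx_prof))    # low: likely different entities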