Title: Text Mining and Relation Detection
1Text Mining and Relation Detection
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- March 3, 2009
2Analysis for Assignment 1
- Corpora: novels, emails, programming books, Obama's and Bush's speeches
- Unigram distributions appear similar between two corpora
- Bigram distributions usually differ considerably
- Adding a stop-word list may help the statistics
- Late submission policy: 0.2 penalty/day unless special permission is given
3Outline
- Text Mining Tools, Techniques, and Applications
- Motivations and Example Applications
- Text Mining Defined
- Foundations of Text Mining
- Case Study on Relation Detection: K-Nearest Neighbor
4Motivation
- Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation)
- Information-intensive business processes demand that we move beyond simple document retrieval to knowledge discovery.
[Chart: structured numerical or coded information, 10%; unstructured or semi-structured information, 90%]
5Text Mining Applications
- Marketing: discover distinct groups of potential buyers according to a user's text-based profile
- e.g., Amazon
- Industry: identify groups of competitors' web pages
- e.g., competing products and their prices
- Job seeking: identify parameters in searching for jobs
- e.g., www.flipdog.com
6Example 1: Mining Medical Literature
- Medical research
- Find causal links between symptoms or diseases
and drugs or chemicals.
7Example 1: Medical Research
- What knowledge about migraines can be found in the literature?
- A text miner can discover that:
- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability
- (source: Swanson and Smalheiser, 1994)
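The chain of findings above can be read as pairs of linked concepts, where an unstated connection (migraine and magnesium) emerges through shared intermediate terms. The following is a minimal sketch of that idea, not the method used in the cited work: the `links` list is hand-typed from the bullets, and the function name `indirect_links` is hypothetical.

```python
from collections import defaultdict

# Concept pairs typed by hand from the slide's bullets (illustrative only).
links = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]

def indirect_links(links, start):
    """Return concepts reachable from `start` only through an intermediate
    concept, mapped to the set of intermediates that connect them."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    direct = set(neighbors[start])
    found = defaultdict(set)
    for middle in direct:
        for end in neighbors[middle]:
            # Keep only concepts with no direct link to the start concept.
            if end != start and end not in direct:
                found[end].add(middle)
    return dict(found)

hypotheses = indirect_links(links, "migraine")
for concept, via in hypotheses.items():
    print(concept, "via", sorted(via))
```

In a real system the pairs would come from co-occurrence or relation extraction over a large literature collection rather than from a hand-written list.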
8Example 2: Social Networks
A social network is a description of the social structure between actors, mostly individuals or organizations. It indicates the ways in which they are connected through various social familiarities, ranging from casual acquaintance to close familial bonds.
Demo: http://langtech.jrc.it/entities/socNet/last.html
9Example 3: Relation Detection (today's Sample Project Analysis)
- Relation: a semantic relationship between two entities
- ACE relation type examples:
- Agent-Artifact: Rubin Military Design, the makers of the Kursk
- Discourse: each of whom
- Employment/Membership: Mr. Smith, a senior programmer at Microsoft
- Place-Affiliation: Salzburg Red Cross officials
- Person-Social: relatives of the dead
- Physical: a town some 50 miles south of Salzburg
- Other-Affiliation: Republican senators
Application example: employee CV/publication mining
Demos:
http://arnetminer.com
http://www.cs.washington.edu/research/textrunner/
http://belobog.si.umich.edu/clair/anthology/other_search.cgi
10Example 4: Sentiment Analysis
- Extracting hidden meaning or sentiment based on use of language.
- Example: "Customer is unhappy with their service!"
- Sentiment: discontent
- Sentiment is:
- Emotions: fear, love, hate, sorrow
- Feelings: warmth, excitement
- Mood, disposition, temperament
- Or even (someday):
- Lies, sarcasm
11Example 4: Decision Support using Bank Call Center Data
- The information source:
- Call center records
- Example record:
AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account.
- Quick answers to important questions:
- Which offices receive the most angry calls?
- What products have the fewest satisfied customers?
- ("Angry" and "Satisfied" are recognizable sentiments)
12Example 5: Personalized Movie Matcher
- The need:
- Match movies to individuals based on preference profile
- The information:
- Written reviews of movies
- Users' lists of favorite movies
[Diagram labels: Sentiment Analysis; Movie Reviews; Typed and Tagged Reviews]
13Personalized Movie Matcher
[Figure: genre labels (Action, Romance) shown with theme terms (absurdity, conflict, insecurity, crime, injustice, inferiority, death, deception, immorality, horror, destruction, fear); scale marks 0 and 1]
14Outline
- Text Mining Tools, Techniques, and Applications
- Motivations and Example Applications
- Text Mining Defined
- Foundations of Text Mining
- Case Study on Relation Detection: K-Nearest Neighbor
15Text Mining Defined
- Discover useful and previously unknown "gems" of information in large text collections
[Diagram labels: Patterns, Trends, Associations]
16What we learned in last class: Comparing IR to databases
17Search versus Discover
                            Search (goal-oriented)    Discover (opportunistic)
Structured Data             Data Retrieval            Data Mining
Unstructured Data (Text)    Information Retrieval     Text Mining
18Data Retrieval
- Find records within a structured database.
19Information Retrieval
- Find relevant information in an unstructured
information source (usually text)
20Data Mining
- Discover new knowledge through analysis of data
21Text Mining
- Discover new knowledge through analysis of text
22Outline
- Text Mining Tools, Techniques, and Applications
- Motivations and Example Applications
- Text Mining Defined
- Foundations of Text Mining
- Case Study on Relation Detection: K-Nearest Neighbor
23Challenges of Text Mining
- Very high number of possible dimensions
- All possible word and phrase types in the language!
- Noisy data
- Example: spelling mistakes
- Not well-structured text
- Chat rooms
- "r u available ?"
- "Hey whazzzzzz up"
- Speech
- Complex and subtle relationships between concepts in text
- "AOL merges with Time-Warner"
- "Time-Warner is bought by AOL"
- Ambiguity and context sensitivity
- automobile, car, vehicle, Toyota
- Apple (the company) or apple (the fruit)
24Text mining process
25Text mining process
- Text preprocessing
- Syntactic/Semantic text analysis
- Feature Generation
- Bag of words
- Feature Selection
- Simple counting
- Statistics
- Text/Data Mining
- Classification
- Supervised learning
- Clustering
- Unsupervised learning
- Analyzing results
26Text Processing
- Statistical Analysis
- Quantify text data
- Language or Content Analysis
- Identifying structural elements
- Extracting and codifying meaning
- Reducing the dimensions of text data
27Name Tagging
Token:  George  W.   Bush  discussed  Iraq
Tag:    PER     PER  PER   X          GPE
<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
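The step from per-token tags to the bracketed output can be sketched as follows. This is an illustrative helper, not part of any tagger: the function name `tag_spans` is hypothetical, and it assumes that adjacent tokens with the same tag belong to one name, which would wrongly merge two distinct names that happen to be adjacent.

```python
from itertools import groupby

def tag_spans(tokens, labels, outside="X"):
    """Wrap each maximal run of identically tagged tokens in <TAG>...</TAG>.

    Tokens tagged with `outside` are emitted without markup.
    """
    pieces = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in group)
        pieces.append(text if label == outside else f"<{label}>{text}</{label}>")
    return " ".join(pieces)

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
labels = ["PER", "PER", "PER", "X", "GPE"]
tagged = tag_spans(tokens, labels)
print(tagged)
```

Schemes that mark the beginning of each name separately avoid the merging problem, at the cost of a larger tag set.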
28Feature selection
- Using bag of words alone is costly
- Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Irrelevant features
- Not all features help!
- e.g., the existence of a noun in a news article is unlikely to help classify it as politics or sport
29Feature Selection Example
Token:    George       W.       Bush         discussed  Iraq
Feature:  Capitalized  Initial  Capitalized  Other      Capitalized
Tag:      PER          PER      PER          X          GPE
<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
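The three feature values in this example (Capitalized, Initial, Other) can be computed from the token string alone. A minimal sketch, assuming an "Initial" is a single capital letter followed by a period; the function name `word_shape` is hypothetical.

```python
import re

def word_shape(token):
    """Map a token to one of three coarse shape features."""
    if re.fullmatch(r"[A-Z]\.", token):
        return "Initial"        # e.g. "W."
    if token[:1].isupper():
        return "Capitalized"    # e.g. "George", "Iraq"
    return "Other"              # e.g. "discussed"

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
features = [(token, word_shape(token)) for token in tokens]
for token, shape in features:
    print(f"<{token}, {shape}>")
```

Replacing thousands of distinct word strings with a handful of shape values is one way to cut dimensionality while keeping a signal that is useful for name tagging.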
30Outline
- Text Mining Tools, Techniques, and Applications
- Motivations and Example Applications
- Text Mining Defined
- Foundations of Text Mining
- Case Study on Relation Detection: K-Nearest Neighbor
- Project User Interface
- Project Documentation
- Sample Project Analysis
31Case Study: Relation Detection
- Relation: a semantic relationship between two entities
- ACE relation type examples:
- Agent-Artifact: Rubin Military Design, the makers of the Kursk
- Discourse: each of whom
- Employment/Membership: Mr. Smith, a senior programmer at Microsoft
- Place-Affiliation: Salzburg Red Cross officials
- Person-Social: relatives of the dead
- Physical: a town some 50 miles south of Salzburg
- Other-Affiliation: Republican senators
32Training Data for Relation Detection
- Train sample (Employment): the secretary of NIST
- Train sample (Employment): the previous president of the United States
- Train sample (Employment): Connecticut's governor
- Train sample (Located): his ranch in Texas
- Train sample (Located): US forces in Bahrain
33How would a Decision Tree do this?
No
34K Nearest Neighbor (KNN) Model
[Figure: an object and points p1, p2, ..., pm, with the distances to its nearest neighbors r1, r2, ...]
- KNN: find the k nearest neighbors of an object
35K Nearest Neighbor (KNN) Model
[Figure: one test sample among five train samples, K = 3]
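A minimal sketch of K-nearest-neighbor classification with K = 3: sort the training samples by distance to the test sample, keep the three closest, and take a majority vote over their labels. The 2-D points, the labels "A" and "B", and the Euclidean distance are illustrative assumptions; any distance function can be passed in.

```python
from collections import Counter

def knn_classify(test_item, training, distance, k=3):
    """Return the majority label among the k training items nearest to test_item.

    `training` is a list of (item, label) pairs; `distance` is any function
    taking two items and returning a non-negative number.
    """
    nearest = sorted(training, key=lambda pair: distance(test_item, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Five toy training samples in the plane (illustrative data only).
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]

prediction = knn_classify((1.1, 1.0), training, euclidean, k=3)
print(prediction)
```

KNN needs no training phase, but every prediction compares the test sample against all stored training samples, so the choice of distance function carries most of the modeling effort.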
36K-nearest Neighbor for Relation Detection
- Train sample (Employment): the secretary of NIST
- Train sample (Employment): the previous president of the United States
- Train sample (Employment): Connecticut's governor
- Train sample (Located): his ranch in Texas
- Train sample (Located): US forces in Bahrain
- Test sample: the president of the United States
37Distance Counting for Relation Detection
- Test sample: the president of the United States
- Train sample (Employment): the secretary of NIST
- Train sample (Employment): the previous president of the United States
- Train sample (Employment): Connecticut's governor
- Train sample (Physical): his ranch in Texas
- Train sample (Physical): US forces in Bahrain
- Distances from the test sample shown in the figure: 36, 0, 46, 46, 26
- Distance penalties:
- If the heads of the mentions don't match: 8
- If the entity types of the heads of the mentions don't match: 20
- If the intervening words don't match: 10
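The three penalties can be combined into a distance function between two relation samples. The sketch below rests on assumptions that the slide does not state: each sample has two mentions, the head and entity-type penalties are applied once per mention, and the entity types (PER, GPE, ORG) are assigned by hand. The class `RelationSample` and the function `relation_distance` are hypothetical names. Under this reading, two of the distances shown on the slide (0 and 36) are reproduced.

```python
from dataclasses import dataclass

HEAD_PENALTY = 8       # heads of the mentions don't match
TYPE_PENALTY = 20      # entity types of the heads don't match
BETWEEN_PENALTY = 10   # intervening words don't match

@dataclass(frozen=True)
class RelationSample:
    head1: str     # head word of the first mention
    type1: str     # entity type of the first mention (assumed by hand)
    head2: str     # head word of the second mention
    type2: str     # entity type of the second mention (assumed by hand)
    between: str   # words between the two mentions

def relation_distance(a, b):
    """Sum the mismatch penalties between two relation samples."""
    distance = 0
    for head_a, head_b in ((a.head1, b.head1), (a.head2, b.head2)):
        if head_a.lower() != head_b.lower():
            distance += HEAD_PENALTY
    for type_a, type_b in ((a.type1, b.type1), (a.type2, b.type2)):
        if type_a != type_b:
            distance += TYPE_PENALTY
    if a.between.lower() != b.between.lower():
        distance += BETWEEN_PENALTY
    return distance

# "the president of the United States"
test = RelationSample("president", "PER", "United States", "GPE", "of")
# "the previous president of the United States"
previous_president = RelationSample("president", "PER", "United States", "GPE", "of")
# "the secretary of NIST"
secretary = RelationSample("secretary", "PER", "NIST", "ORG", "of")

d_previous = relation_distance(test, previous_president)
d_secretary = relation_distance(test, secretary)
print(d_previous, d_secretary)
```

A function like this can serve as the distance in a k-nearest-neighbor classifier: the test sample takes the majority relation type of its k closest training samples.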
38Syntactic / Semantic text analysis
- Part-of-speech (POS) tagging
- Find the corresponding POS for each word
- e.g., John (noun) gave (verb) the (det) ball (noun)
- 98% accurate
- Word sense disambiguation
- Context-based or proximity-based
- Very accurate
- Parsing
- Generates a parse tree (graph) for each sentence
- Each sentence is a stand-alone graph
39Feature Generation: Bag of Words
- Text document is represented by the words it contains (and their occurrences)
- e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications
- Stemming: identifies a word by its root
- e.g., flying, flew -> fly
- Reduces dimensionality
- Stop words: the most common words are unlikely to help text mining
- e.g., the, a, an, you
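A bag of words with optional stop-word removal can be sketched in a few lines. The tokenizer (lowercasing, then keeping runs of letters and apostrophes) is an assumption made for illustration, the stop-word set is just the four words listed on the slide, and stemming is left out.

```python
import re
from collections import Counter

# The four stop words given as examples on the slide.
STOP_WORDS = frozenset({"the", "a", "an", "you"})

def bag_of_words(text, stop_words=frozenset()):
    """Count word occurrences, ignoring word order and any stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

full = bag_of_words("Lord of the rings")
filtered = bag_of_words("Lord of the rings", STOP_WORDS)
print(sorted(full))
print(sorted(filtered))
```

Because only counts are kept, "AOL buys Time-Warner" and "Time-Warner buys AOL" produce the same bag, which is the price paid for the simplicity of this representation.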