Text Mining and Relation Detection

Transcript and Presenter's Notes

Title: Text Mining and Relation Detection


1
Text Mining and Relation Detection
  • Heng Ji
  • hengji_at_cs.qc.cuny.edu
  • March 3, 2009

2
Analysis for Assignment 1
  • Novels, emails, programming books, Obama's and
    Bush's speeches
  • Unigram distributions look similar across two corpora
  • Bigram distributions usually differ much more
  • Adding a stop-word list may help the statistics
  • Late submission policy: 0.2 penalty per day unless
    special permission is granted
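
The unigram/bigram observation above can be checked with a small sketch like the one below. The two sample texts, the Jaccard overlap measure, and the helper names (`ngrams`, `overlap`) are illustrative assumptions, not part of the assignment.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))


def overlap(text_a, text_b, n, stop_words=frozenset()):
    """Jaccard overlap between the n-gram types of two texts.

    Stop words are removed before n-grams are built.
    """
    def types(text):
        tokens = [t for t in text.lower().split() if t not in stop_words]
        return set(ngrams(tokens, n))

    a, b = types(text_a), types(text_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Two toy "corpora" (hypothetical, for illustration only).
NOVEL = "the cat sat on the mat and the dog sat on the rug"
EMAIL = "the meeting is on the agenda and the report is on the desk"

if __name__ == "__main__":
    print("unigram overlap:", overlap(NOVEL, EMAIL, 1))
    print("bigram overlap: ", overlap(NOVEL, EMAIL, 2))
    # With function words removed, the shared vocabulary disappears.
    print("unigram overlap without stop words:",
          overlap(NOVEL, EMAIL, 1, stop_words={"the", "on", "and"}))
```

In this toy pair the shared unigrams are all function words, which is why a stop-word list changes the picture so much.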

3
Outline
  • Text Mining Tools, Techniques, and Applications
  • Motivations and Example Applications
  • Text Mining Defined
  • Foundations of Text Mining
  • Case Study on Relation Detection: K-Nearest
    Neighbor

4
Motivation
  • Approximately 90% of the world's data is held in
    unstructured formats (source: Oracle Corporation)
  • Information-intensive business processes demand
    that we move beyond simple document retrieval
    to knowledge discovery.

[Chart: structured numerical or coded information, 10%; unstructured or semi-structured information, 90%]
5
Text Mining Applications
  • Marketing: discover distinct groups of potential
    buyers according to a text-based user profile
    • e.g., Amazon
  • Industry: identify groups of competitors' web
    pages
    • e.g., competing products and their prices
  • Job seeking: identify parameters for job
    searches
    • e.g., www.flipdog.com

6
Example 1: Mining Medical Literature
  • Medical research
  • Find causal links between symptoms or diseases
    and drugs or chemicals.

7
Example 1: Medical Research
  • What knowledge about migraines can be found in the literature?
  • A text miner can discover:
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated
    in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet
    aggregability
  • magnesium can suppress platelet aggregability
  • (source: Swanson and Smalheiser, 1994)

8
Example 2: Social Networks
A social network is a description of the social
structure between actors, mostly individuals or
organizations. It indicates the ways in which
they are connected through various social
familiarities, ranging from casual acquaintance
to close familial bonds.
Demo: http://langtech.jrc.it/entities/socNet/last.html
9
Example 3: Relation Detection (today's Sample
Project Analysis)
  • Relation: a semantic relationship between two
    entities
  • ACE relation types, with examples:
    • Agent-Artifact: Rubin Military Design, the
      makers of the Kursk
    • Discourse: each of whom
    • Employment/Membership: Mr. Smith, a senior
      programmer at Microsoft
    • Place-Affiliation: Salzburg Red
      Cross officials
    • Person-Social: relatives of the dead
    • Physical: a town some 50 miles south of
      Salzburg
    • Other-Affiliation: Republican senators

Application example: employee CV/publication
mining
Demos:
http://arnetminer.com
http://www.cs.washington.edu/research/textrunner/
http://belobog.si.umich.edu/clair/anthology/other_search.cgi
10
Example 4: Sentiment Analysis
  • Extracting hidden meaning or sentiment based on
    use of language.
  • Example:
    • "Customer is unhappy with their service!"
    • Sentiment: discontent
  • Sentiment is:
    • Emotions: fear, love, hate, sorrow
    • Feelings: warmth, excitement
    • Mood, disposition, temperament
  • Or even (someday):
    • Lies, sarcasm

11
Example 4: Decision Support Using Bank Call
Center Data
  • The Information Source
  • Call center records
  • Example
  • Quick answers to important questions
  • Which offices receive the most angry calls?
  • What products have the fewest satisfied
    customers?
  • (Angry and Satisfied are recognizable
    sentiments)

AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK,
NY, H-SUPRVR8, STMT, mr stark has been with the
company for about 20 yrs. He hates his stmt
format and wishes that we would show a daily
balance to help him know when he falls below the
required balance on the account.
12
Example 5: Personalized Movie Matcher
  • The need
    • Match movies to individuals based on preference
      profile
  • The information
    • Written reviews of movies
    • Users' lists of favorite movies

[Diagram labels: Movie Reviews; Sentiment Analysis; Typed and Tagged Reviews]
13
Personalized Movie Matcher
[Figure: terms arranged between Action and Romance (scale marks 1 and 0): absurdity, conflict, insecurity, crime, injustice, inferiority, death, deception, immorality, horror, destruction, fear]
14
Outline
  • Text Mining Tools, Techniques, and Applications
  • Motivations and Example Applications
  • Text Mining Defined
  • Foundations of Text Mining
  • Case Study on Relation Detection: K-Nearest
    Neighbor

15
Text Mining Defined
  • Discover useful and previously unknown gems of
    information in large text collections

[Figure labels: Patterns, Trends, Associations]
16
What we learned in the last class: comparing IR to
databases
17
Search versus Discover
                            Search (goal-oriented)    Discover (opportunistic)
Structured Data             Data Retrieval            Data Mining
Unstructured Data (Text)    Information Retrieval     Text Mining
18
Data Retrieval
  • Find records within a structured database.

19
Information Retrieval
  • Find relevant information in an unstructured
    information source (usually text)

20
Data Mining
  • Discover new knowledge through analysis of data

21
Text Mining
  • Discover new knowledge through analysis of text

22
Outline
  • Text Mining Tools, Techniques, and Applications
  • Motivations and Example Applications
  • Text Mining Defined
  • Foundations of Text Mining
  • Case Study on Relation Detection: K-Nearest
    Neighbor

23
Challenges of Text Mining
  • Very high number of possible dimensions
    • All possible word and phrase types in the
      language!
  • Noisy data
    • Example: spelling mistakes
  • Not well-structured text
    • Chat rooms
      • "r u available ?"
      • "Hey whazzzzzz up"
    • Speech
  • Complex and subtle relationships between concepts
    in text
    • "AOL merges with Time-Warner"
    • "Time-Warner is bought by AOL"
  • Ambiguity and context sensitivity
    • automobile = car = vehicle = Toyota
    • Apple (the company) or apple (the fruit)

24
Text mining process
25
Text mining process
  • Text preprocessing
    • Syntactic/semantic text analysis
  • Feature generation
    • Bag of words
  • Feature selection
    • Simple counting
    • Statistics
  • Text/data mining
    • Classification (supervised learning)
    • Clustering (unsupervised learning)
  • Analyzing results

26
Text Processing
  • Statistical Analysis
  • Quantify text data
  • Language or Content Analysis
  • Identifying structural elements
  • Extracting and codifying meaning
  • Reducing the dimensions of text data

27
Name Tagging
Token:  George   W.    Bush   discussed   Iraq
Tag:    PER      PER   PER    X           GPE

<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
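
As a minimal sketch of how per-token tags map to the inline markup on this slide, the function below groups consecutive tokens that share a tag and wraps each group in `<TAG>...</TAG>`. The function name `to_inline_markup` is hypothetical, and merging adjacent same-tag tokens is a simplifying assumption: two neighboring names of the same type would be fused, which is why practical taggers usually mark entity boundaries explicitly.

```python
from itertools import groupby


def to_inline_markup(tokens, tags, outside="X"):
    """Wrap runs of identically tagged tokens in <TAG>...</TAG> markup.

    Tokens tagged with `outside` are emitted without markup.
    """
    parts = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        words = " ".join(token for token, _ in group)
        parts.append(words if tag == outside else f"<{tag}>{words}</{tag}>")
    return " ".join(parts)


if __name__ == "__main__":
    tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
    tags = ["PER", "PER", "PER", "X", "GPE"]
    print(to_inline_markup(tokens, tags))
    # <PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
```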
28
Feature selection
  • Using bag of words alone is costly
  • Reduce dimensionality
    • Learners have difficulty with
      high-dimensional tasks
  • Irrelevant features
    • Not all features help!
    • e.g., the presence of a noun in a news article
      is unlikely to help classify it as politics or
      sport

29
Feature Selection Example
Feature:  <George, Capitalized>  <W., Initial>  <Bush, Capitalized>  <discussed, Other>  <Iraq, Capitalized>
Tag:      PER                    PER            PER                  X                   GPE

<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
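
The token features in this example (Capitalized, Initial, Other) can be computed with a few string checks. This is a sketch only: the rule that an "Initial" is one capital letter followed by a period is an assumption, and `shape_feature` is a hypothetical name.

```python
import re


def shape_feature(token):
    """Map a token to one of the coarse shape features used on the slide."""
    if re.fullmatch(r"[A-Z]\.", token):
        return "Initial"        # a single capital letter plus a period, e.g. "W."
    if token[:1].isupper():
        return "Capitalized"    # starts with an uppercase letter, e.g. "George"
    return "Other"              # everything else, e.g. "discussed"


if __name__ == "__main__":
    for token in ["George", "W.", "Bush", "discussed", "Iraq"]:
        print(f"<{token}, {shape_feature(token)}>")
```

Replacing each word with a feature like this is one way to shrink the feature space: thousands of distinct names collapse into a single "Capitalized" value.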
30
Outline
  • Text Mining Tools, Techniques, and Applications
  • Motivations and Example Applications
  • Text Mining Defined
  • Foundations of Text Mining
  • Case Study on Relation Detection: K-Nearest
    Neighbor
  • Project User Interface
  • Project Documentation
  • Sample Project Analysis

31
Case Study: Relation Detection
  • Relation: a semantic relationship between two
    entities
  • ACE relation types, with examples:
    • Agent-Artifact: Rubin Military Design, the
      makers of the Kursk
    • Discourse: each of whom
    • Employment/Membership: Mr. Smith, a senior
      programmer at Microsoft
    • Place-Affiliation: Salzburg Red Cross officials
    • Person-Social: relatives of the dead
    • Physical: a town some 50 miles south of
      Salzburg
    • Other-Affiliation: Republican senators

32
Training Data for Relation Detection
  • Train sample (Employment): the secretary of NIST
  • Train sample (Employment): the previous president of the United States
  • Train sample (Located): his ranch in texas
  • Train sample (Located): US forces in Bahrain
  • Train sample (Employment): Connecticut's governor
33
How would a Decision Tree do this?
No
34
K Nearest Neighbor (KNN) Model
[Figure: an object and its nearest neighbors p1, p2, ..., pm, at distances r1, r2, ...]
  • KNN: find the k nearest neighbors of an object

35
K Nearest Neighbor (KNN) Model
[Figure: one test sample among five train samples, K = 3]
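
A minimal sketch of the KNN idea pictured here, assuming samples are numeric points compared by Euclidean distance and the label is chosen by majority vote among the k nearest train samples. The points and the labels "A" and "B" are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(test_point, training, k=3):
    """Label a point by majority vote among its k nearest training points.

    `training` is a list of (point, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(test_point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: five labeled points in the plane.
TRAINING = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
]

if __name__ == "__main__":
    print(knn_classify((1.1, 1.0), TRAINING, k=3))
    print(knn_classify((5.1, 5.0), TRAINING, k=3))
```

KNN has no training step beyond storing the samples; all the work happens at query time, so the choice of distance function is what really defines the model.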
36
K-nearest Neighbor for Relation Detection
  • Train sample (Employment): the secretary of NIST
  • Train sample (Employment): the previous president of the United States
  • Train sample (Located): his ranch in texas
  • Train sample (Located): US forces in Bahrain
  • Train sample (Employment): Connecticut's governor
  • Test sample: the president of the United States
37
Distance Counting for Relation Detection
Test sample: the president of the United States

Distance from the test sample to each train sample:
  • the secretary of NIST (Employment): 36
  • the previous president of the United States (Employment): 0
  • his ranch in texas (Physical): 46
  • US forces in Bahrain (Physical): 46
  • Connecticut's governor (Employment): 26

Distance penalties:
  • If the heads of the mentions don't match: 8
  • If the entity types of the heads of the mentions
    don't match: 20
  • If the intervening words don't match: 10
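
The sketch below shows one reading of these penalties that reproduces the distances 36, 0, 46, 46 and 26 that appear on this slide. Several details are assumptions rather than something the slide states: the head and type penalties are applied once per argument, arguments are ordered by role rather than by position in the text, and the entity types (PER, ORG, GPE, FAC) assigned to each head are guesses. `Sample`, `distance` and `knn_label` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Sample(NamedTuple):
    text: str
    heads: Tuple[str, str]      # head word of each argument (assumed)
    types: Tuple[str, str]      # entity type of each head (assumed)
    between: Tuple[str, ...]    # words between the two mentions
    label: str = ""


HEAD_PENALTY, TYPE_PENALTY, BETWEEN_PENALTY = 8, 20, 10


def distance(a, b):
    """Sum the slide's penalties over both arguments and the intervening words."""
    total = 0
    for i in range(2):
        if a.heads[i].lower() != b.heads[i].lower():
            total += HEAD_PENALTY
        if a.types[i] != b.types[i]:
            total += TYPE_PENALTY
    if a.between != b.between:
        total += BETWEEN_PENALTY
    return total


def knn_label(test, train, k=3):
    """Majority label among the k train samples closest to the test sample."""
    nearest = sorted(train, key=lambda s: distance(test, s))[:k]
    return Counter(s.label for s in nearest).most_common(1)[0][0]


TRAIN = [
    Sample("the secretary of NIST", ("secretary", "NIST"), ("PER", "ORG"), ("of",), "Employment"),
    Sample("the previous president of the United States",
           ("president", "United States"), ("PER", "GPE"), ("of",), "Employment"),
    Sample("his ranch in texas", ("ranch", "texas"), ("FAC", "GPE"), ("in",), "Physical"),
    Sample("US forces in Bahrain", ("forces", "Bahrain"), ("ORG", "GPE"), ("in",), "Physical"),
    Sample("Connecticut's governor", ("governor", "Connecticut"), ("PER", "GPE"), (), "Employment"),
]
TEST = Sample("the president of the United States",
              ("president", "United States"), ("PER", "GPE"), ("of",))

if __name__ == "__main__":
    print([distance(TEST, s) for s in TRAIN])   # [36, 0, 46, 46, 26]
    print(knn_label(TEST, TRAIN, k=3))          # Employment
```

With K = 3 the three closest train samples (distances 0, 26 and 36) are all Employment, so the test sample is labeled Employment.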

38
Syntactic / Semantic text analysis
  • Part-of-speech (POS) tagging
    • Find the corresponding POS for each word
    • e.g., John (noun) gave (verb) the (det) ball
      (noun)
    • 98% accurate
  • Word sense disambiguation
    • Context-based or proximity-based
    • Very accurate
  • Parsing
    • Generates a parse tree (graph) for each sentence
    • Each sentence is a standalone graph

39
Feature Generation: Bag of Words
  • A text document is represented by the words it
    contains (and their occurrence counts)
    • e.g., "Lord of the rings" → {"the", "Lord",
      "rings", "of"}
  • Highly efficient
  • Makes learning far simpler and easier
  • Order of words is not that important for certain
    applications
  • Stemming: identifies a word by its root
    • e.g., flying, flew → fly
    • Reduces dimensionality
  • Stop words: the most common words are unlikely to
    help text mining
    • e.g., the, a, an, you
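
A small sketch of the bag-of-words representation with optional stop-word removal and root mapping. The `ROOTS` lookup table is a toy stand-in for a real stemmer or lemmatizer (an assumption made to keep the example self-contained), and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

STOP_WORDS = frozenset({"the", "a", "an", "you"})   # the examples listed on the slide
ROOTS = {"flying": "fly", "flew": "fly"}            # toy root table, not a real stemmer


def bag_of_words(text, stop_words=frozenset(), roots=None):
    """Count word occurrences, ignoring word order.

    Tokens are lowercased and stripped of surrounding punctuation; stop words
    are dropped and remaining tokens are mapped to their root when one is known.
    """
    roots = roots or {}
    tokens = (t.strip(".,;:!?\"'").lower() for t in text.split())
    return Counter(roots.get(t, t) for t in tokens if t and t not in stop_words)


if __name__ == "__main__":
    print(bag_of_words("Lord of the rings"))
    print(bag_of_words("The birds flew; you saw a bird flying", STOP_WORDS, ROOTS))
```

In the second call "flew" and "flying" collapse into a single "fly" feature and the stop words vanish, which is the dimensionality reduction the slide refers to.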