1
Emerging Technologies of Text Mining
Masrah Azrifah Azmi Murad
Department of Information Systems
2
Outline
  • Introduction
  • Text vs. Data Mining
  • Motivation and challenges of TM
  • Text mining processes
  • Techniques in text mining
  • Some application areas
  • TM Commercial Tools

3
Definition
  • The non-trivial extraction of implicit,
    previously unknown, and potentially useful
    information from (large amounts of) textual data
  • Goal: discover unknown information from any
    sources or documents (unstructured data) to form
    new facts
  • Cousin to data mining
  • E.g., e-mails, memos, reports, contracts, phone
    calls, and documents on the WWW

Source: Automated Learning Group, NCSA
4
Search vs. Discover
  • Structured Data: searching (goal-oriented) is
    Data Retrieval; discovering (opportunistic) is
    Data Mining
  • Unstructured Data (Text): searching
    (goal-oriented) is Information Retrieval;
    discovering (opportunistic) is Text Mining
Source: AvaQuest Inc, 2002
5
Data Retrieval
  • Find records within a structured database.

Source: Swanson and Smalheiser, 1994
6
Information Retrieval
  • Find relevant information in an unstructured
    information source (usually text)

Source: Swanson and Smalheiser, 1994
7
Data Mining
  • Discover new knowledge through analysis of data

Source: Swanson and Smalheiser, 1994
8
Text Mining
  • Discover new knowledge through analysis of text

Source: Swanson and Smalheiser, 1994
9
Motivation for Text Mining
  • Approximately 90% of the world's data is held in
    unstructured formats (source: Oracle Corporation)
  • Information-intensive business processes demand
    that we move beyond simple document retrieval to
    knowledge discovery.

(Chart) Structured numerical or coded information:
about 10%; unstructured or semi-structured
information: about 90%.
Source: Swanson and Smalheiser, 1994
10
Challenges of Text Mining (1)
  • Large textual databases
  • The Web is growing
  • Publications are electronic
  • High dimensionality
  • Consider each word/phrase as a dimension
  • Complex and subtle relationships between concepts
    in text
  • "AOL merges with Time-Warner"
  • "Time-Warner is bought by AOL"

Source: Automated Learning Group, NCSA
11
Challenges of Text Mining (2)
  • Ambiguity
  • Word ambiguity
  • Pronouns (he, she, ...)
  • Synonyms (automobile, car, vehicle, Proton)
  • Words with multiple meanings (bat: baseball bat
    or the mammal)
  • Semantic ambiguity
  • "The king saw the rabbit with his glasses"
    (multiple meanings)
  • Noisy data
  • Spelling mistakes
  • Abbreviations
  • Acronyms

Source: Automated Learning Group, NCSA
12
Challenges of Text Mining (3)
  • Text that is not well structured
  • Email/chat rooms
  • "r u available ?"
  • "Hey whazzzzzz up"
  • Speech
  • Order of words in the query
  • "hot dog stand in the amusement park"
  • "hot amusement stand in the dog park"

Source: Automated Learning Group, NCSA
13
More Issues in Natural Language
  • Ambiguity, e.g., "Squad helps dog bite victim"
  • Anaphora (a substitute for a preceding word),
    e.g., "After Mary proposed to John, they found a
    preacher and got married. For the honeymoon, they
    went to Hawaii."
  • Indexicality (points to some state of affairs),
    e.g., "I am over here." "Why did you do that?"
  • Noncompositionality, e.g., "baby shoes"
  • Discourse structure: speech or writing that runs
    longer than a sentence
  • Metonymy (using one noun phrase to stand for
    another), e.g., "Chrysler announced record
    profits"
  • Metaphor (a way to describe something), e.g.,
    "Brian was a wall" (bouncing every tennis ball
    back over the net)

14
Text Mining Process
  • Text Preprocessing
  • Syntactic/Semantic Text Analysis
  • Feature Generation
  • Bag of Words
  • Feature Selection
  • Simple Counting
  • Statistics
  • Text/Data Mining
  • Classification (Supervised Learning)
  • Clustering (Unsupervised Learning)
  • Analyzing Results
    (a minimal end-to-end sketch of these steps
    follows below)

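The steps above can be illustrated end to end with a small
scikit-learn sketch (an assumed library choice; the corpus,
labels, and parameter values are invented for illustration and
are not taken from the slides):

```python
# Minimal text-mining pipeline sketch: feature generation
# (bag of words), feature selection, supervised learning
# (classification), and analysis of one prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["stock markets fell sharply today",
        "the team won the championship game",
        "investors worry about interest rates",
        "the striker scored twice in the final"]
labels = ["finance", "sports", "finance", "sports"]

pipeline = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),  # bag of words
    ("select", SelectKBest(chi2, k=5)),              # feature selection
    ("clf", MultinomialNB()),                        # classification
])
pipeline.fit(docs, labels)
print(pipeline.predict(["rates and markets moved today"]))
```
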
15
Text Preprocessing: Syntactic/Semantic Text Analysis
  • Part-of-Speech (PoS) Tagging
  • Find the corresponding PoS for each word
  • e.g., John (noun) gave (verb) the (det) ball
    (noun)
  • Word Sense Disambiguation
  • Context-based or proximity-based
  • Parsing
  • e.g., (plastic bottle) holder vs. plastic (bottle
    holder)
  • Phrase Recognition/Collocations
  • e.g., Kuala Lumpur, interest rate (a PoS-tagging
    sketch follows below)

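As a concrete illustration of PoS tagging (NLTK is an assumed
library choice, not one named in the slides):

```python
# Part-of-speech tagging sketch with NLTK.
# Note: newer NLTK releases may instead need the "punkt_tab" and
# "averaged_perceptron_tagger_eng" resources.
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # PoS tagger model

tokens = nltk.word_tokenize("John gave the ball to Mary")
print(nltk.pos_tag(tokens))
# e.g., [('John', 'NNP'), ('gave', 'VBD'), ('the', 'DT'),
#        ('ball', 'NN'), ('to', 'TO'), ('Mary', 'NNP')]
```
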
16
Feature Generation: Bag of Words
  • A text document is represented by the words it
    contains (and their occurrences)
  • e.g., "Lord of the rings" → {the, Lord, rings, of}
  • Highly efficient
  • Makes learning far simpler and easier
  • Order of words is not that important for certain
    applications
  • Stemming
  • Reduces dimensionality
  • Identifies a word by its root
  • e.g., flying, flew → fly
  • Stop words
  • Identify the most common words that are unlikely
    to help with text mining
  • e.g., the, a, an, you (a bag-of-words sketch
    follows below)

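A minimal sketch of these ideas, using scikit-learn for the
bag-of-words counts with English stop-word removal and NLTK's
Porter stemmer for stemming (both library choices are
assumptions for illustration):

```python
# Bag-of-words features with stop-word removal, plus stemming.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The Lord of the Rings", "The lord rings the bell"]

vectorizer = CountVectorizer(stop_words="english")  # drops "the", "of", ...
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # remaining word features
print(X.toarray())                         # per-document word counts

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["flying", "rings", "ringing"]])
# -> ['fli', 'ring', 'ring']; irregular forms such as "flew" need a
#    lemmatizer rather than a stemmer
```
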
17
Feature Selection
  • Reduce Dimensionality
  • Learners have difficulty addressing tasks with
    high dimensionality
  • Irrelevant Features
  • Not all features help!
  • Remove features that occur in only a few
    documents
  • Reduce features that occur in too many documents
    (a document-frequency sketch follows below)

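One common way to apply these document-frequency cut-offs (an
illustrative choice, not one prescribed by the slide) is the
min_df/max_df parameters of scikit-learn's CountVectorizer:

```python
# Drop terms that appear in fewer than 2 documents (too rare) or
# in more than 80% of documents (too common).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["price of oil rises", "oil price falls", "oil supply steady",
        "football match tonight", "oil exports grow"]

vectorizer = CountVectorizer(min_df=2, max_df=0.8)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # only terms passing both cut-offs
```
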
18
Supervised vs. Unsupervised Learning
  • What Is Good Classification?
  • Correct classification
  • The known label of a test example is identical to
    the class predicted by the model
  • Accuracy ratio
  • Percentage of test set examples that are correctly
    classified by the model
  • A distance measure between classes can be used
  • e.g., classifying a "football" document as a
    "basketball" document is not as bad as
    classifying it as "crime"
  • What Is Good Clustering?
  • Produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
    (an evaluation sketch follows below)

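A small evaluation sketch covering both sides (scikit-learn
assumed; labels and points are toy data): accuracy for
classification, and the silhouette score as one proxy for high
intra-class / low inter-class similarity in clustering:

```python
# Classification accuracy and a clustering-quality proxy.
import numpy as np
from sklearn.metrics import accuracy_score, silhouette_score

# Accuracy ratio: fraction of test examples predicted correctly.
true_labels = ["sports", "finance", "sports", "crime"]
predicted   = ["sports", "finance", "crime",  "crime"]
print("accuracy:", accuracy_score(true_labels, predicted))   # 0.75

# Silhouette: higher means tight clusters that are well separated.
points = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
cluster_ids = [0, 0, 1, 1]
print("silhouette:", silhouette_score(points, cluster_ids))
```
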
19
Classification Techniques
  • Neural networks
  • Decision trees
  • Bayesian classification
  • Nearest-neighbor
  • SVM (support vector machines)
  • HMM (hidden Markov models)
    (a linear SVM sketch follows below)

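As one concrete example from the list above (toy training texts;
scikit-learn assumed), a linear SVM over TF-IDF features:

```python
# Linear SVM text classification sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["cheap pills buy now", "meeting agenda attached",
              "win money fast", "quarterly report for review"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()
clf = LinearSVC()
clf.fit(vectorizer.fit_transform(train_docs), train_labels)
print(clf.predict(vectorizer.transform(["win cheap money now"])))
```
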
20
Clustering Techniques
  • k-Means (see the sketch below)
  • Fuzzy c-Means
  • Hierarchical clustering

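A minimal k-Means sketch over TF-IDF document vectors (toy
documents; scikit-learn assumed):

```python
# k-Means clustering of documents represented as TF-IDF vectors.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["oil prices rise", "crude oil market",
        "team wins final", "striker scores goal"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))   # one cluster id per document, e.g. [0 0 1 1]
```
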
21
Techniques in Text Mining (1)
  • Information extraction - find particular pieces
    of information from text documents and determine
    the relationships that hold between them
  • Thematic indexing - identify the dominant theme of
    a particular group of documents by automatically
    extracting the key feature of the group. For
    example, documents about oranges and apples may be
    classified under "fruit" or "plantation"
  • Categorization - automatically organizes
    documents into user-defined categories or
    taxonomies
  • Clustering - groups together conceptually related
    documents, enabling identification of duplicates
    and near-duplicates (a near-duplicate sketch
    follows below)

Source: Sullivan, 2000
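
As one illustration of the duplicate/near-duplicate idea (an
assumed approach, not one the slide prescribes): pairwise cosine
similarity of TF-IDF vectors with a hand-picked threshold.

```python
# Flag near-duplicate documents via cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The bank raised interest rates today.",
        "Interest rates were raised by the bank today.",
        "The football season starts next week."]

sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sims[i, j] > 0.6:        # threshold chosen for illustration
            print(f"documents {i} and {j} look like near-duplicates")
```
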
22
Techniques in Text Mining (2)
  • Summarization - get the "gist" of a document or
    document collection, e.g., headlines of newspapers
    and TV news, previews of movies, bulletins of
    weather forecasts, and minutes of a meeting
  • Foreign Language Text Mining - enables an
    organization to make effective use of foreign
    language data, even in the absence of staff with
    foreign language skills
  • Topic Modeling - looks for patterns of words that
    tend to occur together in documents, then
    automatically categorizes those words into topics
    (a topic-modeling sketch follows below)

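A minimal topic-modeling sketch with Latent Dirichlet Allocation
(scikit-learn assumed; the corpus is toy data):

```python
# Topic modeling: find groups of words that co-occur across documents.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["oil gas energy prices", "energy market oil supply",
        "football match goal striker", "goal striker team match"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {topic_id}: {top_words}")
```
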
23
Application Areas (1)
  • Information Retrieval
  • Indexing and retrieval of textual documents
  • Finding a set of (ranked) documents that are
    relevant to a query (a ranking sketch follows
    after this list)
  • Bioinformatics
  • Mining scientific journals for critical
    information associated with genes and proteins,
    e.g., genes and their associated functions,
    diseases, and tissues
  • Email
  • Spam filtering

Source: Automated Learning Group, NCSA
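
A small sketch of the ranked-retrieval idea referenced above:
score each document by the cosine similarity between its TF-IDF
vector and the query's (an illustrative approach; library and
corpus are assumptions):

```python
# Rank documents against a query by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["gene expression in liver tissue",
        "spam filtering with naive bayes",
        "protein function and disease genes"]
query = "genes associated with disease"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:          # highest score first
    print(f"{scores[idx]:.2f}  {docs[idx]}")
```
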
24
Application Areas (2)
  • News Feeds
  • Discover what is interesting and provide a
    summary
  • Homeland Security and Intelligence (US)
  • Analysis of terrorist networks; rapid
    identification of critical information about such
    topics as weapons of mass destruction from very
    large collections of text documents; surveillance
    of the Web, e-mails, or chat rooms
  • Marketing
  • Discover distinct groups of potential buyers and
    make suggestions for other products

Source: Automated Learning Group, NCSA
25
What is Information Extraction?
Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Responsibilities: This is an exciting opportunity
with Siemens Wireless Terminals, a start-up
venture fully capitalized by a Global Leader in
Advanced Technologies. Qualified candidates will be
responsible for assisting with requirements
definition, analysis, design and implementation
that meet objectives; codes difficult and
sophisticated routines. Develops project plans,
schedules and cost data. Develop test plans and
implement physical design of databases. Develop
shell scripts for administrative and background
tasks, stored procedures and triggers. Using
Oracle's Designer 2000, assist with Data Model
maintenance and assist with applications
development using Oracle Forms. Qualifications:
BSCS, BSMIS or closely related field or related
equivalent knowledge normally obtained through
technical education programs. 5-8 years of
professional experience in development, system
design analysis, programming, installation using
Oracle development
  • Given
  • A source of textual documents
  • A well-defined, limited query (text based)
  • Find
  • Sentences with relevant information
  • Extract the relevant information and ignore
    non-relevant information (important!)
  • Link related information and output it in a
    predetermined format (a toy extraction sketch
    follows below)

Source: Automated Learning Group, NCSA
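
A toy sketch of the extraction step referenced above: pull a few
fields out of a job posting with regular expressions and emit
them in a fixed record format. The patterns and field names are
invented for illustration and are far simpler than a real IE
system.

```python
# Toy information extraction: selected fields from free text,
# output in a predetermined (structured) format.
import re

posting = """Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Qualifications: BSCS, BSMIS or closely related field.
5-8 years of professional experience in development."""

record = {
    "title":    re.search(r"^(.*?) - ", posting).group(1),
    "location": re.search(r"\(([^)]+)\)", posting).group(1),
    "code":     re.search(r"Response Code:\s*(\S+)", posting).group(1),
    "years":    re.search(r"(\d+-\d+) years", posting).group(1),
}
print(record)
# {'title': 'Advisory Programmer', 'location': 'Austin, TX',
#  'code': '1008-0074-97-iexc-jcn', 'years': '5-8'}
```
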
26
Extra-semantic Information
  • Extracting hidden meaning or sentiment based on
    the use of language
  • Examples
  • "Customer is unhappy with their service!"
  • Sentiment: discontent
  • Sentiment includes
  • Emotions: fear, love, hate, sorrow
  • Feelings: warmth, excitement
  • Mood, disposition, temperament, ...
  • Or even (someday)
  • Lies, sarcasm
    (a lexicon-based sentiment sketch follows below)

Source: Swanson and Smalheiser, 1994
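
A lexicon-based sentiment sketch using NLTK's VADER analyzer (an
assumed tool choice; the example sentences are illustrative):

```python
# Score the sentiment of short texts with a sentiment lexicon.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for text in ["Customer is unhappy with their service!",
             "The new statement format is wonderful."]:
    scores = analyzer.polarity_scores(text)   # neg/neu/pos/compound
    print(f"{scores['compound']:+.2f}  {text}")
```
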
27
Web Mining - Content
  • Enormous wealth of textual information on the Web
  • Book/CD/Video stores (e.g., Amazon)
  • Restaurant information (e.g., Zagats)
  • Car prices (e.g., Carpoint)
  • Hyper-link information
  • Web is very dynamic
  • Web pages are constantly being generated (and
    removed)
  • Web pages are generated from database queries
  • Technologies used
  • NLP
  • IR

Source: Automated Learning Group, NCSA
28
Medical Research
  • Find causal links between symptoms or diseases
    and drugs or chemicals
  • E.g., objective: follow chains of causal
    implication to discover a relationship between
    migraines and biochemical levels
  • Data: medical research papers, medical news
  • Key concept types: symptoms, drugs, diseases,
    chemicals

Source: Swanson and Smalheiser, 1994
29
Example Application
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated
    in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet
    aggregability
  • magnesium can suppress platelet aggregability
    (a sketch of chaining these assertions follows
    below)

Source: Swanson and Smalheiser, 1994
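
A toy sketch of the underlying idea (literature-based discovery):
two concepts that share enough intermediate concepts in the
literature become a candidate hypothesis. The pairs below
paraphrase the slide's statements; the logic is a deliberate
simplification.

```python
# Propose an indirect migraine-magnesium link from shared
# intermediate concepts mined from the literature.
from collections import defaultdict

assertions = [
    ("migraine", "stress"), ("magnesium", "stress"),
    ("migraine", "spreading cortical depression"),
    ("magnesium", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
    ("migraine", "calcium channel blockers"),
    ("magnesium", "calcium channel blockers"),
]

neighbours = defaultdict(set)
for concept, intermediate in assertions:
    neighbours[concept].add(intermediate)

shared = neighbours["migraine"] & neighbours["magnesium"]
print("hypothesis: migraine <-> magnesium, via", sorted(shared))
```
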
30
Business Applications
  • Ex 1: Decision Support in CRM
  • What are customers' typical complaints?
  • What is the trend in the number of satisfied
    customers in Cleveland? In NY?
  • Ex 2: Knowledge Management
  • People Finder
  • Ex 3: Personalization in eCommerce
  • Suggest products that fit a user's interest
    profile (even based on personality info).

Source: Swanson and Smalheiser, 1994
31
Example 1: Decision Support Using Bank Call
Center Data
  • The Needs
  • Analysis of call records as input into the
    decision-making process of the bank's management
  • Quick answers to important questions
  • Which offices receive the most angry calls?
  • Which products have the fewest satisfied
    customers?
  • ("angry" and "satisfied" are recognizable
    sentiments)
  • User-friendly interface and visualization tools

Source: Swanson and Smalheiser, 1994
32
Example 1: Decision Support Using Bank Call
Center Data
  • The Information Source
  • Call center records
  • Example

AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK,
NY, H-SUPRVR8, STMT, mr stark has been with the
company for about 20 yrs. He hates his stmt
format and wishes that we would show a daily
balance to help him know when he falls below
the required balance on the account.
Source: Swanson and Smalheiser, 1994
33
(Chart) Example 1: Call Volume by Sentiment
Source: Swanson and Smalheiser, 1994
34
Example 2: KM People Finder
  • The Needs
  • Find people, as well as documents, that can
    address my information need
  • Promote collaboration and knowledge sharing
  • Leverage the existing information access system
  • The Information Sources
  • Email, groupware, online reports, ...

Source: Swanson and Smalheiser, 1994
35
Example 2: Simple KM People Finder
(Diagram) Components: Query, Search or Navigation
System, Relevant Docs, Name Extractor, Authority
List, Ranked People Names
Source: Swanson and Smalheiser, 1994
36
Example 2: KM People Finder
Source: Swanson and Smalheiser, 1994
37
Example 3: Personalized Movie Matcher
  • The Need
  • Match movies to individuals based on a preference
    profile
  • The Information
  • Written reviews of movies
  • Users' lists of favorite movies

(Diagram) Sentiment Analysis applied to Movie
Reviews, producing Typed and Tagged Reviews
Source: Swanson and Smalheiser, 1994
38
Sentiment Analysis of Movies: Visualization
(Chart) Visualization of movie sentiment on a 0-1
scale; labels include Action, Romance, absurdity,
conflict, insecurity, crime, injustice, inferiority,
death, deception, immorality, horror, destruction,
and fear
Source: Swanson and Smalheiser, 1994
39
Commercial Tools
  • IBM Intelligent Miner for Text
  • Semio Map
  • InXight LinguistX / ThingFinder
  • LexiQuest
  • ClearForest
  • Teragram
  • SRA NetOwl Extractor
  • Autonomy

Source: Swanson and Smalheiser, 1994
40
  • Thank you