Text Mining: Finding Nuggets in Mountains of Textual Data

1
Text Mining: Finding Nuggets in Mountains of Textual Data
  • Jochen Dörre, Peter Gerstl, and Roland Seiffert

2
Overview
  • Introduction to Mining Text
  • How Text Mining Differs from Data Mining
  • Mining Within a Document: Feature Extraction
  • Mining in Collections of Documents: Clustering
    and Categorization
  • Text Mining Applications
  • Exam Questions/Answers

3
Introduction to Mining Text
4
Reasons for Text Mining
5
Corporate Knowledge Ore
  • Email
  • Insurance claims
  • News articles
  • Web pages
  • Patent portfolios
  • Customer complaint letters
  • Contracts
  • Transcripts of phone calls with customers
  • Technical documents

6
Challenges in Text Mining
  • Information is in unstructured textual form.
  • Not readily accessible to be used by computers.
  • Dealing with huge collections of documents

7
Two Mining Phases
  • Knowledge Discovery: extraction of codified
    information (features)
  • Information Distillation: analysis of the feature
    distribution

8
How Text Mining Differs from Data Mining
9
Comparison of Procedures
Data Mining
  • Identify data sets
  • Select features
  • Prepare data
  • Analyze distribution
Text Mining
  • Identify documents
  • Extract features
  • Select features by algorithm
  • Prepare data
  • Analyze distribution

10
IBM Intelligent Miner for Text
  • SDK: Software Development Kit
  • Contains the components necessary for real text
    mining
  • Also contains more traditional components: the
    IBM Text Search Engine, the IBM Web Crawler, and
    drop-in intranet search solutions

11
Mining Within a Document: Feature Extraction
12
Feature Extraction
  • To recognize and classify significant vocabulary
    items in unrestricted natural language texts.
  • Let's see an example

13
Example of Vocabulary Found
  • Certificate of deposit
  • CMOs
  • Commercial bank
  • Commercial paper
  • Commercial Union Assurance
  • Commodity Futures Trading Commission
  • Consul Restaurant
  • Convertible bond
  • Credit facility
  • Credit line
  • Debt security
  • Debtor country
  • Detroit Edison
  • Digital Equipment
  • Dollars of debt
  • End-March
  • Enserch
  • Equity warrant
  • Eurodollar

14
Implementation of Feature Extraction Relies On
  • Linguistically motivated heuristics
  • Pattern matching
  • Limited amounts of lexical information, such as
    part-of-speech information
  • Does not use huge amounts of lexicalized
    information
  • Does not use in-depth syntactic and semantic
    analyses of texts
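
A minimal sketch of what pattern matching without deeper analysis can look like. The regular expression and the function name `extract_candidates` are illustrative assumptions, not the actual Intelligent Miner for Text heuristics: the pattern simply treats any run of two or more capitalized words as a candidate name.

```python
import re

# Assumed heuristic for illustration only: two or more consecutive
# capitalized words form a candidate multiword name.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_candidates(text: str) -> list[str]:
    """Return capitalized multiword sequences found by pattern matching alone."""
    return CANDIDATE.findall(text)


sample = "Shares of Detroit Edison rose after Digital Equipment reported earnings."
print(extract_candidates(sample))  # ['Detroit Edison', 'Digital Equipment']
```

The sentence-initial word "Shares" is skipped because it is not followed by another capitalized word; no syntactic or semantic analysis is involved.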

15
Goals of Feature Extraction
  • Very fast processing to be able to deal with mass
    data
  • Domain-independence for general applicability

16
Extracted information categories
  • Names of persons, organizations and places
  • Multiword terms
  • Abbreviations
  • Relations
  • Other useful items (e.g., dates and numbers)

17
Canonical Forms
  • Normalized forms of dates, numbers, etc.
  • Allows applications to use information very
    easily
  • Abstracts from different morphological variants
    of a single term
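
As a rough illustration of canonical forms, the sketch below maps several surface forms of a date to one normalized string and does the same for numbers. The list of accepted formats and the function names are assumptions made for this example; month names are parsed with the default (English) locale.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats; a real system would cover many more variants.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y", "%Y-%m-%d")


def canonical_date(text: str) -> Optional[str]:
    """Map a date written in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None  # not recognized as a date


def canonical_number(text: str) -> float:
    """Strip thousands separators so '1,250.5' and '1250.50' compare equal."""
    return float(text.replace(",", ""))


print(canonical_date("March 5, 1999"))  # 1999-03-05
print(canonical_date("5 March 1999"))   # 1999-03-05
```

Once every variant maps to the same canonical string, an application can compare or index the values directly.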

18
Canonical Names
Variants: President Bush, Mr. Bush, George Bush
Canonical name: George Bush
  • The canonical name is the most explicit, least
    ambiguous name constructed from the different
    variants found in the document
  • Reduces ambiguity of variants
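
One way to picture the choice of a canonical name is sketched below. The title list and the "most name tokens wins" rule are simplifying assumptions for illustration, not the rules the Nominator program actually applies.

```python
# Assumed, incomplete list of titles to ignore when comparing variants.
TITLES = {"President", "Mr.", "Mrs.", "Ms.", "Dr."}


def strip_title(variant: str) -> tuple[str, ...]:
    """Drop title tokens, keeping only the name tokens."""
    return tuple(tok for tok in variant.split() if tok not in TITLES)


def canonical_name(variants: list[str]) -> str:
    """Pick the variant with the most name tokens once titles are removed."""
    stripped = [strip_title(v) for v in variants]
    return " ".join(max(stripped, key=len))


print(canonical_name(["President Bush", "Mr. Bush", "George Bush"]))  # George Bush
```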

19
Disambiguating Proper Names: Nominator Program
20
Principles of Nominator Design
  • Apply heuristics to strings, instead of
    interpreting semantics.
  • The unit of context for extraction is a document.
  • The unit of context for aggregation is a corpus.
  • The heuristics represent English naming
    conventions.

21
Mining in Collections of Documents: Clustering
and Categorization
22
1. Clustering
  • Partitions a given collection into groups of
    documents similar in contents, i.e., in their
    feature vectors.
  • Two clustering engines: the Hierarchical
    Clustering tool and the Binary Relational
    Clustering tool
  • Both tools help to identify the topic of a group
    by listing terms or words that are common in the
    documents in the group.
  • This provides an overview of the contents of a
    collection of documents
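
The idea of grouping documents by their feature vectors and labeling each group with its shared terms can be sketched as follows. This is a toy single-link agglomerative procedure over feature sets with Jaccard similarity; the similarity measure, the threshold, and the sample documents are assumptions, and the code does not reproduce either IBM clustering engine.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two feature sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs: dict[str, set[str]], threshold: float = 0.3) -> list[list[str]]:
    """Greedy single-link agglomerative clustering over feature sets."""
    clusters = [[name] for name in docs]
    while True:
        best, pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            sim = max(jaccard(docs[a], docs[b])
                      for a in clusters[i] for b in clusters[j])
            if sim >= best:
                best, pair = sim, (i, j)
        if pair is None:          # no pair is similar enough: stop merging
            return clusters
        i, j = pair               # i < j, so popping j leaves index i valid
        merged = clusters.pop(j)
        clusters[i].extend(merged)


def common_terms(docs: dict[str, set[str]], group: list[str]) -> set[str]:
    """Terms shared by every document in a group, used to label its topic."""
    return set.intersection(*(docs[name] for name in group))


docs = {
    "d1": {"credit line", "commercial bank", "debt security"},
    "d2": {"credit line", "commercial bank", "convertible bond"},
    "d3": {"web crawler", "search engine"},
    "d4": {"web crawler", "search engine", "intranet"},
}
for group in cluster(docs):
    print(group, sorted(common_terms(docs, group)))
```

With these sample documents the finance documents end up in one group and the search documents in another, each labeled by the terms its members share.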

23
Groups documents similar in their feature vectors
24
2. Categorization
  • Topic Categorization Tool
  • Assign documents to preexisting categories
    (topics or themes)
  • Categories are chosen to match the intended use
    of the collection
  • Categories are defined by providing a set of
    sample documents for each category

25
2. Categorization (cont.)
  • This training phase produces a special index,
    called the categorization schema
  • The categorization tool returns a list of
    category names and confidence levels for each
    document
  • If the confidence level is low, the document is
    put aside for a human categorizer
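
The train-then-categorize flow described above can be sketched in a few lines. Everything here is an assumption made for illustration: the "schema" is just a term-count profile per category, confidence is cosine similarity, and the `min_confidence` cutoff is arbitrary. The actual Topic Categorization tool's index and scoring are not described in these slides.

```python
from collections import Counter
from math import sqrt


def train(samples: dict[str, list[str]]) -> dict[str, Counter]:
    """Build a toy 'categorization schema': one term-count profile per category."""
    return {cat: Counter(w for doc in docs for w in doc.lower().split())
            for cat, docs in samples.items()}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def categorize(schema: dict[str, Counter], text: str, min_confidence: float = 0.2):
    """Return (category, confidence) pairs, best first, plus a human-review flag."""
    doc = Counter(text.lower().split())
    ranked = sorted(((cat, cosine(doc, prof)) for cat, prof in schema.items()),
                    key=lambda item: item[1], reverse=True)
    needs_review = not ranked or ranked[0][1] < min_confidence
    return ranked, needs_review


schema = train({
    "billing": ["invoice charged twice", "refund my invoice"],
    "shipping": ["package arrived late", "late delivery of package"],
})
ranked, needs_review = categorize(schema, "my invoice was charged twice")
print(ranked[0][0], needs_review)
```

A document whose best score falls below the cutoff is flagged rather than assigned, which mirrors setting low-confidence documents aside for a human categorizer.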

26
2. Categorization (cont.)
  • Effectiveness
  • Tests have shown that the Topic Categorization
    tool agrees with human categorizers to the same
    degree as human categorizers agree with one
    another.

27
[Figure: a set of sample documents feeds the training phase, which produces a special index used to categorize new documents; the tool returns a list of category names and confidence levels for each document]
28
Text Mining Applications
29
Main Advantages of mining technology over
traditional information broker business
  • Ability to quickly process large amounts of
    textual data
  • Objectivity and customizability
  • Automation

30
Applications used to
  • Gain insights about trends, relations between
    people/places/organizations
  • Classify and organize documents according to
    their content
  • Organize repositories of document-related
    meta-information for search and retrieval
  • Retrieve documents

31
Main Applications
  • Knowledge Discovery
  • Information Distillation

32
CRI: Customer Relationship Intelligence
  • Appropriate documents selected
  • Converted to common format
  • Feature extraction and clustering tools are used
    to create a database
  • User may select parameters for the preprocessing
    and clustering steps
  • Clustering produces groups of feedback that share
    important linguistic elements
  • Categorization tool used to assign new incoming
    feedback to identified categories.

33
CRI (continued)
  • Knowledge Discovery: clustering is used to create
    a structure that can be interpreted
  • Information Distillation: refinement and
    extension of the clustering results
  • Interpreting the results
  • Tuning of the clustering process
  • Selecting meaningful clusters

34
Exam Question 1
  • Name an example of each of the two main classes
    of applications of text mining.
  • Knowledge Discovery: discovering a common
    customer complaint among much feedback.
  • Information Distillation: filtering future
    comments into predefined categories.

35
Exam Question 2
  • How does the procedure for text mining differ
    from the procedure for data mining?
  • Adds feature extraction function
  • Not feasible to have humans select features
  • Highly dimensional, sparsely populated feature
    vectors
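
To make "highly dimensional, sparsely populated" concrete, the sketch below builds term-count vectors for three made-up documents and measures what fraction of the document-by-term matrix is nonzero. The sample sentences and whitespace tokenization are assumptions chosen for brevity.

```python
from collections import Counter

docs = [
    "credit line extended by commercial bank",
    "convertible bond issued as debt security",
    "commercial paper and eurodollar deposits",
]

# Sparse representation: each document stores only the terms it contains.
vectors = [Counter(d.split()) for d in docs]

# Every distinct term in the collection is one dimension of the feature space.
vocabulary = sorted(set().union(*vectors))

nonzero = sum(len(v) for v in vectors)
density = nonzero / (len(docs) * len(vocabulary))
print(len(vocabulary), nonzero, round(density, 2))
```

Even in this tiny collection most cells are empty; with realistic vocabularies the number of dimensions grows far faster than the number of terms in any single document, which is why features must be selected by algorithm rather than by hand.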

36
Exam Question 3
  • In the Nominator program of IBM's Intelligent
    Miner for Text, an objective of the design is to
    enable rapid extraction of names from large
    amounts of text. How does this decision affect
    the ability of the program to interpret the
    semantics of text?
  • Does not perform in-depth syntactic or semantic
    analyses of texts

37
THE END
http://www-3.ibm.com/software/data/iminer/fortext/