Title: Text Mining: Finding Nuggets in Mountains of Textual Data
1Text Mining: Finding Nuggets in Mountains of Textual Data
- Jochen Dörre, Peter Gerstl, and Roland Seiffert
2Overview
- Introduction to Mining Text
- How Text Mining differs from data mining
- Mining Within a Document: Feature Extraction
- Mining in Collections of Documents: Clustering and Categorization
- Text Mining Applications
- Exam Questions/Answers
3Introduction to Mining Text
4Reasons for Text Mining
5Corporate Knowledge Ore
- Email
- Insurance claims
- News articles
- Web pages
- Patent portfolios
- Customer complaint letters
- Contracts
- Transcripts of phone calls with customers
- Technical documents
6Challenges in Text Mining
- Information is in unstructured textual form.
- Not readily accessible to be used by computers.
- Dealing with huge collections of documents
7Two Mining Phases
- Knowledge Discovery: Extraction of codified information (features)
- Information Distillation: Analysis of the feature distribution
8How Text Mining Differs from Data Mining
9Comparison of Procedures
- Data Mining
- Identify data sets
- Select features
- Prepare data
- Analyze distribution
- Text Mining
- Identify documents
- Extract features
- Select features by algorithm
- Prepare data
- Analyze distribution
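The text mining procedure above can be sketched in a few lines. This is a minimal illustration, not the Intelligent Miner for Text implementation: the function names, the lowercase-word tokenizer, and the document-frequency selection rule are all assumptions made for the example.

```python
import re
from collections import Counter

def extract_features(text):
    """Extract features: lowercase word tokens (a stand-in for real feature extraction)."""
    return re.findall(r"[a-z]+", text.lower())

def select_features(docs_tokens, min_docs=2):
    """Select features by algorithm: keep terms found in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)

def prepare_data(docs_tokens, vocabulary):
    """Prepare data: one term-count vector per document over the selected vocabulary."""
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocabulary])
    return vectors

def analyze_distribution(vectors, vocabulary):
    """Analyze distribution: total count of each selected feature in the collection."""
    return {t: sum(v[i] for v in vectors) for i, t in enumerate(vocabulary)}

# Identify documents: a tiny illustrative collection.
documents = [
    "The credit line was extended by the commercial bank.",
    "A commercial bank issued commercial paper.",
    "The debtor country renegotiated its credit line.",
]
tokens = [extract_features(d) for d in documents]
vocabulary = select_features(tokens)
vectors = prepare_data(tokens, vocabulary)
totals = analyze_distribution(vectors, vocabulary)
print(vocabulary)
print(totals)
```

The extra "Extract features" and "Select features by algorithm" steps are what separate this from the data mining procedure, where the features are chosen by a person before any data is prepared.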
10IBM Intelligent Miner for Text
- SDK: Software Development Kit
- Contains necessary components for real text mining
- Also contains more traditional components
- IBM Text Search Engine
- IBM Web Crawler
- drop-in Intranet search solutions
11Mining Within a Document: Feature Extraction
12Feature Extraction
- To recognize and classify significant vocabulary items in unrestricted natural language texts.
- Let's see an example
13Example of Vocabulary found
- Certificate of deposit
- CMOs
- Commercial bank
- Commercial paper
- Commercial Union Assurance
- Commodity Futures Trading Commission
- Consul Restaurant
- Convertible bond
- Credit facility
- Credit line
- Debt security
- Debtor country
- Detroit Edison
- Digital Equipment
- Dollars of debt
- End-March
- Enserch
- Equity warrant
- Eurodollar
14Implementation of Feature Extraction relies on
- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information, such as part-of-speech information.
- Does not use huge amounts of lexicalized information
- Does not use in-depth syntactic and semantic analyses of texts
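A heuristic, pattern-matching extractor of the kind described above might look like the following sketch. The two regular expressions are assumptions chosen for illustration (they are not the rules used by Intelligent Miner for Text), and the sample sentence is made up from items in the vocabulary example.

```python
import re

# Assumed heuristic: a run of two or more capitalized words is a candidate name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")
# Assumed heuristic: two or more capital letters, optionally followed by a plural "s".
ABBREVIATION_PATTERN = re.compile(r"\b[A-Z]{2,}s?\b")

def extract_candidates(text):
    """Return candidate names and abbreviations found by surface patterns only."""
    return {
        "names": NAME_PATTERN.findall(text),
        "abbreviations": ABBREVIATION_PATTERN.findall(text),
    }

sample = "Detroit Edison and Digital Equipment sold CMOs through a commercial bank."
result = extract_candidates(sample)
print(result)
```

Because the patterns look only at the surface form of strings, they run quickly and do not depend on a domain lexicon, but they would also miss lowercase multiword terms such as "commercial bank", which need additional heuristics.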
15Goals of Feature Extraction
- Very fast processing to be able to deal with mass data
- Domain-independence for general applicability
16Extracted information categories
- Names of persons, organizations and places
- Multiword terms
- Abbreviations
- Relations
- Other useful stuff
17Canonical Forms
- Normalized forms of dates, numbers,
- Allows applications to use information very easily
- Abstracts from different morphological variants of a single term
18Canonical Names
Variants: President Bush, Mr. Bush, George Bush
Canonical Name: George Bush
- The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document
- Reduces ambiguity of variants
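One simple way to approximate "most explicit" is to drop titles and keep the variant with the most remaining name words. This is only a sketch under that assumption: the title list and the function names are hypothetical, and the actual Nominator heuristics are not reproduced here.

```python
# Assumed, illustrative list of titles that add no identifying information.
TITLES = {"president", "mr.", "mrs.", "ms.", "dr."}

def strip_titles(name):
    """Remove title words such as "President" or "Mr." from a name variant."""
    return " ".join(w for w in name.split() if w.lower() not in TITLES)

def canonical_name(variants):
    """Pick the variant with the most name words once titles are removed."""
    stripped = [strip_titles(v) for v in variants]
    return max(stripped, key=lambda n: len(n.split()))

variants = ["President Bush", "Mr. Bush", "George Bush"]
canonical = canonical_name(variants)
print(canonical)
```

The sketch assumes the variants have already been grouped as references to the same person; deciding which variants belong together is the harder part of the problem.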
19Disambiguating Proper Names: Nominator Program
20Principles of Nominator Design
- Apply heuristics to strings, instead of interpreting semantics.
- The unit of context for extraction is a document.
- The unit of context for aggregation is a corpus.
- The heuristics represent English naming conventions.
21Mining in Collections of Documents: Clustering and Categorization
221. Clustering
- Partitions a given collection into groups of documents similar in contents, i.e., in their feature vectors.
- Two clustering engines:
- Hierarchical Clustering tool
- Binary Relational Clustering tool
- Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group.
- Thus, provides overview of the contents of a collection of documents
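The idea of grouping documents by feature-vector similarity and then listing their shared terms can be sketched as follows. This is a generic average-link agglomerative procedure with cosine similarity, chosen for illustration; it is not the algorithm of either IBM clustering tool, and the vocabulary, vectors, and threshold are made-up values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_similarity(c1, c2, vectors):
    """Average-link similarity between two clusters of document indices."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def agglomerate(vectors, threshold=0.5):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        score, ia, ib = max(
            ((cluster_similarity(a, b, vectors), ia, ib)
             for ia, a in enumerate(clusters)
             for ib, b in enumerate(clusters) if ia < ib),
            key=lambda t: t[0],
        )
        if score < threshold:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters

def common_terms(cluster, vectors, vocabulary):
    """Terms that occur in every document of the cluster."""
    return [t for k, t in enumerate(vocabulary)
            if all(vectors[i][k] > 0 for i in cluster)]

vocabulary = ["bank", "credit", "bond", "warrant"]
vectors = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
clusters = agglomerate(vectors)
labels = [common_terms(c, vectors, vocabulary) for c in clusters]
print(clusters, labels)
```

Listing the terms common to each group is what gives a reader a quick overview of the collection without opening individual documents.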
23Groups documents similar in their feature vectors
242. Categorization
- Topic Categorization Tool
- Assign documents to preexisting categories (topics or themes)
- Categories are chosen to match the intended use of the collection
- Categories are defined by providing a set of sample documents for each category
252. Categorization (cont.)
- This training phase produces a special index, called the categorization schema
- The categorization tool returns a list of category names and confidence levels for each document
- If the confidence level is low, the document is put aside for a human categorizer
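The train-then-categorize flow above can be illustrated with a small profile-based classifier. This is a sketch under stated assumptions: the category names, sample texts, cosine scoring, and the 0.3 confidence threshold are all hypothetical, and the real categorization schema format is not described in these slides.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector for a text (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(samples):
    """Training phase: sum the sample vectors of each category into one profile."""
    schema = {}
    for category, texts in samples.items():
        profile = Counter()
        for text in texts:
            profile.update(vectorize(text))
        schema[category] = profile
    return schema

def categorize(text, schema, min_confidence=0.3):
    """Return (category, confidence) pairs, best first, plus a human-review flag."""
    vec = vectorize(text)
    ranked = sorted(((cosine(vec, p), c) for c, p in schema.items()), reverse=True)
    results = [(c, round(s, 2)) for s, c in ranked]
    needs_review = results[0][1] < min_confidence
    return results, needs_review

# Hypothetical sample documents for two made-up categories.
samples = {
    "billing": ["invoice charged twice", "wrong invoice amount"],
    "delivery": ["package arrived late", "package lost in transit"],
}
schema = train(samples)
results, needs_review = categorize("invoice amount is wrong", schema)
print(results, needs_review)
```

A document whose best score falls below the threshold is flagged rather than forced into a category, which mirrors the step of setting low-confidence documents aside for a human categorizer.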
262. Categorization (cont.)
- Effectiveness
- Tests have shown that the Topic Categorization
tool agrees with human categorizers to the same
degree as human categorizers agree with one
another.
27[Figure: a set of sample documents goes through the training phase, which produces the special index used to categorize new documents; the tool returns a list of category names and confidence levels for each document]
28Text Mining Applications
29Main Advantages of mining technology over traditional information broker business
- Ability to quickly process large amounts of textual data
- Objectivity and customizability
- Automation
30Applications used to
- Gain insights about trends, relations between people/places/organizations
- Classify and organize documents according to their content
- Organize repositories of document-related meta-information for search and retrieval
- Retrieve documents
31Main Applications
- Knowledge Discovery
- Information Distillation
32CRI: Customer Relationship Intelligence
- Appropriate documents selected
- Converted to common format
- Feature extraction and clustering tools are used to create a database
- User may select parameters for preprocessing and clustering step
- Clustering produces groups of feedback that share important linguistic elements
- Categorization tool used to assign new incoming feedback to identified categories.
33CRI (continued)
- Knowledge Discovery:
- Clustering used to create a structure that can be interpreted
- Information Distillation:
- Refinement and extension of the clustering results
- Interpreting the results
- Tuning of the clustering process
- Selecting meaningful clusters
34Exam Question 1
- Name an example of each of the two main classes of applications of text mining.
- Knowledge Discovery: Discovering a common customer complaint among much feedback.
- Information Distillation: Filtering future comments into pre-defined categories
35Exam Question 2
- How does the procedure for text mining differ from the procedure for data mining?
- Adds feature extraction function
- Not feasible to have humans select features
- Highly dimensional, sparsely populated feature vectors
36Exam Question 3
- In the Nominator program of IBM's Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
- Does not perform in-depth syntactic or semantic analyses of texts
37THE END
http://www-3.ibm.com/software/data/iminer/fortext/