Text Mining: Finding Nuggets in Mountains of Textual Data

Slides: 38

Provided by: jakeh3

Transcript and Presenter's Notes

1
Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Dörre, Peter Gerstl, and Roland Seiffert

2
Overview

Introduction to Mining Text
How Text Mining Differs from Data Mining
Mining Within a Document: Feature Extraction
Mining in Collections of Documents: Clustering and Categorization
Text Mining Applications
Exam Questions/Answers

3
Introduction to Mining Text
4
Reasons for Text Mining
5
Corporate Knowledge Ore

Email
Insurance claims
News articles
Web pages
Patent portfolios

Customer complaint letters
Contracts
Transcripts of phone calls with customers
Technical documents

6
Challenges in Text Mining

Information is in unstructured textual form.
Not readily accessible to be used by computers.
Dealing with huge collections of documents

7
Two Mining Phases

Knowledge Discovery: Extraction of codified information (features)
Information Distillation: Analysis of the feature distribution

8
How Text Mining Differs from Data Mining
9
Comparison of Procedures

Data Mining
Identify data sets
Select features
Prepare data
Analyze distribution

Text Mining
Identify documents
Extract features
Select features by algorithm
Prepare data
Analyze distribution

10
IBM Intelligent Miner for Text

SDK: Software Development Kit
Contains the components necessary for real text mining
Also contains more traditional components:
IBM Text Search Engine
IBM Web Crawler
Drop-in intranet search solutions

11
Mining Within a Document: Feature Extraction
12
Feature Extraction

Goal: to recognize and classify significant vocabulary items in unrestricted natural language texts.
Let's see an example.

13
Example of Vocabulary found

Certificate of deposit
CMOs
Commercial bank
Commercial paper
Commercial Union Assurance
Commodity Futures Trading Commission
Consul Restaurant
Convertible bond
Credit facility
Credit line

Debt security
Debtor country
Detroit Edison
Digital Equipment
Dollars of debt
End-March
Enserch
Equity warrant
Eurodollar

14
Implementation of Feature Extraction

Relies on:
Linguistically motivated heuristics
Pattern matching
Limited amounts of lexical information, such as part-of-speech information
Does not use:
Huge amounts of lexicalized information
In-depth syntactic and semantic analyses of texts
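
To make the heuristic, pattern-matching approach concrete, here is a minimal sketch in Python. The rule it uses (a run of two or more capitalized words is a candidate name) and the names CANDIDATE and extract_candidates are assumptions made for illustration; they are not the rules or API of Intelligent Miner for Text.

```python
import re

# Hypothetical heuristic: two or more consecutive capitalized words form a
# candidate multiword name. No lexicon and no syntactic analysis is used.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_candidates(text):
    """Return candidate multiword names in order of first appearance."""
    seen = []
    for match in CANDIDATE.finditer(text):
        term = match.group(0)
        if term not in seen:
            seen.append(term)
    return seen


if __name__ == "__main__":
    sample = "Shares of Detroit Edison rose while Digital Equipment fell."
    print(extract_candidates(sample))
```

A pattern this simple is fast and domain-independent, but it will also pick up sentence-initial words followed by a name, which is the kind of case additional heuristics would need to handle.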

15
Goals of Feature Extraction

Very fast processing to be able to deal with mass
data
Domain-independence for general applicability

16
Extracted information categories

Names of persons, organizations and places
Multiword terms
Abbreviations
Relations
Other useful stuff

17
Canonical Forms

Normalized forms of dates, numbers, etc.
Allow applications to use the information very easily
Abstract from different morphological variants of a single term
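
As an illustration of canonical forms, the sketch below maps several surface spellings of a date to one ISO 8601 string and strips thousands separators from numbers. The format list and the function names are assumptions chosen for the example, not taken from the presentation.

```python
from datetime import datetime

# Hypothetical list of accepted surface formats; a real system would
# need to cover many more variants.
DATE_FORMATS = ["%B %d, %Y", "%d %B %Y", "%m/%d/%Y", "%Y-%m-%d"]


def canonical_date(text):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def canonical_number(text):
    """Remove thousands separators and return the value as a float."""
    return float(text.replace(",", ""))


if __name__ == "__main__":
    for variant in ["March 5, 1999", "5 March 1999", "03/05/1999"]:
        print(variant, "->", canonical_date(variant))
```

Once every variant maps to the same string, an application can compare or index the values without knowing how they were written in the text.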

18
Canonical Names
Variants: President Bush, Mr. Bush, George Bush
Canonical Name: George Bush

The canonical name is the most explicit, least
ambiguous name constructed from the different
variants found in the document
Reduces ambiguity of variants
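
One simple way to pick the "most explicit" variant is sketched below: drop titles and keep the variant with the most remaining name tokens. The title list and the selection rule are assumptions for illustration, and the sketch presumes the variants have already been grouped as referring to the same person.

```python
# Hypothetical title list; real naming conventions are far richer.
TITLES = {"President", "Mr.", "Mrs.", "Ms.", "Dr."}


def strip_title(name):
    """Return the name tokens with any leading titles removed."""
    tokens = name.split()
    while tokens and tokens[0] in TITLES:
        tokens = tokens[1:]
    return tokens


def canonical_name(variants):
    """Pick the variant with the most non-title tokens (first one wins ties)."""
    best = max(variants, key=lambda v: len(strip_title(v)))
    return " ".join(strip_title(best))


if __name__ == "__main__":
    print(canonical_name(["President Bush", "Mr. Bush", "George Bush"]))
```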

19
Disambiguating Proper Names: Nominator Program
20
Principles of Nominator Design

Apply heuristics to strings, instead of
interpreting semantics.
The unit of context for extraction is a document.
The unit of context for aggregation is a corpus.
The heuristics represent English naming
conventions.

21
Mining in Collections of Documents: Clustering and Categorization
22
1. Clustering

Partitions a given collection into groups of documents similar in content, i.e., in their feature vectors.
Two clustering engines:
Hierarchical Clustering tool
Binary Relational Clustering tool
Both tools help to identify the topic of a group by listing terms or words that are common to the documents in the group.
This provides an overview of the contents of a collection of documents.
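
The idea of grouping documents by the similarity of their feature vectors can be sketched as follows. This is a generic single-link agglomerative clustering over bag-of-words vectors with cosine similarity, written for illustration only; it is not the algorithm of either IBM clustering tool, and the threshold value and function names are assumptions.

```python
import math
from collections import Counter


def vectorize(doc):
    """Bag-of-words feature vector: term -> count."""
    return Counter(doc.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-link agglomerative clustering; returns lists of document indices."""
    vectors = [vectorize(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # no remaining pair of clusters is similar enough
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def common_terms(docs, members):
    """Terms that occur in every document of a cluster, sorted alphabetically."""
    sets = [set(vectorize(docs[i])) for i in members]
    return sorted(set.intersection(*sets))


if __name__ == "__main__":
    docs = ["credit line approved", "credit line extended",
            "equity warrant issued", "equity warrant expired"]
    for members in cluster(docs):
        print(members, common_terms(docs, members))
```

Listing the terms shared by all members, as common_terms does, mirrors the way the tools described above label a group by its common vocabulary.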

23
Groups documents similar in their feature vectors
24
2. Categorization

Topic Categorization Tool
Assigns documents to preexisting categories (topics or themes)
Categories are chosen to match the intended use of the collection
Categories are defined by providing a set of sample documents for each category

25
2. Categorization (cont.)

The training phase produces a special index, called the categorization schema
The categorization tool returns a list of category names and confidence levels for each document
If the confidence level is low, the document is put aside for a human categorizer
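
The train-then-categorize flow, including the low-confidence fallback to a human, might look like the sketch below. The "schema" here is simply the summed term counts per category and the confidence is a cosine similarity; both are stand-ins chosen for illustration, not the actual index format or scoring of the Topic Categorization tool. The names train, categorize, and min_confidence are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(samples):
    """Build a schema {category: summed term counts} from sample documents."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        schema[category] = total
    return schema


def categorize(schema, doc, min_confidence=0.5):
    """Return (ranked list of (category, confidence), needs_human_review)."""
    vec = vectorize(doc)
    ranked = sorted(((c, cosine(vec, profile)) for c, profile in schema.items()),
                    key=lambda pair: pair[1], reverse=True)
    needs_review = not ranked or ranked[0][1] < min_confidence
    return ranked, needs_review


if __name__ == "__main__":
    schema = train({
        "billing": ["invoice charged twice", "wrong invoice amount"],
        "shipping": ["package arrived late", "package lost in transit"],
    })
    print(categorize(schema, "invoice amount wrong"))
    print(categorize(schema, "hello there"))
```

The threshold is the knob that trades automation against accuracy: raising it sends more documents to the human categorizer.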

26
2. Categorization (cont.)

Effectiveness
Tests have shown that the Topic Categorization
tool agrees with human categorizers to the same
degree as human categorizers agree with one
another.

27
Set of sample documents
Training phase
Returns list of category names and confidence
levels for each document
Special index used to categorize new documents
28
Text Mining Applications
29
Main Advantages of Mining Technology over Traditional Information Broker Business

Ability to quickly process large amounts of
textual data
Objectivity and customizability
Automation

30
Applications used to

Gain insights about trends, relations between
people/places/organizations
Classify and organize documents according to
their content
Organize repositories of document-related
meta-information for search and retrieval
Retrieve documents

31
Main Applications

Knowledge Discovery
Information Distillation

32
CRI: Customer Relationship Intelligence

Appropriate documents are selected
Documents are converted to a common format
Feature extraction and clustering tools are used to create a database
The user may select parameters for the preprocessing and clustering steps
Clustering produces groups of feedback that share important linguistic elements
The categorization tool is used to assign new incoming feedback to the identified categories

33
CRI (continued)

Knowledge Discovery:
Clustering is used to create a structure that can be interpreted
Information Distillation:
Refinement and extension of the clustering results
Interpreting the results
Tuning of the clustering process
Selecting meaningful clusters

34
Exam Question 1

Name an example of each of the two main classes of applications of text mining.
Knowledge Discovery: Discovering a common customer complaint in a large volume of feedback.
Information Distillation: Filtering future comments into predefined categories.

35
Exam Question 2

How does the procedure for text mining differ
from the procedure for data mining?
Adds feature extraction function
Not feasible to have humans select features
Highly dimensional, sparsely populated feature
vectors

36
Exam Question 3

In the Nominator program of IBM's Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
It does not perform in-depth syntactic or semantic analyses of texts.

37
THE END
http://www-3.ibm.com/software/data/iminer/fortext/
