Text - PowerPoint PPT Presentation


Provided by: sigurduro5


Transcript and Presenter's Notes

Title: Text

1
Text & Web Mining
2
Structured Data

So far we have focused on mining from structured data

Attribute = Value, Attribute = Value, Attribute = Value, ..., Attribute = Value
Outlook = Sunny, Temperature = Hot, Windy = Yes, Humidity = High, Play = Yes
Most data mining involves such data
3
Complex Data Types

Increased importance of complex data
Spatial data: includes geographic data and medical & satellite images
Multimedia data: images, audio, & video
Time-series data: for example banking data and stock exchange data
Text data: word descriptions for objects
World-Wide-Web: highly unstructured text and multimedia data

4
Text Databases

Many text databases exist in practice
News articles
Research papers
Books
Digital libraries
E-mail messages
Web pages
Growing rapidly in size and importance

5
Semi-Structured Data

Text databases are often semi-structured
Example
Title
Author
Publication_Date
Length
Category
Abstract
Content

6
Handling Text Data

Modeling semi-structured data
Information Retrieval (IR) from unstructured documents
Text mining
Compare documents
Rank importance & relevance
Find patterns or trends across documents

7
Information Retrieval

IR locates relevant documents
Key words
Similar documents
IR Systems
On-line library catalogs
On-line document management systems

8
Performance Measure

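Two basic measures: precision and recall

$\text{precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$
$\text{recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$

[Diagram: within the set of all documents, the retrieved documents and the relevant documents overlap in the relevant & retrieved documents]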
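
A minimal sketch of the two basic IR measures, precision and recall, computed from a set of retrieved document ids and a set of relevant document ids. The function name and the sample ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # the "relevant & retrieved" overlap
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d3", "d7", "d8", "d9"}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.75 0.5
```

Precision penalizes returning irrelevant documents, while recall penalizes missing relevant ones, so the two are usually reported together.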
9
Retrieval Methods

Keyword-based IR
E.g., "data" and "mining"
Synonymy problem: a document may talk about "knowledge discovery" instead
Polysemy problem: "mining" can mean different things
Similarity-based IR
Set of common keywords
Return the degree of relevance
Problem: what is the similarity of "data mining" and "data analysis"?

10
Modeling a Document

Set of n documents and m terms
Each document is a vector v in $\mathbb{R}^m$
The j-th coordinate of v measures the association of the j-th term with the document, e.g. the relative term frequency $v_j = r / R$
Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term in the document.
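
A small sketch of the document model, assuming the j-th coordinate is the relative term frequency r / R and that R counts every whitespace-separated word in the document. Both choices are assumptions made for illustration, and `term_vector` and the sample text are hypothetical.

```python
from collections import Counter


def term_vector(text, terms):
    """Relative-frequency vector of `text` over the ordered list `terms`."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)  # R: occurrences of any term in the document
    # counts[t] is r, the number of occurrences of term t (0 if absent)
    return [counts[t] / total if total else 0.0 for t in terms]


terms = ["data", "mining", "web"]
doc = "web mining is data mining on web data and web logs"
v = term_vector(doc, terms)
print(v)  # one coordinate per term, in the order of `terms`
```

Stacking one such vector per document gives the frequency matrix, with one row per document and one column per term.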

11
Frequency Matrix
12
Similarity Measures
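Dot product: $v_1 \cdot v_2 = \sum_{j=1}^{m} v_{1j} v_{2j}$

Cosine measure: $\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$

Norm of the vectors: $\|v\| = \sqrt{v \cdot v}$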
13
Example

Google search for "association mining"
Two of the documents retrieved:
"Idaho Mining Association: mining in Idaho" (doc 1)
"Scalable Algorithms for Association mining" (doc 2)
Using only the two terms
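
The comparison of the two documents can be sketched with the cosine measure. This sketch assumes raw term counts taken from the two titles only, so the numbers may differ from those on the original slide. The helper names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine measure: dot product divided by the product of the norms."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def count_vector(title, terms):
    """Raw occurrence counts of each term in a title (case-insensitive)."""
    words = title.lower().split()
    return [words.count(t) for t in terms]


terms = ["association", "mining"]
doc1 = count_vector("Idaho Mining Association mining in Idaho", terms)
doc2 = count_vector("Scalable Algorithms for Association mining", terms)
sim = cosine(doc1, doc2)
print(doc1, doc2, round(sim, 3))  # [1, 2] [1, 1] 0.949
```

With only these two terms the documents look almost identical even though one is about the mining industry and the other about data mining, which is the polysemy problem in action.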

14
New Model

Add the term "data" to the document model

15
Frequency Matrix
Will quickly become large
16
Association Analysis

Collect sets of keywords frequently used together
and find associations among them
Apply any association rule algorithm to a
database in the format
document_id, a_set_of_keywords
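
As a rough illustration of working on a database in this format, the sketch below counts keyword pairs that co-occur in at least a minimum number of documents. It covers only the frequent-pair step, not a full association rule algorithm, and the function name, sample database, and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(db, min_support):
    """Keep keyword pairs that appear together in >= min_support documents."""
    counts = Counter()
    for _doc_id, keywords in db:
        # sorted() gives each pair a canonical order so (a, b) == (b, a)
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical database of (document_id, a_set_of_keywords) records
db = [
    (1, {"data", "mining", "association"}),
    (2, {"data", "mining", "web"}),
    (3, {"web", "search"}),
    (4, {"data", "mining"}),
]
pairs = frequent_keyword_pairs(db, min_support=2)
print(pairs)  # {('data', 'mining'): 3}
```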

17
Document Classification

Need already classified documents as training set
Induce a classification model
Any difference from before?

A set of keywords associated with a document has
no fixed set of attributes or dimensions
18
Association-Based Classification

Classify documents based on associated,
frequently occurring text patterns
Extract keywords and terms with IR and simple
association analysis
Create a concept hierarchy of terms
Classify training documents into class
hierarchies
Use association mining to discover associated
terms to distinguish one class from another

19
Remember: Generalized Association Rules
Taxonomy:
Clothes
  Outerwear
    Jackets
    Ski Pants
  Shirts
Footwear (ancestor of Shoes and Hiking Boots)
  Shoes
  Hiking Boots
Generalized association rule $X \Rightarrow Y$, where no item in Y is an ancestor of an item in X
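
A small sketch of the ancestor condition on generalized association rules. The parent links below assume the usual grouping of this taxonomy (Jackets and Ski Pants under Outerwear, Outerwear and Shirts under Clothes, Shoes and Hiking Boots under Footwear), and the function names are hypothetical.

```python
# Assumed child -> parent links of the taxonomy
PARENT = {
    "Outerwear": "Clothes",
    "Shirts": "Clothes",
    "Jackets": "Outerwear",
    "Ski Pants": "Outerwear",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}


def ancestors(item):
    """All ancestors of an item, found by walking up the parent links."""
    result = set()
    while item in PARENT:
        item = PARENT[item]
        result.add(item)
    return result


def is_valid_generalized_rule(x, y):
    """True if no item in Y is an ancestor of an item in X."""
    anc_of_x = set().union(*(ancestors(i) for i in x))
    return not (set(y) & anc_of_x)


print(is_valid_generalized_rule({"Hiking Boots"}, {"Outerwear"}))  # True
print(is_valid_generalized_rule({"Jackets"}, {"Clothes"}))         # False
```

The second rule is rejected because it is trivially true: every transaction containing Jackets also contains its ancestor Clothes.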
20
Classifiers

Let X be a set of terms
Let Anc(X) be those terms and their ancestor terms
Consider a rule $X \Rightarrow C$ and document d
If $X \subseteq \text{Anc}(d)$ then $X \Rightarrow C$ covers d
A rule that covers d may be used to classify d
(but only one can be used)
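
The covering test can be sketched as a subset check against the document's terms expanded with their ancestors. The concept hierarchy and the document below are hypothetical, and `anc` and `covers` are illustrative names.

```python
def anc(terms, parent):
    """The terms themselves plus all of their ancestor terms."""
    closure = set(terms)
    for t in terms:
        while t in parent:
            t = parent[t]
            closure.add(t)
    return closure


def covers(x, d_terms, parent):
    """A rule with term set X covers document d when X is a subset of Anc(d)."""
    return set(x) <= anc(d_terms, parent)


# Hypothetical concept hierarchy (child -> parent) and document terms
parent = {"apriori": "association mining", "association mining": "data mining"}
d = {"apriori", "support"}

print(covers({"data mining"}, d, parent))                    # True
print(covers({"support", "association mining"}, d, parent))  # True
print(covers({"web mining"}, d, parent))                     # False
```

Because the check uses Anc(d), a rule phrased in general terms such as "data mining" still covers a document that mentions only a specific descendant such as "apriori".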

21
Procedure

Step 1: Generate all generalized association rules $X \Rightarrow C$, where X is a set of terms and C is a class, that satisfy minimum support.
Step 2: Rank the rules according to some rule ranking criterion
Step 3: Select rules from the list
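
Steps 2 and 3 can be sketched as follows, assuming the rules from Step 1 are already available with precomputed support and confidence. Ranking by confidence and then support is only one possible criterion, picked here as an assumption. The rules, field names, and functions are hypothetical.

```python
def rank_rules(rules):
    """Sort rules by confidence, then support, highest first (assumed criterion)."""
    return sorted(rules, key=lambda r: (r["confidence"], r["support"]), reverse=True)


def classify(doc_terms, ranked_rules, default=None):
    """Use only the first (highest-ranked) rule whose term set covers the document.

    `doc_terms` is expected to already include ancestor terms, i.e. Anc(d).
    """
    for rule in ranked_rules:
        if rule["terms"] <= doc_terms:
            return rule["cls"]
    return default


# Hypothetical rules X => C with made-up support and confidence values
rules = [
    {"terms": frozenset({"mining"}), "cls": "industry",
     "support": 0.30, "confidence": 0.55},
    {"terms": frozenset({"mining", "association"}), "cls": "data mining",
     "support": 0.10, "confidence": 0.90},
    {"terms": frozenset({"web"}), "cls": "web",
     "support": 0.20, "confidence": 0.80},
]
ranked = rank_rules(rules)

print(classify({"association", "mining", "algorithms"}, ranked))  # data mining
print(classify({"mining", "idaho"}, ranked))                      # industry
```

Several rules may cover the same document, so the ranking decides which single rule is used.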

22
Web Mining

The World Wide Web may have more opportunities
for data mining than any other area
However, there are serious challenges
It is too huge
Complexity of Web pages is greater than any
traditional text document collection
It is highly dynamic
It has a broad diversity of users
Only a tiny portion of the information is truly
useful

23
Search Engines → Web Mining

Current technology: search engines
Keyword-based indices
Too many relevant pages
Synonymy and polysemy problems
More challenging: web mining
Web content mining
Web structure mining
Web usage mining

24
Web Content Mining
25
Example: Classification of Web Documents

Assign a class to each document based on predefined topic categories
E.g., use Yahoo!'s taxonomy and associated documents for training
Keyword-based document classification
Keyword-based association analysis

26
Web Structure Mining
27
Authoritative Web Pages

High-quality, relevant Web pages are termed authoritative
Explore linkages (hyperlinks)
Linking to a Web page can be considered an endorsement of that page
Pages that are linked to frequently are considered authoritative
(This has its roots in IR methods based on journal citations)

28
Structure via Hubs

A hub is a set of Web pages containing
collections of links to authorities
There is a wide variety of hubs
Simple list of recommended links on a person's home page
Professional resource lists on commercial sites

29
HITS

Hyperlink-Induced Topic Search (HITS)
Form a root set of pages using the query terms in
an index-based search (200 pages)
Expand into a base set by including all pages the
root set links to (1000-5000 pages)
Go into an iterative process to determine hubs
and authorities

30
Calculating Weights

Authority weight: $a_p = \sum_{q \to p} h_q$
Hub weight: $h_p = \sum_{p \to q} a_q$

Here $q \to p$ means page p is pointed to by page q
31
Adjacency Matrix

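Let's number the pages 1, 2, ..., n
The adjacency matrix A is defined by
$A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise
By writing the authority and hub weights as vectors a and h we have
$a = A^T h, \quad h = A a$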

32
Recursive Calculations

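We now have $a_k = A^T A \, a_{k-1}$ and $h_k = A A^T \, h_{k-1}$, where A is the adjacency matrix
By linear algebra theory this converges (after normalization) to the principal eigenvectors of the two matrices $A^T A$ and $A A^T$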
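
The iterative computation of hub and authority weights can be sketched in plain Python. The sketch assumes an adjacency matrix with `adj[i][j] == 1` when page i links to page j, and it normalizes the weights in every round so they stay bounded. The normalization step, the iteration count, and the 4-page graph are assumptions made for illustration.

```python
import math


def hits(adj, iterations=50):
    """adj[i][j] == 1 when page i links to page j. Returns (authority, hub)."""
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # authority of p: sum of hub weights of the pages q that point to p
        auth = [sum(hub[q] for q in range(n) if adj[q][p]) for p in range(n)]
        # hub weight of p: sum of authority weights of the pages q that p points to
        hub = [sum(auth[q] for q in range(n) if adj[p][q]) for p in range(n)]
        # normalize to unit length so the weights do not grow without bound
        a_norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        auth = [a / a_norm for a in auth]
        hub = [h / h_norm for h in hub]
    return auth, hub


# Hypothetical graph: pages 0 and 1 both link to pages 2 and 3; page 2 links to page 3
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
auth, hub = hits(adj)
print([round(a, 3) for a in auth])
print([round(h, 3) for h in hub])
```

In this graph page 3 receives the most links and ends up with the highest authority weight, while pages 0 and 1 link to both authorities and share the highest hub weight.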

33
Output

The HITS algorithm finally outputs
Short list of pages with high hub weights
Short list of pages with high authority weights
Have not accounted for context

34
Applications

The Clever Project at IBM's Almaden Labs
Developed the HITS algorithm
Google
Developed at Stanford
Uses algorithms similar to HITS (PageRank)
On-line version

35
Web Usage Mining
36
Complex Data Types Summary

Emerging areas of mining complex data types
Text mining can be done quite effectively,
especially if the documents are semi-structured
Web mining is more difficult due to lack of such
structure
Data includes text documents, hypertext
documents, link structure, and logs
Need to rely on unsupervised learning, sometimes
followed up with supervised learning such as
classification